Fourteenth International Workshop on Assistive Computer Vision and Robotics

8th September 2026 - AM, Malmö (Sweden)

Dima Damen, University of Bristol, UK https://dimadamen.github.io/

Dima Damen is a Professor of Computer Vision at the University of Bristol and Senior Research Scientist at Google DeepMind. Dima is currently an EPSRC Fellow (2020-2026), focusing her research on the automatic understanding of object interactions, actions and activities using wearable visual (and depth) sensors. She is best known for her leading work in Egocentric Vision, and has also contributed to novel research questions including mono-to-3D, video object segmentation, assessing action completion, domain adaptation, skill/expertise determination from video sequences, discovering task-relevant objects, dual-domain and dual-time learning as well as multi-modal fusion using vision, audio and language. She is the project lead for EPIC-KITCHENS, the seminal dataset in egocentric vision, with accompanying open challenges and follow-up works: EPIC-Sounds, VISOR and EPIC Fields, as well as the recent HD-EPIC. She is part of the large-scale consortium efforts Ego4D and Ego-Exo4D. At the University of Bristol, Dima leads the Machine Learning and Computer Vision (MaVi) lab. At Google DeepMind, Dima is part of the Vision team, led by Andrew Zisserman, focusing on video understanding research.

Title: What do we need to model about the human to be truly assistive?

Wearable devices are becoming increasingly prevalent, yet the promise of these devices being truly assistive has yet to materialise. For a model to be genuinely assistive, it must treat egocentric video as more than just a 2D stream; it must view these videos as partial observations of a dynamic 3D world where objects and environments remain "out of sight but not out of mind." In this talk, I will argue that truly multimodal assistive models require forecasting human behavior from audio-visual feeds and gaze-priming motions. I will review our recent work on HD-EPIC, which utilises new data collection and digital twin annotations to merge video understanding with 3D modeling. Using this data, I will showcase the current failures of Vision-Language Models (VLMs) in understanding perspectives outside the camera’s field of view, a task that is trivial for humans but remains a bottleneck for AI. I will also review works that attempt to forecast prime-and-reach motion and track multiple objects around a 3D environment, towards 4D dynamic modelling of the human and their surroundings.

Katerina Fragkiadaki, Carnegie Mellon University, US https://www.cs.cmu.edu/~katef/

Katerina is the JPMorgan Chase Associate Professor of Computer Science in the Machine Learning Department at Carnegie Mellon University. She works in Artificial Intelligence at the intersection of Computer Vision, Machine Learning, Language Understanding and Robotics. Prior to joining MLD's faculty, she spent three wonderful years as a postdoctoral researcher, first at UC Berkeley working with Jitendra Malik and then at Google Research in Mountain View working with the video group. She completed her Ph.D. in GRASP, UPenn with Jianbo Shi. She did her undergraduate studies at the National Technical University of Athens and before that she was in Crete.

Title: Human-Robot Interaction and Collaboration through Sim-to-Real Learning and Guided Generative Planning

This talk explores recent advances in human-humanoid interaction and collaboration, spanning learning from human demonstrations to deployment of robust policies on real humanoid robots. We will present methods for learning interactive behaviors that enable humanoid robots to coordinate with human partners in dynamic environments. We will also discuss how language instructions can be grounded and executed on the fly through guided generative planning, using diffusion-based models that reason over future behaviors and environmental outcomes. By leveraging both differentiable and non-differentiable reward signals derived from natural language, these methods enable robots to adapt their actions in real time while maintaining physical feasibility and task consistency.