ICVSS Computer Vision: a Renaissance

Video Understanding - an Egocentric Perspective

Dima Damen

University of Bristol, UK

Abstract

This talk argues for a fine(r)-grained perspective on video understanding, using unscripted sequences captured from an egocentric perspective (i.e. first-person footage). Unlike YouTube-sourced datasets (e.g. Kinetics) or sports-based ones (e.g. UCF), egocentric footage captures the subtle hand-object interactions that make up everyone's daily routines. I will cover the motivation for capturing long, untrimmed, multimodal videos (e.g. EPIC-KITCHENS, EGO4D), and the opportunities and challenges these bring to video understanding. I will then present approaches for multimodal fusion in egocentric videos using vision, audio and language [CVPR 2021, CVPR 2020, ICCV 2019, ICASSP 2021] for the tasks of recognition, retrieval and segmentation. Importantly, I will argue for the need to tackle new tasks (e.g. skill understanding - CVPR 2019, CVPR 2020), new forms of supervision (e.g. single timestamps - CVPR 2019) and new models (e.g. UnweaveNet - CVPR 2022) as we attempt to learn from longer unscripted videos.