ICVSS: From Perception to Action

Learning to Understand Video Through Language

Lorenzo Torresani

Facebook AI Research (FAIR), Meta, USA

Abstract

This lecture will survey recent approaches for training video understanding models from loosely associated language descriptions, such as automatic speech transcriptions, narrations, text summaries, and wiki articles. These forms of supervision offer several advantages over traditional action labels in trimmed videos: 1) They are obtainable at large scale at little or no cost, which makes it possible to train video models on collections of unprecedented size. 2) They are available for long-form sequences, such as instructional or how-to videos, thereby enabling the modeling of the long-term dependencies necessary to capture high-level semantics, e.g., intent or procedural goals. 3) They are expressed in free-form text, which can be used to learn joint video-language embeddings for multimodal applications beyond classic action recognition, e.g., video captioning, text-to-video retrieval, and video Q&A. However, compared to manually annotated labels in short clips, these signals also pose a variety of new challenges, including noise, weak supervision, and temporal misalignment. For example, narrations obtained from speech are marred by transcription mistakes, may discuss something that is not visually demonstrated, or may describe an action before or after it is actually shown in the video. Similarly, a textual summary of a long video does not provide the temporal grounding necessary for localization. I will review strategies and techniques to cope with these noisy and weak supervisory signals. Finally, I will discuss the applications of these video-language models to a variety of traditional computer vision problems and to novel, groundbreaking technologies.
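
To give a concrete flavor of the joint video-language embeddings mentioned above, the sketch below shows a minimal contrastive (InfoNCE-style) training objective that pulls matched clip and narration embeddings together while pushing apart mismatched pairs. This is an illustrative assumption, not the specific method covered in the lecture; the encoder outputs, embedding dimension, and temperature are placeholder choices.

```python
# Minimal sketch (illustrative, not the lecture's actual method): a symmetric
# InfoNCE loss for learning a joint video-language embedding from paired
# clip and narration features. Dimensions and temperature are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(video_feats, text_feats, temperature=0.07):
    """video_feats: (B, D) clip embeddings; text_feats: (B, D) narration embeddings.
    Matched pairs share the same batch index; all other pairs act as negatives."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature                      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)  # diagonal = positives
    # Symmetric loss: video-to-text and text-to-video retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Example usage with random tensors standing in for encoder outputs.
B, D = 32, 512
loss = contrastive_loss(torch.randn(B, D), torch.randn(B, D))
```

Extensions of this basic objective, such as treating multiple temporally adjacent narrations as candidate positives, are one way to cope with the temporal misalignment between speech and the visual content it describes.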