ICVSS: From Perception to Action

From Videos to 4D Worlds and Beyond

Angjoo Kanazawa

University of California at Berkeley, USA

Abstract

The world underlying images and videos is 3-dimensional and dynamic, i.e., 4D, with people interacting with each other, with objects, and with the underlying scene. Even in videos of a static scene, the camera itself moves through the 4D world. Accurately recovering this information is essential for building systems that can reason about and interact with the underlying scene, and it has immediate applications in visual effects and in the creation of immersive digital worlds. However, disentangling this 4D world from a video is a particularly ill-posed inverse problem, rife with fundamental ambiguities. In this short lecture, I will cover key techniques in 3D human mesh recovery and 3D tracking, which are necessary for 4D (space and time) analysis of people in videos in-the-wild. I will trace the developments leading to our latest technique for perceiving 4D human motion in video, which disentangles the camera motion from the human motion in challenging in-the-wild videos with multiple people.

I will use the other hour to shed light on the development of Neural Radiance Fields, and discuss it in the context of nerf.studio, a modular open-source framework for easily creating photorealistic 3D scenes and accelerating NeRF development. I will conclude with our recent works, which show how language can be incorporated to edit and interact with the recovered 3D scenes.
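To make the camera/human disentangling concrete, the sketch below shows only the coordinate bookkeeping involved, not the estimation method covered in the lecture: it assumes per-frame camera-to-world poses (R, t) and camera-frame joint estimates are already available, whereas the techniques discussed recover these jointly from video. The function name to_world and the toy setup are illustrative.

import numpy as np

def to_world(joints_cam, R_c2w, t_c2w):
    """Map per-frame 3D joints from camera coordinates to a shared world frame.

    joints_cam: (T, J, 3) joints in each frame's camera coordinates
    R_c2w:      (T, 3, 3) camera-to-world rotations
    t_c2w:      (T, 3)    camera-to-world translations
    """
    # world = R_c2w @ cam + t_c2w, applied per frame and per joint
    return np.einsum('tij,tkj->tki', R_c2w, joints_cam) + t_c2w[:, None, :]

# Toy example: a static person filmed by a translating camera. In camera
# coordinates the person appears to move; in the world frame they do not.
T, J = 5, 2
t_c2w = np.stack([np.array([x, 0.0, 0.0]) for x in np.linspace(0.0, 2.0, T)])
R_c2w = np.tile(np.eye(3), (T, 1, 1))
world_pts = np.zeros((T, J, 3))
world_pts[:, 1, 1] = 1.7  # head 1.7 m above the root joint

# What a per-frame estimator sees: world points expressed in camera coordinates
joints_cam = np.einsum('tji,tkj->tki', R_c2w, world_pts - t_c2w[:, None, :])
print(np.allclose(to_world(joints_cam, R_c2w, t_c2w), world_pts))  # True

Without the camera poses, the apparent motion of joints_cam conflates the two sources of motion; this ambiguity is what makes the problem ill-posed.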
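For the second hour's topic, the following is a minimal sketch of the standard NeRF volume-rendering quadrature (Mildenhall et al., 2020) that frameworks such as nerf.studio build on. The per-sample densities and colors would normally come from a learned network; here they are synthetic.

import numpy as np

def volume_render(sigmas, colors, deltas):
    """Composite samples along one ray into a pixel color.

    sigmas: (N,)   non-negative volume densities at N samples
    colors: (N, 3) RGB predicted at each sample
    deltas: (N,)   distances between adjacent samples
    """
    # Opacity of each segment: alpha_i = 1 - exp(-sigma_i * delta_i)
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: probability the ray reaches sample i unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas               # contribution of each sample
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights

# Toy usage: a ray passing through a soft "wall" of red density mid-way
N = 64
ts = np.linspace(0.0, 4.0, N)
sigmas = 5.0 * np.exp(-((ts - 2.0) ** 2) / 0.1)
colors = np.tile([1.0, 0.2, 0.2], (N, 1))
deltas = np.full(N, ts[1] - ts[0])
rgb, w = volume_render(sigmas, colors, deltas)
print(rgb)  # predominantly red, attenuated by the remaining transmittance

Because this compositing is differentiable in sigmas and colors, the scene representation can be optimized directly from posed photographs, which is the core idea behind NeRF.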