Learning to Perceive the 3D World from 2D Images

Vincent Sitzmann

Massachusetts Institute of Technology, USA

Abstract

From a single picture, humans reconstruct a mental representation of the underlying 3D scene that is remarkably rich in information: shape, appearance, physical properties, purpose, and even how things would feel, smell, or sound. These mental representations allow us to understand, navigate, and interact with our environment in everyday life. We learn this with little supervision, mainly by interacting with and observing the world around us.

In this session, I will cover the basics of models that similarly aim to learn the underlying 3D structure of our world given only 2D observations. In these models, an inference module first reconstructs a feature representation of the underlying 3D scene, called a neural scene representation, from one or a few 2D images. A differentiable renderer then maps that neural scene representation back to images. By equipping the model with the appropriate structure, we can train it end-to-end on a dataset of images alone, while recovering rich 3D representations of the underlying scenes. We will consider current state-of-the-art approaches and investigate the relationship and the gap between 3D representation learning and 2D representation learning. I will highlight applications of such models across graphics, vision, and robotics, and discuss which applications may benefit most from this approach going forward. Finally, we will discuss key remaining challenges in this research area.
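The render-and-compare idea described above can be illustrated with a deliberately tiny sketch, one dimension lower than the real problem. It is an assumption-laden toy, not a model from the session: the "scene" is a small 2D grid, the "images" are 1D orthographic projections (sums along rows or columns), and the scene is recovered by gradient descent on the reprojection error. The inference module is omitted and the scene values are optimized directly. All names (`render`, `loss_and_grad`, `fit_scene`) are hypothetical.

```python
# Toy sketch (illustrative assumption): recover a 2D "scene" from 1D "images"
# by gradient descent through a differentiable (here: linear) renderer.

def render(scene, axis):
    """Orthographic 'renderer': axis=0 sums each column, axis=1 sums each row."""
    n = len(scene)
    if axis == 0:
        return [sum(scene[r][c] for r in range(n)) for c in range(n)]
    return [sum(scene[r][c] for c in range(n)) for r in range(n)]


def loss_and_grad(scene, observations):
    """Squared reprojection error and its analytic gradient w.r.t. every cell."""
    n = len(scene)
    grad = [[0.0] * n for _ in range(n)]
    loss = 0.0
    for axis, target in observations:
        pred = render(scene, axis)
        for i in range(n):
            residual = pred[i] - target[i]
            loss += residual ** 2
            for j in range(n):
                # Pixel i sees column i (axis=0) or row i (axis=1).
                r, c = (j, i) if axis == 0 else (i, j)
                grad[r][c] += 2.0 * residual
    return loss, grad


def fit_scene(observations, n, steps=200, lr=0.05):
    """Optimize the scene so that its renderings match the observed images."""
    scene = [[0.0] * n for _ in range(n)]
    for _ in range(steps):
        _, grad = loss_and_grad(scene, observations)
        for r in range(n):
            for c in range(n):
                scene[r][c] -= lr * grad[r][c]
    final_loss, _ = loss_and_grad(scene, observations)
    return scene, final_loss


# Usage: observe a hidden scene only through its two projections, then fit.
true_scene = [[0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0]]
observations = [(0, render(true_scene, 0)), (1, render(true_scene, 1))]
recovered, final_loss = fit_scene(observations, n=3)
```

The recovered grid reproduces both projections, but it need not equal `true_scene`, because two projections do not determine the grid uniquely. In this toy, the only supervision is the rendered observations themselves.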