Meaningful unsupervised learning: from labelling to 3D geometry

Andrea Vedaldi

University of Oxford, UK

Abstract

Learning without explicit supervision is one of the most important and exciting areas of research in machine learning and computer vision. In this lecture, I will review techniques such as contrastive learning that, in the past few years, have established themselves as gold-standard approaches for learning image features in an unsupervised manner. I will then discuss the difference between learning features and learning "meaning", defined as an interpretation of the data that a human can understand without the need for supervised translation. I will show that strong unsupervised representations do lead to strong semantic clustering and segmentation results, and suggest how this can be the basis for measuring the "semantic content" of a representation. However, I will also argue that looking at images as mere 2D patterns can only lead to crude interpretations. I will then discuss the necessity of rooting image understanding in 3D geometry. Understanding geometry makes image data analysis much more parsimonious and bases it on an intuitive language that is actionable and interpretable. I will demonstrate how advances such as neural rendering can be used to decompose images into objects, including in challenging egocentric videos, and how highly deformable objects can be learned from casually recorded videos. I will also discuss new benchmarks, such as the Common Objects in 3D challenge, that promote progress in this area.