Self-supervised Learning of Visual Representations for Perception and Action
Abhinav Gupta
Carnegie Mellon University, US
Abstract
In this talk, I will discuss how to learn representations for perception and action without any manual supervision. First, I will discuss how ConvNets for vision can be learned in a completely unsupervised manner using auxiliary tasks, and how different forms of signal already present in the data can act as supervision: context, time, audio, stereo, etc. Next, I will talk about how a robot can physically explore the world and learn visual representations for classification and recognition tasks. Finally, I will talk about end-to-end learning of actions using self-supervision, starting with grasping and then exploring other tasks such as poking, and we will see how this paradigm can be scaled up.
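To make the auxiliary-task idea concrete, below is a minimal, illustrative sketch (not taken from the talk) of one such pretext task: a small ConvNet embeds two image patches and a classifier predicts their relative spatial position, so the "labels" come for free from unlabeled images. All module names, architecture choices, and hyper-parameters here are assumptions for illustration only.

```python
# Illustrative sketch of a context-prediction pretext task (assumed details).
import torch
import torch.nn as nn

class PatchEncoder(nn.Module):
    """Shared ConvNet mapping a 96x96 RGB patch to a feature vector."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, dim),
        )

    def forward(self, x):
        return self.net(x)

class ContextPredictor(nn.Module):
    """Predicts which of 8 neighbouring positions patch B occupies relative to patch A."""
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = PatchEncoder(dim)
        self.classifier = nn.Linear(2 * dim, 8)

    def forward(self, patch_a, patch_b):
        feats = torch.cat([self.encoder(patch_a), self.encoder(patch_b)], dim=1)
        return self.classifier(feats)

# Toy training step on random tensors; in practice the patch pairs and their
# relative-position labels are extracted automatically from unlabeled images.
model = ContextPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
patch_a = torch.randn(16, 3, 96, 96)
patch_b = torch.randn(16, 3, 96, 96)
labels = torch.randint(0, 8, (16,))
loss = nn.CrossEntropyLoss()(model(patch_a, patch_b), labels)
loss.backward()
optimizer.step()
```

After pre-training on such a task, the encoder's features can be reused for downstream classification or recognition, which is the sense in which the pretext signal substitutes for manual labels.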