ICVSS Computer Vision - Where are we?

Learning Representations: From Shannon to Fisher to Bayes to Kolmogorov via Deep Networks and the Implications for Visual Information Processing in Biology and in the Cloud

Stefano Soatto

Amazon and University of California Los Angeles, USA

Abstract

Representations are functions of the data that are "useful" for a task. Of all functions, one wishes to design or learn those that contain all the "information" in the data, and none of the variability that is irrelevant to the task. Depending on how one defines and measures "useful" and "information", different notions of representation can be instantiated. What are the relationships among them? Are there common principles behind the different tools and models? Is there a common notion of "optimality" that emerges from all formalisms? If so, are such optimal representations computable? If not, can they be approximated? If such representations are learned from "past data" (the training set), can we predict how well they will perform on "future data" (the test set)? These questions have nothing to do with Deep Learning, but understanding them sets the stage for the second part of the lecture. In Deep Learning, we are given a training set and minimize a loss function that, at least at face value, knows nothing about "future data". Just as the activations of a network in response to a test datum can be understood as a representation of future data, so the parameters (weights) of the network can be understood as a representation of the past training set. What properties of the weights can be optimized during training to ensure that desirable properties of the activations emerge? Is there something special about deep neural networks that addresses this issue of generalization? Do these properties translate into a variational principle? Does this principle have anything to do with optimality of representations? Can such a principle be built into the optimization we use to train deep networks?
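As one standard formalization of these notions (offered here as an illustration; the definitions developed in the lecture may differ), a representation z = f(x) of data x for a task y is called sufficient if it retains all task-relevant information, and minimal if, among sufficient representations, it retains the least information about the data:

```latex
% Sufficiency and minimality of a representation z = f(x) for a task y,
% measured via Shannon mutual information I(.;.).
\begin{aligned}
  \text{sufficiency:} \quad & I(z; y) = I(x; y) \\
  \text{minimality:}  \quad & \min_{f}\; I(x; z) \quad \text{subject to } z = f(x) \text{ being sufficient}
\end{aligned}
```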

In the first part of this lecture we will derive a theory of representation that is the first to address these questions for deep learning. The question the theory answers is: "What functions of the given (past) data can one compute, so that the resulting representation of future data is best for the task at hand?" What it does not address is what happens when the task is not completely specified beforehand.
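The constrained minimality problem sketched earlier is generally intractable; a common relaxation (again, an illustrative sketch rather than necessarily the lecture's formulation) is an Information Bottleneck Lagrangian, trading off residual task error against the information retained about the data:

```latex
% Information Bottleneck Lagrangian: H(y|z) measures residual task
% uncertainty; beta * I(x; z) penalizes information about nuisances.
\mathcal{L}(f) \;=\; H(y \mid z) \;+\; \beta\, I(x; z),
\qquad z = f(x), \quad \beta > 0
```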

In the second part of the lecture, we will dive deep into how such representations can be computed in practice, and what to do when the task is not specified at the outset.

It is common in deep learning to pre-train a model for one task (say, finding cats and dogs in images) and then fine-tune it on another (say, finding tumors in a mammogram, or controlling a self-driving car). This practice sometimes works, but until recently we have had no means of predicting how well, other than simply trying it. To understand transfer learning, the first step is to endow the space of learning tasks with a topology: when are two tasks "close"? Is a model pre-trained on task A better for task B than one pre-trained on task C? Given a number of pre-trained models, can we say which is best to fine-tune from, without actually running any experiments? We will describe a way of measuring the complexity of a learning task, which is predictive of training cost, as well as the distance between two tasks, which is predictive of the success of transfer learning.
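To make this concrete, here is a minimal PyTorch sketch, in the spirit of Fisher-information task embeddings, of one way such a distance could be instantiated. It is an assumption-laden illustration, not the lecture's exact construction; `model` and `loader` stand for any classifier and any (input, label) data loader:

```python
import torch
import torch.nn.functional as F

def fisher_embedding(model, loader, n_batches=10):
    """Embed a task as the diagonal Fisher information of a probe model.

    Approximated by the expected squared gradient of the log-likelihood,
    with labels sampled from the model's own predictive distribution.
    """
    fisher = [torch.zeros_like(p) for p in model.parameters()]
    seen = 0
    for x, _ in loader:
        if seen >= n_batches:
            break
        logits = model(x)
        # Sample labels from the model's predictions (the "empirical
        # Fisher" variant would use the true labels instead).
        y = torch.distributions.Categorical(logits=logits).sample()
        model.zero_grad()
        F.cross_entropy(logits, y).backward()
        for f, p in zip(fisher, model.parameters()):
            if p.grad is not None:
                f.add_(p.grad.detach() ** 2)
        seen += 1
    return torch.cat([f.flatten() for f in fisher]) / max(seen, 1)

def task_distance(emb_a, emb_b, eps=1e-8):
    """Cosine distance between two normalized task embeddings."""
    a = emb_a / (emb_a.norm() + eps)
    b = emb_b / (emb_b.norm() + eps)
    return 1.0 - torch.dot(a, b).item()
```

Comparing `task_distance(fisher_embedding(m, loader_A), fisher_embedding(m, loader_B))` for a shared probe model `m` then gives a (hedged) proxy for how transferable task A is to task B.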

The method of choice in deep learning is to optimize an information loss (e.g., the empirical cross-entropy) using stochastic gradient descent (SGD). Almost all analysis of this process is restricted to the asymptotics (e.g., convergence, the geometry of the loss surface, flat minima, etc.). It turns out that the very process of defining a distance between learning tasks reveals surprising properties of deep learning that point us to study not the asymptotics, but the transient: it is possible for two tasks to be very close to each other, and yet impossible to fine-tune one from the other. This means that not just the local geometry of the loss around the stationary points matters, but the entire path from one to the other. There are paths that are easy to follow in one direction but not in the reverse, even if the start and end points have the same energy. This is puzzling, and surprisingly common across tasks, loss functions, and architectures, including biological ones. Take the task of classifying images that are slightly blurred (pre-training), and then fine-tune the model to classify the same images, but sharp. If we pre-train for too long, the final performance on the latter task is worse than if we had started from scratch. An analogous experiment on cats with cataracts was performed by Hubel and Wiesel in the Sixties.
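As an illustration only (the dataset, architecture, and schedule below are assumptions, not the experiments reported in the lecture), a minimal PyTorch sketch of this blur-deficit protocol pre-trains on blurred images, then fine-tunes on the same images sharp:

```python
import torch
from torchvision import datasets, transforms
from torchvision.models import resnet18

# The same dataset, once blurred (the "deficit") and once sharp.
blurred = transforms.Compose([
    transforms.GaussianBlur(kernel_size=5, sigma=2.0),
    transforms.ToTensor(),
])
sharp = transforms.ToTensor()
deficit_data = datasets.CIFAR10("data", train=True, download=True, transform=blurred)
normal_data = datasets.CIFAR10("data", train=True, download=True, transform=sharp)

def run(model, data, epochs, lr=0.01):
    """A few epochs of plain SGD on `data`."""
    loader = torch.utils.data.DataLoader(data, batch_size=128, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            torch.nn.functional.cross_entropy(model(x), y).backward()
            opt.step()

model = resnet18(num_classes=10)
run(model, deficit_data, epochs=20)  # vary this: the longer the deficit...
run(model, normal_data, epochs=20)   # ...the worse the final sharp accuracy
```

Sweeping the length of the first `run` traces out the critical period: short deficits are fully recoverable, long ones are not.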

This phenomenon reveals that "critical learning periods", thus far believed to be a side-effect of biology, are an information phenomenon, not a biological one: the most critical phase of learning is not the asymptotic one, but the initial transient. During this phase, the information in the representation is "plastic": it can move across layers and re-organize the effective connectivity of the network. Beyond this phase, the network reduces information (forgets) while improving performance on the task; information plasticity is lost. This suggests interventions that can be applied early in learning to optimize it, and a "dynamic distance" between learning tasks that accounts not only for whether two tasks are "close", but for whether one is "accessible" or "reachable" from the other, which is critical for curriculum and lifelong learning.
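One hedged proxy for this rise and fall of "information in the weights" is the trace of the Fisher information. Reusing `fisher_embedding` and `run` from the sketches above (with an assumed `probe_loader` over held-out data), one could log it once per epoch:

```python
# Probing information plasticity: the Fisher trace typically rises during
# the initial transient, then decays as the network discards nuisance
# information even as task accuracy keeps improving.
fisher_trace = []
for epoch in range(30):
    run(model, normal_data, epochs=1)
    fisher_trace.append(fisher_embedding(model, probe_loader).sum().item())
```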

We will conclude the lecture with a digest of open problems, and suggestions for lines of investigation where the theory can help formulate testable hypotheses, in biology and in artificial neural networks alike.

Joint work with Alessandro Achille.