ICVSS Computer Vision in the Age of Large Language Models

Scaling Language and World Models: Stochastic Realization Beyond Markov and Towards Solomonoff

Stefano Soatto

Amazon and University of California Los Angeles, USA

Abstract

Machine Learning has progressed significantly as an empirical discipline, accelerated in recent years by the advent of Large Language Models (LLMs) and so-called World Models (WMs). The tools to interpret and guide such progress have been rooted in the theory of inductive learning, based on the concepts of generalization and regularization. Generalization theory eschews memorization to avoid overfitting the training set. In contrast to inductive learning, transductive inference arrives at a hypothesis by accessing all available data, past and present, at inference time. Rather than generalization, transductive inference aims for sample-specific computation, utilizing the training data ex-post rather than ex-ante. The optimal limit of transductive inference was characterized by Solomonoff in 1965, and involves no generalization but rather test-time computation based on eidetic (lossless) memory of all the past. While the Solomonoff limit is unattainable, Large Language Models have been shown empirically to be able to memorize, albeit imperfectly, and to perform universal computation, at least when augmented with a scratchpad. They also exhibit a rudimentary form of transductive inference, now called 'in-context learning.' Could it be that they are already beyond generalization, marching towards the Solomonoff limit? How can we test this hypothesis empirically? And if we want to improve their performance, and our ability to control them, what do we need to do? Do we aim to regularize, generalize, and avoid memorization? Or do we foster better memorization and avoid regularization instead?

Stochastic Realization Theory was developed in the 1970s to model sequence data. Among its goals was to define and infer functions of all past data, called 'states,' which are sufficient for predicting future data and therefore for any other decision and control task based on it. Under suitable conditions, the Separation Principle guaranteed that one could control a physical system by designing a controller for a model of the system, separately from the underlying system, which we do not and cannot directly observe. While this allowed humans to control spacecraft all the way to the moon, subsequent attempts to generalize the theory by finding finite-dimensional (Markov) sufficient statistics were frustrated until, in the 1990s, strong results on the non-existence of such finite-dimensional statistics were discovered. Since, at the time, unbounded memory and compute were unthinkable, the theory dwindled and the practice remained anchored to ideas from inductive learning and biological inspiration.

In biology, we have no choice but to practice inductive learning. Memory is hard-bounded, and there is a strong bias towards low inference latency: when facing an animal in the wild, our ancestors had to decide quickly whether it was their dinner or vice versa, with no time to revisit old memories and reason about the various hypotheses. So, optimal transductive inference is not viable in biology. In technology today, however, we no longer have hard memory limitations. Perhaps it is time to revisit some old ideas from Stochastic Realization Theory and update them for the modern age, with Solomonoff Induction in the background. Some of these ideas are being rediscovered in the design and implementation of State Space Models (SSMs). These are bilinear dynamical models, of the kind studied in the Sixties and Seventies, stacked and scaled, with their parameters trained inductively on massive sequence data.
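For reference, the 'Solomonoff limit' mentioned above can be stated concretely. The formula below is the textbook definition of Solomonoff's universal prior, included here only as background and not as part of the abstract's argument; the notation U (universal prefix machine), ℓ(p) (program length), and x* (any string beginning with x) follows the standard convention. A string is weighted by all programs that produce it, and the next symbol is predicted by conditioning.

```latex
% Solomonoff's universal prior: a string x receives the total weight of all
% programs p that make a universal prefix machine U output x (possibly
% followed by further symbols), penalized exponentially by program length l(p).
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)},
\qquad
M(x_{t+1} \mid x_{1:t}) \;=\; \frac{M(x_{1:t}\, x_{t+1})}{M(x_{1:t})} .
```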
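To make the last sentence concrete, here is a minimal NumPy sketch of a single bilinear state-space recurrence of the kind alluded to above. It is an illustration only: the function name, dimensions, and random parameters are hypothetical and do not come from any particular SSM architecture. The state update is linear in the state and in the input, with an input-modulated (bilinear) interaction term; the output is a linear readout of the state.

```python
import numpy as np

def bilinear_ssm_step(x, u, A, N, B, C):
    """One step of a discrete-time bilinear state-space model (illustrative sketch).

    State update:  x' = A x + (sum_i u_i N_i) x + B u
    Readout:       y  = C x'
    The term (sum_i u_i N_i) x is the bilinear (input-modulated) part.
    """
    x_next = A @ x + np.einsum("i,ijk,k->j", u, N, x) + B @ u
    y = C @ x_next
    return x_next, y

# Hypothetical dimensions and random parameters, for the sketch only.
n, m, p = 8, 4, 4                            # state, input, output sizes
rng = np.random.default_rng(0)
A = rng.normal(scale=0.1, size=(n, n))
N = rng.normal(scale=0.1, size=(m, n, n))    # one n-by-n matrix per input channel
B = rng.normal(scale=0.1, size=(n, m))
C = rng.normal(scale=0.1, size=(p, n))

x = np.zeros(n)
for t in range(16):                          # run the recurrence over a toy sequence
    u = rng.normal(size=m)
    x, y = bilinear_ssm_step(x, u, A, N, B, C)
```

When the bilinear tensor N is zero the layer reduces to an ordinary linear time-invariant recurrence; it is the input-dependent modulation of the state transition that makes the model bilinear. Stacking and training many such layers on sequence data gives the family of models the abstract refers to.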
Is this the right class of architectures for LLMs? Or for World Models? What is the 'language' they learn (call it Neuralese)? How do symbols emerge, and how does the trained model compose them and operate on them? To begin addressing some of these questions, I will review some basic ideas from Solomonoff Induction and Stochastic Realization Theory, and describe how current SSMs, as well as Transformers, are special cases of canonical realizations known as Controllable Canonical Forms. I will then discuss how 'meanings' emerge in these trained models, and how such meanings can be observed and controlled.
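As background for the claim about canonical realizations, the textbook single-input, single-output controllable canonical form is recalled below, purely as a standard fact; how trained SSM and attention layers map onto such realizations is the subject of the lecture and is not reproduced here.

```latex
% Controllable canonical form of a SISO discrete-time LTI system realizing
% H(z) = (b_1 z^{n-1} + ... + b_n) / (z^n + a_1 z^{n-1} + ... + a_n):
x_{t+1} =
\begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
-a_n & -a_{n-1} & -a_{n-2} & \cdots & -a_1
\end{bmatrix} x_t +
\begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix} u_t,
\qquad
y_t = \begin{bmatrix} b_n & b_{n-1} & \cdots & b_1 \end{bmatrix} x_t .
```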