What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention

Antonino Furnari, Giovanni Maria Farinella


Egocentric action anticipation consists of understanding which objects the camera wearer will interact with in the near future and which actions they will perform. We tackle this problem by proposing an architecture able to anticipate actions at multiple temporal scales using two LSTMs to 1) summarize the past and 2) formulate predictions about the future. The input video is processed considering three complementary modalities: appearance (RGB), motion (optical flow) and objects (object-based features). Modality-specific predictions are fused using a novel Modality ATTention (MATT) mechanism which learns to weigh modalities in an adaptive fashion. Extensive evaluations on two large-scale benchmark datasets show that our method outperforms prior art by up to +7% on the challenging EPIC-KITCHENS dataset, which includes more than 2,500 actions, and generalizes to EGTEA Gaze+. Our approach is also shown to generalize to the tasks of early action recognition and action recognition.



The goal of an egocentric action anticipation method is to predict future actions from an observation of the past. The input of the model is a video segment τ_o seconds long (the observation time) which precedes the start time of the action τ_s by τ_a seconds (the anticipation time). Since the future is naturally uncertain, action anticipation methods usually predict more than one possible action (e.g., the TOP-3 predictions).
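For concreteness, the snippet below sketches how the frames observed for one anticipation example could be selected given the action start time τ_s, the anticipation time τ_a and the observation time τ_o. The function name, sampling step and default values are illustrative assumptions and do not necessarily match the released code.

# Minimal sketch (hypothetical, not the released code) of how the observed
# frames of one anticipation example can be selected. Values are illustrative.

def observed_timestamps(action_start, anticipation_time=1.0,
                        observation_time=3.5, step=0.25):
    """Timestamps (in seconds) of the frames observed before the action.

    The observed segment ends anticipation_time seconds before the action
    start (tau_s) and spans observation_time seconds, sampled every step seconds.
    """
    end = action_start - anticipation_time     # last observable instant
    start = end - observation_time             # beginning of the observed segment
    n_steps = int(round(observation_time / step))
    return [start + i * step for i in range(n_steps + 1)]

# Example: an action starting at tau_s = 10.0 s, anticipated 1 s in advance.
print(observed_timestamps(10.0))  # frames sampled from 5.5 s to 9.0 s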

Method



The proposed method uses two LSTMs to 1) encode streaming observations and 2) make predictions about the future. The first LSTM, termed "Rolling LSTM", processes a frame representation at each time-step with the aim of continuously encoding the past. When a new prediction is required, the "Unrolling LSTM" is initialized from the internal state of the Rolling LSTM. The Unrolling LSTM is then unrolled for a number of time-steps equal to the number needed to reach the beginning of the action, and formulates the final prediction. The proposed method processes information according to multiple modalities (RGB frames, optical flow and object-based features). The predictions made from the different modalities are combined by a novel modality attention mechanism which learns to weigh the different modalities depending on the observed sample. See the paper for more details, and the sketch below for a rough illustration of the idea.
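The following is a simplified PyTorch sketch of this scheme, meant only to illustrate the Rolling/Unrolling interplay and the modality attention fusion. Class names, layer sizes, the number of classes and the number of unrolling steps are assumptions made for illustration; please refer to the released code for the actual implementation.

# Simplified PyTorch sketch of the Rolling-Unrolling scheme with modality
# attention. Names and sizes are assumptions, not the released implementation.
import torch
import torch.nn as nn

class RollingUnrollingBranch(nn.Module):
    """Single-modality branch: the Rolling LSTM encodes the past, the
    Unrolling LSTM is unrolled towards the action and produces scores."""
    def __init__(self, feat_dim, hidden=1024, num_classes=2513):
        super().__init__()
        self.rolling = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.unrolling = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, feats, unroll_steps):
        # feats: (batch, time, feat_dim) streaming observations of one modality
        _, state = self.rolling(feats)                       # encode the past
        last = feats[:, -1:, :].repeat(1, unroll_steps, 1)   # re-feed last observation
        out, _ = self.unrolling(last, state)                 # unroll towards the action
        scores = self.classifier(out[:, -1])                 # (batch, num_classes)
        # also return the Rolling LSTM state, used by the attention module
        return scores, torch.cat([s.squeeze(0) for s in state], dim=-1)

class ModalityAttention(nn.Module):
    """Fuses per-modality scores with sample-dependent weights (MATT-style)."""
    def __init__(self, state_dim, num_modalities=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim * num_modalities, 256), nn.ReLU(),
            nn.Linear(256, num_modalities))

    def forward(self, states, scores):
        # states: (batch, state_dim * num_modalities) concatenated LSTM states
        # scores: list of per-modality score tensors, each (batch, num_classes)
        weights = torch.softmax(self.net(states), dim=-1)    # (batch, num_modalities)
        return sum(weights[:, i:i + 1] * s for i, s in enumerate(scores))

# Usage sketch: three branches (RGB, flow, objects) fused by modality attention.
branches = [RollingUnrollingBranch(feat_dim=1024) for _ in range(3)]
matt = ModalityAttention(state_dim=2 * 1024)
feats = [torch.randn(2, 14, 1024) for _ in range(3)]    # 14 observed time-steps
scores, states = zip(*[b(f, unroll_steps=4) for b, f in zip(branches, feats)])
fused = matt(torch.cat(states, dim=-1), list(scores))   # (batch, num_classes)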

Videos

Egocentric Action Anticipation Examples

Early Action Recognition Examples

Paper

A. Furnari, G. M. Farinella, Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI). 2020. Paper

@article{furnari2020rulstm,
  year = { 2020 },
  title = { Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video },
  journal = { IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) },
  author = { Antonino Furnari and Giovanni Maria Farinella },
}


A. Furnari, G. M. Farinella, What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention. International Conference on Computer Vision. 2019. Paper

@inproceedings{furnari2019rulstm, 
  title = { What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention },
  author = { Antonino Furnari and Giovanni Maria Farinella },
  year = { 2019 },
  booktitle = { International Conference on Computer Vision },
}


Code

Please check out our code and models at https://github.com/fpv-iplab/rulstm.

Acknowledgement

This research is supported by Piano della Ricerca 2016-2018, Linea di Intervento 2, of DMI, University of Catania.

People