Next Active Object Prediction from Egocentric Video

Antonino Furnari, Sebastiano Battiato, Kristen Grauman, Giovanni Maria Farinella

Although First Person Vision systems can sense the environment from the user's perspective, they are generally unable to predict the user's intentions and goals. Since human activities can be decomposed into atomic actions and interactions with objects, intelligent wearable systems would benefit from the ability to anticipate user-object interactions. Although this task is not trivial, the First Person Vision paradigm provides important cues for addressing it. Specifically, we propose to exploit the dynamics of the scene to recognize next-active-objects before an object interaction actually begins. We train a classifier to discriminate trajectories leading to an object activation from all others, and perform next-active-object prediction by analyzing fixed-length trajectory segments within a sliding window. We investigate which properties of egocentric object motion are most discriminative for the task and evaluate the temporal support over which such motion should be considered. The proposed method compares favorably with several baselines on the ADL egocentric dataset, which was acquired by 20 subjects and contains 10 hours of video of unconstrained interactions with several objects.

Sliding window processing of object tracks. At each time step, the trained binary classifier is run over the trajectories observed in the last h frames and a confidence score is computed.
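The sliding-window step described in the caption can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the representation of a track as a list of (x, y) positions, and in particular `toy_score` are assumptions. `toy_score` is a hypothetical stand-in for the trained binary classifier and simply rewards slow apparent motion.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def sliding_window_scores(
    track: Sequence[Point],
    h: int,
    score_fn: Callable[[Sequence[Point]], float],
) -> List[Tuple[int, float]]:
    """Score the last h positions of a track at every time step.

    Returns (t, score) pairs for each frame index t at which a full
    window of h frames is available.
    """
    if h < 1:
        raise ValueError("h must be at least 1")
    scores = []
    for t in range(h - 1, len(track)):
        window = track[t - h + 1 : t + 1]  # positions from the last h frames
        scores.append((t, score_fn(window)))
    return scores


def toy_score(window: Sequence[Point]) -> float:
    """Hypothetical placeholder for the trained classifier (not the paper's model).

    Maps the mean per-frame displacement in the window to (0, 1]:
    slower apparent motion gives a higher score.
    """
    if len(window) < 2:
        return 1.0
    steps = [
        ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        for (x0, y0), (x1, y1) in zip(window, window[1:])
    ]
    return 1.0 / (1.0 + sum(steps) / len(steps))


# Made-up track: the object moves for two frames, then stays still.
track = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (6.0, 8.0), (6.0, 8.0)]
results = sliding_window_scores(track, h=3, score_fn=toy_score)
```

In practice, `score_fn` would wrap whatever classifier was trained on fixed-length trajectory segments, and one such score sequence would be computed per tracked object.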


The videos below illustrate correct predictions and failure examples. Each video shows several sequences preceding the activation point of specific next-active-objects. In each frame, we highlight ground truth next-active-objects (in red), discarded objects (in gray), and positive and negative model predictions (in green and blue, respectively). We also display the egocentric object trajectories observed in the last 30 frames of the video.


In the reported success samples, the method correctly detects the next-active-object and discards other passive objects. Note that next-active-objects do not always appear in the center of the frame and are not always static. Moreover, in many cases the method detects next-active-objects a few seconds in advance.


In the reported failure samples, the method does not correctly detect all next-active-objects or discard all passive ones. Note that, in many cases, detecting the next-active-object is not trivial even for a human observer.


[JVCI 2017] A. Furnari, S. Battiato, K. Grauman, G. M. Farinella, Next-Active-Object Prediction from Egocentric Videos, submitted to Journal of Visual Communication and Image Representation