MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain

F. Ragusa1,2, A. Furnari1,2, G. M. Farinella1,2

1FPV@IPLab, Department of Mathematics and Computer Science - University of Catania, Italy
2Next Vision s.r.l., Spin-off of the University of Catania, Italy

We are running an ICIAP competition with prizes!

Previous version: The MECCANO Dataset: Understanding Human-Object Interactions from Egocentric Videos in an Industrial-like Domain

Wearable cameras make it possible to acquire images and videos from the user's perspective. These data can be processed to understand human behavior. Although human behavior analysis has been thoroughly investigated in third-person vision, it is still understudied in egocentric settings, and in particular in industrial scenarios. To encourage research in this field, we present MECCANO, a multimodal dataset of egocentric videos to study human behavior understanding in industrial-like settings. The multimodality is characterized by the presence of gaze signals, depth maps and RGB videos acquired simultaneously with a custom headset. The dataset has been explicitly labeled for fundamental tasks in the context of human behavior understanding from a first-person view, such as recognizing and anticipating human–object interactions. With the MECCANO dataset, we explored six different tasks: (1) Action Recognition, (2) Active Objects Detection and Recognition, (3) Egocentric Human–Objects Interaction Detection, (4) Egocentric Gaze Estimation, (5) Action Anticipation and (6) Next-Active Objects Detection. We propose a benchmark aimed at studying human behavior in the considered industrial-like scenario, which demonstrates that the investigated tasks and the considered scenario are challenging for state-of-the-art algorithms.


The MECCANO dataset comprises multimodal egocentric data acquired in an industrial-like domain in which subjects built a toy model of a motorbike. The multimodality is characterized by gaze signals, depth maps and RGB videos acquired simultaneously. We considered 20 object classes, which include the 16 classes categorizing the 49 components, the two tools (screwdriver and wrench), the instructions booklet and a partial_model class.

Additional details about the MECCANO dataset:

  • 20 different subjects in 2 countries (IT, U.K.)
  • 3 modalities: RGB, Depth and Gaze
  • Video acquisition. RGB: 1920x1080 at 12.00 fps, Depth: 640x480 at 12.00 fps
  • Gaze: sampled at 200 Hz
  • 11 training videos and 9 validation/test videos
  • 8857 video segments temporally annotated with the verbs describing the actions performed
  • 64349 active objects annotated with bounding boxes in contact frames
  • 48024 next-active objects annotated in past frames
  • 89628 hands annotated with bounding boxes in past frames and contact frames
  • 12 verb classes, 20 object classes and 61 action classes

You can download the MECCANO dataset and annotations below:

RGB Videos
RGB Frames
Depth Frames
Gaze Data
Action Temporal Annotations
EHOI Verb Temporal Annotations
Active Object Bounding Box Annotations and frames
Hands Bounding Box Annotations
Next-Active Object Bounding Box Annotations


Visit our GitHub repository.


The MECCANO dataset is suitable to study a variety of tasks, considering its multimodality and the challenging industrial-like scenario in which it was acquired. We considered six tasks related to human behavior understanding, for which we provide baseline results: 1) Action Recognition, 2) Active Objects Detection and Recognition, 3) Egocentric Human–Objects Interaction Detection, 4) Egocentric Gaze Estimation, 5) Action Anticipation and 6) Next-Active Objects Detection.

Action Recognition

Action Recognition consists of determining the action performed by the camera wearer from an egocentric video segment. Given a segment, the aim is to assign it the correct action class.
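Action recognition benchmarks are typically reported with top-k accuracy: a segment counts as correct if its ground-truth class appears among the model's k highest-scoring classes. A minimal sketch (the function and variable names below are ours, not part of the official evaluation code):

```python
import numpy as np

def topk_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 1) -> float:
    """Fraction of segments whose true class is among the k highest-scoring classes.

    scores: (num_segments, num_classes) classification scores
    labels: (num_segments,) ground-truth class indices
    """
    topk = np.argsort(scores, axis=1)[:, -k:]      # indices of the k best classes
    hits = (topk == labels[:, None]).any(axis=1)   # is the true class among them?
    return float(hits.mean())

# Toy example with 3 segments and 4 hypothetical action classes
scores = np.array([[0.1, 0.7, 0.1, 0.1],
                   [0.4, 0.2, 0.3, 0.1],
                   [0.2, 0.2, 0.5, 0.1]])
labels = np.array([1, 2, 2])
print(topk_accuracy(scores, labels, k=1))  # 2 of 3 correct -> 0.666...
print(topk_accuracy(scores, labels, k=2))  # all true classes in the top-2 -> 1.0
```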

Active Object Detection and Recognition

The aim of the Active Object Detection task is to localize with a bounding box each active object involved in EHOIs. The Active Object Recognition task additionally requires assigning each detected object the correct class label among the 20 object classes of the MECCANO dataset.

EHOI Detection

The goal is to determine egocentric human–object interactions (EHOIs) in each image. In particular, the aim is to detect and recognize all the active objects in the scene with bounding boxes, as well as the verb describing the action performed by the human.
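Detection tasks like these are usually scored by matching predicted boxes to ground-truth boxes via Intersection-over-Union (a 0.5 threshold is conventional), with the additional requirement that the predicted class, and for EHOI the verb, agree. A minimal IoU sketch in (x1, y1, x2, y2) box format (our own helper, not the official evaluation code):

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes overlapping by half: intersection 50, union 150
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # -> 0.333...
```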

Egocentric Gaze Estimation

The goal of the Egocentric Gaze Estimation task is to predict the 2D gaze location in each frame of an input video clip of spatial resolution H×W and fixed length T.
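A common way to score 2D gaze predictions is the average Euclidean distance (in pixels) between the predicted and ground-truth gaze point over the frames of a clip. A minimal sketch, with names of our own choosing:

```python
import math

def mean_gaze_error(preds, gts):
    """Average per-frame Euclidean distance between predicted and
    ground-truth 2D gaze locations, both given as (x, y) pixel coordinates."""
    assert len(preds) == len(gts) and len(preds) > 0
    return sum(math.dist(p, g) for p, g in zip(preds, gts)) / len(preds)

# Two frames: errors of 0 and 5 pixels average to 2.5
print(mean_gaze_error([(0, 0), (3, 4)], [(0, 0), (0, 0)]))  # -> 2.5
```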

Action Anticipation

The goal of the Action Anticipation task is to predict future egocentric actions from an observation of the past.
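In the usual egocentric anticipation protocol, the model observes a past segment of fixed length that ends some "anticipation time" before the action actually starts. The timing can be sketched as follows (the function and parameter names are ours, for illustration):

```python
def observed_segment(action_start, observation_time, anticipation_time):
    """Return the (start, end) times, in seconds, of the past video segment a
    model observes to anticipate an action beginning at `action_start`.
    Observation stops `anticipation_time` seconds before the action starts."""
    end = action_start - anticipation_time
    start = max(0.0, end - observation_time)
    return start, end

# Anticipate an action starting at t=30 s, observing 2 s of video
# up to 1 s before the action begins: the model sees [27 s, 29 s].
print(observed_segment(30.0, 2.0, 1.0))  # -> (27.0, 29.0)
```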

Next-Active Object Detection

The aim of the Next-Active Object Detection task is to detect and recognize all the objects which will be involved in a future interaction.

Other Tasks

MECCANO is also suitable for exploring tasks other than those considered in this work.
  • Procedural Learning: Given multiple videos of a task, the goal is to identify the key steps and their order to perform the task. More details and annotations are available here. Please consider citing also the following work if you use these annotations:
                @inproceedings{bansal2022myview,
                    author    = "Bansal, Siddhant and Arora, Chetan and Jawahar, C.V.",
                    title     = "My View is the Best View: Procedure Learning from Egocentric Videos",
                    booktitle = "European Conference on Computer Vision (ECCV)",
                    year      = "2022"
                }


F. Ragusa, A. Furnari, G. M. Farinella. MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain. Computer Vision and Image Understanding 2023.

Cite our paper: CVIU or Arxiv.
                @article{ragusa2023meccano,
                    doi     = {},
                    url     = {},
                    year    = {2023},
                    title   = {MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain},
                    journal = {Computer Vision and Image Understanding (CVIU)},
                    author  = {Francesco Ragusa and Antonino Furnari and Giovanni Maria Farinella}
                }

Additionally, cite the original paper: CVF or Arxiv.
                @inproceedings{ragusa2021meccano,
                    author    = {Ragusa, Francesco and Furnari, Antonino and Livatino, Salvatore and Farinella, Giovanni Maria},
                    title     = {The MECCANO Dataset: Understanding Human-Object Interactions From Egocentric Videos in an Industrial-Like Domain},
                    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
                    month     = {January},
                    year      = {2021},
                    pages     = {1569-1578}
                }

Supplementary Material

More details on the dataset and the annotation phase can be found in the supplementary material associated with the publication.


This research is supported by Next Vision, by MISE - PON I&C 2014-2020 - Progetto ENIGMA - Prog. n. F/190050/02/X44 – CUP: B61B19000520008, and by MIUR AIM - Attrazione e Mobilità Internazionale Linea 1 - AIM1893589 - CUP: E64118002540007.