The MECCANO Dataset: Understanding Human-Object Interactions from Egocentric Videos in an Industrial-like Domain

F. Ragusa1,3, A. Furnari1, S. Livatino2, G. M. Farinella1

1IPLab, Department of Mathematics and Computer Science - University of Catania, IT
2University of Hertfordshire, Hatfield, Hertfordshire, U.K.
3Xenia Gestione Documentale s.r.l. - Xenia Progetti s.r.l., Acicastello, Catania, IT

Wearable cameras make it possible to collect images and videos of humans interacting with the world. While human-object interactions have been thoroughly investigated in third-person vision, the problem has been understudied in egocentric settings and in industrial scenarios. To fill this gap, we introduce MECCANO, the first dataset of egocentric videos for studying human-object interactions in industrial-like settings. MECCANO was acquired by 20 participants who were asked to build a motorbike model, for which they had to interact with tiny objects and tools. The dataset has been explicitly labeled for the task of recognizing human-object interactions from an egocentric perspective: each interaction has been labeled both temporally (with action segments) and spatially (with active object bounding boxes). With the proposed dataset, we investigate four tasks: 1) action recognition, 2) active object detection, 3) active object recognition and 4) egocentric human-object interaction detection, a revisited version of the standard human-object interaction detection task. Baseline results show that the MECCANO dataset is a challenging benchmark for studying egocentric human-object interactions in industrial-like scenarios.


The MECCANO dataset has been acquired in an industrial-like scenario in which subjects built a toy model of a motorbike. We considered 20 object classes which include the 16 classes categorizing the 49 components, the two tools (screwdriver and wrench), the instructions booklet and a partial_model class.

Additional details about the MECCANO dataset:

  • 20 different subjects in 2 countries (IT, U.K.)
  • Video acquisition: 1920x1080 at 12 fps
  • 11 training videos and 9 validation/test videos
  • 8,857 temporally annotated video segments indicating the verbs that describe the performed actions
  • 64,349 active objects annotated with bounding boxes
  • 12 verb classes, 20 object classes and 61 action classes
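The temporal annotations above associate each video segment with an action class. A minimal sketch of how such a record could be parsed is shown below; the CSV-like layout (video id, start frame, end frame, action) is a hypothetical assumption for illustration and may differ from the released annotation files:

```python
from dataclasses import dataclass

@dataclass
class ActionSegment:
    """A temporally annotated video segment (hypothetical layout)."""
    video_id: str
    start_frame: int
    end_frame: int
    action: str  # one of the 61 action classes

def parse_segment(row: str) -> ActionSegment:
    # Hypothetical comma-separated layout: video_id,start,end,action
    video_id, start, end, action = row.strip().split(",")
    return ActionSegment(video_id, int(start), int(end), action)

seg = parse_segment("0001,120,184,take_screwdriver")
```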

You can download the MECCANO dataset and annotations at: .


Visit our GitHub repository.


The MECCANO dataset is suitable to study a variety of tasks, considering the challenging industrial-like scenario in which it was acquired. We consider four tasks for which we provide baseline results: 1) Action Recognition, 2) Active Object Detection, 3) Active Object Recognition and 4) Egocentric Human-Object Interaction (EHOI) Detection.

Action Recognition

Action recognition consists of determining the action performed by the camera wearer from an egocentric video segment: given a segment, the aim is to assign the correct action class.
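Classification tasks of this kind are typically evaluated with top-k accuracy over the action classes. The following sketch illustrates the metric on toy scores; it is an illustrative implementation, not the official evaluation code:

```python
def topk_accuracy(scores, labels, k=1):
    """Fraction of segments whose true class is among the k highest-scored
    classes. `scores` is a list of per-class score lists (one per segment),
    `labels` the list of ground-truth class indices."""
    correct = 0
    for s, y in zip(scores, labels):
        topk = sorted(range(len(s)), key=lambda i: s[i], reverse=True)[:k]
        correct += y in topk
    return correct / len(labels)

# Toy example: 3 segments, 4 classes
scores = [[0.1, 0.7, 0.1, 0.1],
          [0.4, 0.3, 0.2, 0.1],
          [0.2, 0.2, 0.5, 0.1]]
labels = [1, 2, 2]
acc1 = topk_accuracy(scores, labels, k=1)  # 2 of 3 segments correct
```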

Active Object Detection

The aim of the active object detection task is to detect, with a bounding box, all the active objects involved in EHOIs.
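Detection tasks like this one are commonly scored by matching predicted and ground-truth boxes via Intersection over Union (IoU), counting a prediction as a true positive when the overlap exceeds a threshold such as 0.5. A minimal sketch of this standard matching rule (not the official evaluation code):

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# These two boxes overlap by only 1/3, so the prediction would not count
# as a true positive at the usual 0.5 threshold.
hit = iou((0, 0, 10, 10), (5, 0, 15, 10)) >= 0.5
```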

Active Object Recognition

The task consists of detecting the active objects involved in EHOIs with a bounding box and assigning them the correct label among the 20 object classes of the MECCANO dataset.

EHOI Detection

The goal is to determine the egocentric human-object interactions (EHOIs) occurring in each image. In particular, the aim is to detect and recognize all the active objects in the scene with bounding boxes, as well as the verb describing the action performed by the human.
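An EHOI prediction can thus be seen as a (verb, object class, bounding box) triplet. The sketch below shows one plausible matching rule for such triplets: verb and class must agree and the boxes must overlap by at least 0.5 IoU. This is an illustrative rule; the official evaluation protocol may differ:

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def ehoi_correct(pred, gt, thr=0.5):
    """Match a predicted EHOI triplet (verb, object class, box) against a
    ground-truth one: verb and class must agree, boxes must overlap."""
    pv, pc, pb = pred
    gv, gc, gb = gt
    return pv == gv and pc == gc and box_iou(pb, gb) >= thr

ok = ehoi_correct(("take", "screwdriver", (10, 10, 50, 50)),
                  ("take", "screwdriver", (12, 12, 52, 52)))
```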


F. Ragusa, A. Furnari, S. Livatino, G. M. Farinella. The MECCANO Dataset: Understanding Human-Object Interactions from Egocentric Videos in an Industrial-like Domain. In IEEE Winter Conference on Applications of Computer Vision (WACV) 2021. Download the paper.

Supplementary Material

More details on the dataset, the annotation phase, the implementation and the experiments can be found in the supplementary material associated with the publication.


This research has been supported by MIUR PON R&I 2014-2020 - Dottorati innovativi con caratterizzazione industriale, by MIUR AIM - Attrazione e Mobilità Internazionale Linea 1 - AIM1893589 - CUP: E64118002540007, and by MISE - PON I&C 2014-2020 - Progetto ENIGMA - Prog n. F/190050/02/X44 - CUP: B61B19000520008.