Multimodal Action Recognition on the MECCANO Dataset

ICIAP Competition with Prize!

F. Ragusa1,2, A. Furnari1,2, G. M. Farinella1,2

1FPV@IPLab, Department of Mathematics and Computer Science - University of Catania, Italy
2Next Vision s.r.l., Spin-off of the University of Catania, Italy

{francesco.ragusa, antonino.furnari, giovanni.farinella}

Understanding workers' behaviour in industrial environments is an underexplored topic due to the lack of public benchmark datasets. One of the most valuable things to know about users is which actions they are performing. In this context, we propose a competition on MECCANO, a multimodal egocentric dataset acquired in an industrial-like domain in which subjects assemble a toy model of a motorbike. Each signal provides additional information about the observed environment and the camera wearer: semantic information (RGB), 3D information about the environment and the objects (depth), and the user's attention (gaze), all of which can be exploited to recognize human actions.

Workshop Schedule

Monday 11 September 2023. All times are local to Udine (GMT+2).

14:00-14:15 — Opening Session
Oswald Lanz (Free University of Bozen-Bolzano & Covision Lab)

14:50-15:05 — Report on the challenge and announcement of the winners

Accepted Reports Presentation – Session 1
  • A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion Recognition. Benjia Zhou, Yang Zhao, Jun Wan (Macau University of Science and Technology; Xiamen University; Institute of Automation, Chinese Academy of Sciences, CASIA) (remotely)
  • Action Recognition on the MECCANO Dataset with Gate-Shift-Fuse Networks. Edoardo Bianchi, Oswald Lanz (Free University of Bozen-Bolzano) (in person)

15:25-16:00 — Coffee Break

Accepted Reports Presentation – Session 2
  • Ensemble Modeling for Multimodal Visual Action Recognition. Jyoti Kini, Sarah Fleischer, Ishan Dave, Mubarak Shah (Center for Research in Computer Vision, University of Central Florida) (remotely)
  • A novel UniFormer architecture for action recognition on the MECCANO benchmark. Yaxin Hu, Erhardt Barth (University of Lübeck; Pattern Recognition Company GmbH) (in person)
  • GADDCCANet for Multimodal Action Recognition. Kai Liu, Lei Gao, Ling Guan (Toronto Metropolitan University) (remotely)

16:30-16:45 — Closing Session

The Challenge

Methods are expected to take the RGB, depth, and gaze signals as input to predict an action. Algorithms may also opt to process only a subset of these signals. The following figure shows the architecture of the baseline used to tackle this task with the RGB, depth and gaze signals.
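The baseline architecture itself is not reproduced here, but one common way to combine per-modality predictions is score-level (late) fusion. The sketch below is an illustrative assumption, not the actual baseline: the modality names, weights and class count are made up for the example.

```python
def late_fusion(modality_scores, weights=None):
    """Fuse per-modality class scores into a single prediction.

    modality_scores: dict mapping modality name -> list of per-class scores.
    weights: optional dict of per-modality weights (defaults to uniform).
    Returns the fused score list and the index of the predicted class.
    """
    modalities = list(modality_scores)
    n_classes = len(next(iter(modality_scores.values())))
    if weights is None:
        # uniform weighting across the available modalities
        weights = {m: 1.0 / len(modalities) for m in modalities}
    fused = [sum(weights[m] * modality_scores[m][c] for m in modalities)
             for c in range(n_classes)]
    predicted = max(range(n_classes), key=lambda c: fused[c])
    return fused, predicted

# toy example with 4 hypothetical action classes
scores = {
    "rgb":   [0.1, 0.6, 0.2, 0.1],
    "depth": [0.2, 0.3, 0.4, 0.1],
    "gaze":  [0.1, 0.5, 0.3, 0.1],
}
fused, pred = late_fusion(scores)  # class 1 wins after averaging
```

A method restricted to a subset of the signals simply passes fewer entries in `modality_scores`; the uniform weights then renormalize over the modalities actually used.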

Participants will compete on the proposed task to obtain the best score on the provided test split of the MECCANO dataset, available on the challenge webpage. Participants must submit their results to the organizers via email in the form of a technical report (see emails below). Reports should evaluate action recognition using Top-1 and Top-5 accuracy computed on the whole test set. Submissions will be ranked by Top-1 accuracy on the test split. All technical reports, together with the results, will be published online on the challenge webpage. The authors of the top-3 results will be asked to release the code of their methods to ensure reproducibility of the experiments. Authors must use the ICIAP format for their technical reports, with a maximum of 4 pages excluding references.
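The two evaluation metrics are standard: a prediction counts as correct under Top-k accuracy if the true class is among the k highest-scoring classes. A minimal reference implementation (a sketch, not the official evaluation script):

```python
def topk_accuracy(scores, labels, k=1):
    """Top-k accuracy over a set of samples.

    scores: list of per-class score lists, one per sample.
    labels: list of true class indices, one per sample.
    Returns the fraction of samples whose true label is
    among the k highest-scoring classes.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # class indices ranked by descending score; keep the top k
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)

# toy example: 3 samples, 4 action classes
scores = [[0.1, 0.5, 0.3, 0.1],
          [0.6, 0.1, 0.25, 0.05],
          [0.2, 0.25, 0.5, 0.05]]
labels = [1, 2, 2]
top1 = topk_accuracy(scores, labels, k=1)  # 2 of 3 correct
top2 = topk_accuracy(scores, labels, k=2)  # all 3 correct
```

Top-1 with k=1 is ordinary accuracy; the challenge additionally reports k=5 over the 61 action classes.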

Technical reports to participate in the challenge, as well as questions, should be submitted via email to the organizers: {francesco.ragusa, antonino.furnari, giovanni.farinella}


We will award a prize of €300.00, sponsored by Next Vision s.r.l., to the winner (first place) of the competition.

Leaderboard 2023

Rank Team Top-1 Accuracy Top-5 Accuracy Technical Report
1 UCF 52.82 83.85
2 UNIBZ 52.57 81.53
3 LUBECK 51.82 83.35
4 MACAU 50.30 78.46
5 Baseline (RGB-Depth-Gaze) 49.66 77.82
6 TORONTO 49.52 74.21
7 Baseline (RGB-Depth) 49.49 77.61
8 CUNY 24.69 52.46


MECCANO comprises multimodal egocentric data acquired in an industrial-like domain in which subjects built a toy model of a motorbike. The multimodality is given by the gaze signal, depth maps and RGB videos, acquired simultaneously. We considered 20 object classes, which include the 16 classes categorizing the 49 components, the two tools (screwdriver and wrench), the instruction booklet and a partial_model class.

Additional details about the MECCANO dataset:

  • 20 different subjects in 2 countries (IT, U.K.)
  • 3 modalities: RGB, Depth and Gaze
  • Video acquisition. RGB: 1920x1080 at 12 fps; depth: 640x480 at 12 fps
  • Gaze: sampled at 200 Hz
  • 11 training videos and 9 validation/test videos
  • 8857 video segments temporally annotated with the verbs that describe the actions performed
  • 64349 active objects annotated with bounding boxes in contact frames
  • 48024 next-active objects annotated in past frames
  • 89628 hands annotated with bounding boxes in past frames and contact frames
  • 12 verb classes, 20 objects classes and 61 action classes
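Since gaze is sampled at 200 Hz while video runs at 12 fps, each video frame spans roughly 16-17 gaze samples, which must be aligned when the two modalities are used together. A hypothetical alignment sketch, assuming gaze timestamps in seconds on the same clock as the video (the actual annotation format may differ):

```python
def gaze_samples_for_frame(frame_idx, gaze_timestamps, fps=12.0):
    """Indices of gaze samples whose timestamp falls within the
    time span of the given video frame.

    frame_idx: 0-based frame index.
    gaze_timestamps: list of gaze timestamps in seconds.
    """
    start = frame_idx / fps        # frame start time
    end = (frame_idx + 1) / fps    # frame end time (exclusive)
    return [i for i, t in enumerate(gaze_timestamps) if start <= t < end]

# synthetic 200 Hz gaze stream covering the first second of video
gaze_ts = [i / 200.0 for i in range(200)]
samples = gaze_samples_for_frame(0, gaze_ts)  # about 200/12 ≈ 17 samples
```

Aggregating these samples (e.g. averaging the gaze points per frame) is one simple way to bring the 200 Hz signal down to the 12 fps video rate.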

You can download the MECCANO dataset and annotations at: .

RGB Videos
RGB Frames
Depth Frames
Gaze Data
Action Temporal Annotations
EHOI Verb Temporal Annotations
Active Object Bounding Box Annotations and frames
Hands Bounding Box Annotations
Next-Active Object Bounding Box Annotations


The code of the baselines and models is publicly available in our GitHub repository.


Important Dates

Competition opening, training/test data available: May 20 2023
Test results submission: July 31 2023
Notification of accepted reports: August 07 2023
Announcement of the winners: during the conference

Reference Papers

F. Ragusa, A. Furnari, G. M. Farinella. MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain. Computer Vision and Image Understanding (CVIU), 2023.

F. Ragusa, A. Furnari, S. Livatino, G. M. Farinella. The MECCANO Dataset: Understanding Human-Object Interactions From Egocentric Videos in an Industrial-Like Domain. Available on CVF Open Access and arXiv.