Multimodal Action Recognition on the MECCANO Dataset

Workshop Schedule

Monday 11 September 2023. All dates are local to Udine's time, GMT+2.

Time	Event	Authors
14:00-14:15	Opening Session
14:15-14:50	Keynote	Oswald Lanz (Free University of Bozen-Bolzano & Covision Lab)
14:50-15:05	Report on the challenge and announcement of the winners
Accepted Reports Presentation – Session 1
15:05-15:15	A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion Recognition	Benjia Zhou, Yang Zhao, Jun Wan (Macau University of Science and Technology, Xiamen University, Institute of Automation, Chinese Academy of Sciences- CASIA) (Remotely)
15:15-15:25	Action Recognition on the MECCANO Dataset with Gate-Shift-Fuse Networks	Edoardo Bianchi, Oswald Lanz (Free University of Bozen-Bolzano) (In person)
15:25-16:00 Coffee Break
Accepted Reports Presentation – Session 2
16:00-16:10	Ensamble Modeling for Multimodal Visual Action Recognition	Jyoti Kini, Sarah Fleischer, Ishan Dave, Mubarak Shah (Center for Research in Computer Vision, University of Central Florida) (Remotely)
16:10-16:20	A novel UniFormer architecture for action recognition on the MECCANO benchmark	Yaxin Hu, Erhardt Barth (University of Lubeck, Pattern Recognition Company GmbH) (In person)
16:20-16:30	GADDCCANet for Multimodal Action Recognition	Kai Liu, Lei Gao, Ling Guan (Toronto Metropolitan University) (Remotely)
16:30-16:45	Closing Session

The Challenge

Methods are expected to take as input RGB, Depth, and Gaze signals to predict an action. Algorithms may also opt to process only a subset of these signals. The following figure shows the architecture of the baseline used to achieve this task using RGB, Depth and Gaze signals.

The participants will compete for the proposed task to obtain the best score on the provided test split of the MECCANO dataset available at the webpage of the challenge. Participants must submit their results via email to the organizers through a technical report (see emails below). Reports should evaluate action recognition using Top-1 and Top-5 accuracy computed on the whole test set. Submissions will be sorted based on Top-1 accuracy on the test split. All technical reports together with results will be published online in the web page of the challenge. The TOP-3 best results will be asked to release the code of their methods to ensure repetibility of experiments. Authors must use the ICIAP format for submitting their Tecnical Reports. The maximum number of pages is 4 excluding references.

Technical reports to participate the challenge as well as questions have to be submitted via email to the organizers: {francesco.ragusa, antonino.furnari, giovanni.farinella}@unict.it.

Prize

We will provide a prize of € 300,00 sponsored by Next Vision s.r.l., to the winner (first place) of the competition.

Leaderboard 2023

Rank	Team	Top-1 Accuracy	Top-5 Accuracy
1	UCF	52.82	83.85
2	UNIBZ	52.57	81.53
3	LUBECK	51.82	83.35
4	MACAU	50.30	78.46
5	Baseline (RGB-Depth-Gaze)	49.66	77.82
6	TORONTO	49.52	74.21
7	Baseline (RGB-Depth)	49.49	77.61
8	CUNY	24.69	52.46

Dataset

MECCANO Multimodal comprises multimodal egocentric data acquired in an industrial-like domain in which subjects built a toy model of a motorbike. The multimodality is characterized by the gaze signal, depth maps and RGB videos acquired simultaneously. We considered 20 object classes which include the 16 classes categorizing the 49 components, the two tools (screwdriver and wrench), the instructions booklet and a partial_model class.

Additional details related to the MECCANO:

20 different subjects in 2 countries (IT, U.K.)
3 modalities: RGB, Depth and Gaze
Video Acquisition. RGB: 1920x1080 12.00 fps, Depth: 640x480 12.00 fps
Gaze: frequency at 200Hz
11 training videos and 9 validation/test videos
8857 video segments temporally annotated indicating the verbs which describe the actions performed
64349 active objects annotated with bounding boxes in contact frames
48024 next-active objects annotated in past frames
89628 hands annotated with bounding boxes in past frames and contact frames
12 verb classes, 20 objects classes and 61 action classes

You can download the MECCANO dataset and annotations at: .

RGB Videos

RGB Frames

Depth Frames

Gaze Data

Action Temporal Annotations

EHOI Verb Temporal Annotations

Active Object Bounding Box Annotations

and frames

Hands Bounding Box Annotations

Next-Active Object Bounding Box Annotations

Code

Code of baselines and models are publicly available at our github repository.

Dates

Competition opening, Training/Test data available:	May 20 2023
Test results submission:	July 31 2023
Notification of accepted reports:	August 07 2023
Announcement of the winners:	During the conference

Reference Papers

F. Ragusa, A. Furnari, G. M. Farinella. MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain. Computer Vision and Image Understanding 2023. CVIU.

F. Ragusa, A. Furnari, S. Livatino, G. M. Farinella. The MECCANO Dataset: Understanding Human-Object Interactions From Egocentric Videos in an Industrial-Like Domain. CVF or Arxiv.