We consider the problem of detecting and recognizing the objects observed by visitors (i.e., attended objects) in cultural sites from egocentric vision. A standard approach to the problem involves detecting all objects and selecting the one which best overlaps with the gaze of the visitor, measured through a gaze tracker. Since labeling large amounts of data to train a standard object detector is expensive in terms of costs and time, we propose a weakly supervised version of the task which relies only on gaze data and a frame-level label indicating the class of the attended object. To study the problem, we present a new dataset composed of egocentric videos and gaze coordinates of subjects visiting a museum. We then compare three different baselines for weakly supervised attended object detection on the collected data. Results show that the considered approaches achieve satisfactory performance in a weakly supervised manner, which allows for significant time savings with respect to a fully supervised detector based on Faster R-CNN. To encourage research on the topic, we publicly release the code and the dataset.
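The "standard approach" mentioned above can be sketched as follows. This is a minimal illustrative implementation, not the paper's code: the box format `(x1, y1, x2, y2, score, label)` and the rule of breaking ties by detector confidence are assumptions made here for clarity.

```python
# Hypothetical sketch of the standard pipeline: run an object detector,
# then pick the detection that best overlaps the visitor's 2D gaze point.
# Box layout (x1, y1, x2, y2, score, label) is an assumption of this sketch.

def select_attended_object(detections, gaze):
    """detections: list of (x1, y1, x2, y2, score, label) tuples.
    gaze: (gx, gy) image coordinates from the gaze tracker.
    Returns the label of the box containing the gaze point (most confident
    one if several overlap), or None if the gaze lies outside all boxes."""
    gx, gy = gaze
    # Keep only the boxes that actually contain the gaze point.
    hits = [d for d in detections
            if d[0] <= gx <= d[2] and d[1] <= gy <= d[3]]
    if not hits:
        return None
    # Among overlapping boxes, return the label of the most confident one.
    return max(hits, key=lambda d: d[4])[5]
```

In practice one might also smooth the gaze signal over time or use gaze-to-box distance instead of strict containment; the sketch above only illustrates the basic selection step.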
We asked 7 subjects (aged between 24 and 40) to capture egocentric videos while visiting a cultural site containing 15 objects of interest. Videos have been acquired using a head-mounted Microsoft HoloLens 2 device in room V of Palazzo Bellomo, located in Siracusa. To acquire the videos and the gaze coordinates, we developed a HoloLens 2 application based on Unity.
Additional details about the dataset can be found in the following paper:
M. Mazzamuto, F. Ragusa, A. Furnari, G. Signorello, G. M. Farinella. Weakly Supervised Attended Object Detection Using Gaze Data as Annotations. International Conference on Image Analysis and Processing (ICIAP 2021).