Weakly Supervised Attended Object Detection Using Gaze Data as Annotations

M. Mazzamuto1, F. Ragusa1,3, A. Furnari1,3, G. Signorello3, G. M. Farinella1,3

1FPV@IPLAB, DMI - University of Catania, IT
2Next Vision s.r.l. - Spinoff of the university of Catania, IT
3CUTGANA - University of Catania, IT

We consider the problem of detecting and recognizing the objects observed by visitors (i.e., attended objects) in cultural sites from egocentric vision. A standard approach to the problem involves detecting all objects and selecting the one which best overlaps with the gaze of the visitor, measured through a gaze tracker. Since labeling large amounts of data to train a standard object detector is expensive in terms of costs and time, we propose a weakly supervised version of the task which leans only on gaze data and a frame-level label indicating the class of the attended object. To study the problem, we present a new dataset composed of egocentric videos and gaze coordinates of subjects visiting a museum. We hence compare three different baselines for weakly supervised attended object detection on the collected data. Results show that the considered approaches achieve satisfactory performance in a weakly supervised manner, which allows for significant time savings with respect to a fully supervised detector based on Faster R-CNN. To encourage research on the topic, we publicly release the code and the dataset.


We asked 7 subjects (aged between 24 and 40) to capture egocentric videos while visiting a cultural site containing 15 objects of interest. Videos have been acquired using a head-mounted Microsoft HoloLens2 device in room V of the Palazzo Bellomo located in Siracusa. In order to acquire the videos and the gaze coordinates, we developed a HoloLens2 application based on Unity.

Additional details related to the dataset:

  • 7 subjects (aged between 24 and 40)
  • Video Acquisition: 2272×1278 pixels at 30 fps
  • 11 training videos and 3 validation/test videos
  • 178,977 frames with objects of interest annotated with bounding boxes
  • 15 objects of interest (8 of the considered objects of interest represent details of the artwork “Annunciazione”).

You can download the dataset and annotations at: .


Visit our GitHub repository.


We have investigated three different approaches to predict the attended object in a weakly supervised manner, relying only on gaze data and a frame-level label of the attended object: 1) a Sliding Window approach, 2) Fully Convolutional attended object detection, and 3) Finetuned Fully Convolutional attended object detection. We also compare them with 4) a fully-supervised attended object detector. The approaches are described in the following sections.

Sliding Window approach

We first investigate a simple approach which exploits a ResNet18 CNN to obtain a semantic segmentation mask by classifying each image patch to infer whether it belongs to one of the objects of interest or to none of them (“other”). At test time, we classify all image patches (of size 300×300 pixels) within the image using a sliding window. The result is a segmentation mask where each element is an integer between 0 and 15 (the IDs of the considered classes).

Fully Convolutional attended object detection

The Sliding Window approach has the main drawback of being slow. Indeed, processing an image at full resolution (e.g., 2272×1278 pixels) takes up to 168 seconds on a Tesla K80 GPU. To speed up the approach, we modify the trained ResNet by removing the Global Average Pooling operation and replacing the fully connected classifier with a 1×1 convolutional layer.

Finetuned Fully Convolutional Attended Object Detection

While the fully convolutional approach is much faster than the one based on the sliding window, we found the latter to be more accurate in our experiments. We therefore use a sample of the coarse segmentation masks extracted from the training set with the sliding window approach to finetune the fully convolutional model.
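One way to sketch this finetuning is a pixelwise cross-entropy loss between the dense scores (resized to the mask resolution) and the coarse sliding-window masks. The tiny stand-in network, optimizer, and hyperparameters below are assumptions for illustration, not the paper's training setup.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 16  # 15 objects of interest + "other"

# Tiny stand-in for the fully convolutional model described above.
fcn = nn.Conv2d(3, NUM_CLASSES, kernel_size=1)

optimizer = torch.optim.SGD(fcn.parameters(), lr=1e-3)  # assumed hyperparameters
criterion = nn.CrossEntropyLoss()

def finetune_step(image, coarse_mask):
    """One finetuning step against a coarse sliding-window mask.

    image: (B, 3, H, W) float tensor; coarse_mask: (B, h, w) integer labels.
    Dense scores are resized to the mask resolution before computing the
    pixelwise cross-entropy.
    """
    scores = fcn(image)
    scores = nn.functional.interpolate(
        scores, size=coarse_mask.shape[-2:], mode="bilinear", align_corners=False)
    loss = criterion(scores, coarse_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

loss = finetune_step(torch.rand(2, 3, 64, 64),
                     torch.randint(0, NUM_CLASSES, (2, 8, 8)))
```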

Fully-supervised attended object detector

We compare the proposed weakly supervised approaches with a fully-supervised baseline based on a Faster R-CNN object detector. The detector outputs the bounding boxes (2D coordinates) of all objects of interest present in the image. Using the 2D gaze coordinates, we select the bounding box which contains the gaze point.
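The gaze-based selection step can be sketched in a few lines. The `(x1, y1, x2, y2)` box format follows torchvision's Faster R-CNN convention; breaking ties between overlapping boxes by detection score is an assumption.

```python
def select_attended_box(boxes, scores, labels, gaze):
    """Return (box, label) of the highest-scoring detection containing the
    gaze point, or None if the gaze falls outside every box.

    boxes: iterable of (x1, y1, x2, y2); scores/labels: per-box values;
    gaze: (x, y) gaze coordinates in image space.
    """
    gx, gy = gaze
    best = None
    for box, score, label in zip(boxes, scores, labels):
        x1, y1, x2, y2 = box
        if x1 <= gx <= x2 and y1 <= gy <= y2:
            if best is None or score > best[0]:
                best = (score, box, label)
    return None if best is None else (best[1], best[2])

# Example: two detections, with the gaze falling inside the second one.
boxes = [(100, 100, 400, 300), (500, 200, 900, 600)]
result = select_attended_box(boxes, [0.9, 0.8], [3, 7], gaze=(650, 400))
print(result)  # ((500, 200, 900, 600), 7)
```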


M. Mazzamuto, F. Ragusa, A. Furnari, G. Signorello, G. M. Farinella. Weakly Supervised Attended Object Detection Using Gaze Data as Annotations. International Conference on Image Analysis and Processing (ICIAP 2021). Download the paper.


This research has been supported by Next Vision s.r.l., by the project VALUE (N. 08CT6209090207 - CUP G69J18001060007) - PO FESR 2014/2020 - Azione 1.1.5, and by the Research Program Pia.ce.ri. 2020/2022 Linea 2 - University of Catania.