Learning to Detect Attended Objects in Cultural Sites with Gaze Signals and Weak Object Supervision

Abstract

Cultural sites such as museums and monuments are popular tourist destinations worldwide. Visitors come to these places to learn about the cultures, histories and arts of a particular region or country. However, for many cultural sites, traditional visiting approaches are limited and may fail to engage visitors. To enhance visitors' experiences, previous works have explored how wearable devices can be exploited in this context. Among the many functions that these devices can offer, understanding which artwork or detail the user is attending to is fundamental to provide additional information on the observed artworks, understand the visitor's tastes and provide recommendations. This motivates the development of algorithms for understanding visitor attention from egocentric images. We consider the attended object detection task, which involves detecting and recognizing the object observed by the camera wearer from an input RGB image and gaze signals. To study the problem, we collected a dataset of egocentric images acquired by subjects visiting a museum. Since collecting and labeling data in cultural sites for real applications is a time-consuming process, we present a study comparing unsupervised, weakly supervised, and fully supervised approaches for attended object detection. We evaluate the considered approaches on the collected dataset, also assessing the impact of training models on external datasets such as COCO and EGO-CH. The experiments show that weakly supervised approaches requiring only a 2D point label related to the gaze can be an effective alternative to fully supervised approaches for attended object detection.


The EGO-CH-Gaze Dataset

To study the problem of weakly supervised attended object detection in cultural sites, we collected and labeled a dataset of egocentric images acquired from subjects visiting a cultural site. The dataset has been designed to offer a snapshot of the subject’s visual experience while visiting a museum and contains labels for several artworks and details attended by the subjects.


Data Annotation

We asked 7 subjects (aged between 24 and 40) to capture egocentric videos while visiting a cultural site containing 15 objects of interest. Videos have been acquired using a head-mounted Microsoft HoloLens 2 device in room V of the Palazzo Bellomo, located in Siracusa. In order to acquire the videos and the gaze coordinates, we developed a HoloLens 2 application based on Unity.

  • 7 subjects (aged between 24 and 40)
  • Video acquisition at 2272×1278 pixels, 30 fps
  • 178977 RGB frames with objects of interest annotated with bounding boxes
  • 2 acquisition modalities: guided tours and free tours
  • 15 objects of interest, 8 of which represent details of the artwork “Annunciazione”

Dataset at a glance:

  • Museum visitors: 7
  • RGB images: 220300
  • Egocentric videos: 14
  • Box annotations: 713100
  • Objects of interest: 15
  • Tour modalities: 2

Baselines & Proposed Approaches

Task

Unlike the standard object detection task, which aims to detect and recognize all the objects present in the scene, we define attended object detection as the task of detecting and recognizing only the object observed by the camera wearer, from the analysis of the RGB image and the gaze. Formally, let \(O = \{o_1, o_2, \ldots, o_n\}\) be the set of objects in the image and let \(C = \{c_1, c_2, \ldots, c_n\}\) be the corresponding set of object classes. Given an image, the proposed task consists of detecting the attended object \(o_{att} \in O\), predicting its bounding box coordinates \((x_{att}, y_{att}, w_{att}, h_{att})\) and assigning it the correct class label \(c_{att}\).

Full supervision

At this supervision level, we assume that bounding boxes around all objects, as well as the gaze, are available at training time, while only the gaze is available at test time. We consider a baseline approach based on an object detector trained to detect and recognize all objects in the image. At test time, the attended object is predicted by selecting the detected object whose center is closest to the estimated gaze point. In our experiments, we consider both Faster-RCNN and RetinaNet as detectors.
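To make the selection step concrete, the following is a minimal sketch (our own illustration, not the exact code used in the paper) of how the attended object can be chosen among the detector's outputs, assuming detections in (x, y, w, h) format:

```python
import numpy as np

def select_attended_object(boxes, labels, gaze_xy):
    """Pick the detection whose box center is closest to the gaze point.

    boxes:   (N, 4) array of detections in (x, y, w, h) format
    labels:  (N,) array of predicted class labels
    gaze_xy: (2,) estimated 2D gaze point in image coordinates
    """
    centers = boxes[:, :2] + boxes[:, 2:] / 2.0               # (x + w/2, y + h/2)
    distances = np.linalg.norm(centers - np.asarray(gaze_xy), axis=1)
    i = int(np.argmin(distances))                             # detection closest to the gaze
    return boxes[i], labels[i]
```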

Weak Supervision

At this supervision level, we assume that the gaze, plus some form of weak supervision on the attended object, is available at training time, while only the gaze is available at test time. We consider three versions of this supervision level, based on the “weakness” of the provided labels. The three versions are described in the paper.

Unsupervised

At this supervision level, no object labels are needed, which greatly reduces the annotation effort. We tested salient object detection methods as baselines: attended object detection is related to salient object detection, but the attended object is not always the most visually distinct one, and determining the right granularity of the attended object (e.g., a whole artwork vs. one of its details) is challenging without supervision. We also considered the Segment Anything Model (SAM), due to its recent success.
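As an illustration of how SAM can be used in this setting, the gaze point can serve as a point prompt, with a box fitted around the highest-scoring returned mask. The sketch below relies on the official segment-anything API; the checkpoint path and the box-fitting step are our assumptions, not necessarily the pipeline used in the paper.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Hypothetical checkpoint path; any official SAM checkpoint can be used.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

def attended_box_from_gaze(image_rgb, gaze_xy):
    """Prompt SAM with the gaze point and fit a box around the best mask."""
    predictor.set_image(image_rgb)                        # HxWx3 uint8 RGB image
    masks, scores, _ = predictor.predict(
        point_coords=np.array([gaze_xy], dtype=np.float32),
        point_labels=np.array([1]),                       # 1 = foreground point prompt
    )
    mask = masks[np.argmax(scores)]                       # highest-scoring mask
    ys, xs = np.nonzero(mask)
    x0, y0 = xs.min(), ys.min()
    return (x0, y0, xs.max() - x0, ys.max() - y0)         # (x, y, w, h)
```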

Proposed Approaches

Besides the considered baselines, we propose two new approaches to tackle attended object detection at different weak supervision levels, which offer the best trade-off between performance and required amount of supervision, as shown in our experiments. Specifically, we propose a box coordinates regressor, which can be trained when bounding boxes around attended objects are available, and a weakly supervised attended object detection approach, which can be used when only the attended object class is available as a form of supervision.

Box coordinates regressor.

In contrast to traditional object detection techniques, this method predicts the attended object's bounding box using the user's 2D gaze coordinates as input. It is trained exclusively on the attended object's box, which reduces the number of required labels. The approach uses a ResNet18 feature extractor, followed by convolutional modules that estimate the offset of the box center with respect to the gaze and the box dimensions. The model is trained with a Mean Squared Error loss. At test time, the predicted offset is added to the gaze position to obtain the box center, while the box extent is given by the predicted size vector. A separate module classifies the attended object; if it predicts the class "other", the object is discarded. This approach can be trained with gaze + attended object bounding box data, or with gaze + attended object bounding box + class data.
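A minimal PyTorch sketch of such a regressor follows. It assumes a torchvision ResNet18 backbone; the head design, the way the gaze is injected, and the coordinate normalization are simplified with respect to the paper, and all names are ours.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class BoxCoordinatesRegressor(nn.Module):
    """Regress the attended object's box from the image and the 2D gaze point."""

    def __init__(self):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")
        # Keep the convolutional trunk, dropping average pooling and the fc layer.
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        # Two small heads: one for the gaze-to-center offset, one for the box size.
        self.offset_head = nn.Sequential(nn.Conv2d(512, 2, 1), nn.AdaptiveAvgPool2d(1))
        self.size_head = nn.Sequential(nn.Conv2d(512, 2, 1), nn.AdaptiveAvgPool2d(1))

    def forward(self, image, gaze):
        f = self.features(image)                  # (B, 512, H/32, W/32)
        offset = self.offset_head(f).flatten(1)   # predicted (dx, dy) from the gaze
        size = self.size_head(f).flatten(1)       # predicted (w, h) of the attended box
        center = gaze + offset                    # box center = gaze + offset
        return center, size

# Training minimizes a Mean Squared Error loss against the ground-truth box, e.g.:
# loss = F.mse_loss(center, gt_center) + F.mse_loss(size, gt_size)
```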



Fully convolutional attended object detection.

The sliding window approach is limited by its slow inference speed, which is due to the evaluation of many windows per image. To address this, we adapted the ResNet18 patch classifier described in the paper: we replaced the Global Average Pooling operation with a 1×1 convolutional layer, enabling the model to predict a semantic segmentation mask for the entire image in a single forward pass. The initial coarse segmentation is then up-sampled to the original resolution to refine the mask, and the predicted bounding box of the attended object is derived using the box fitting method discussed in the paper. This approach, though faster than the sliding window one, incurs a performance decrease, potentially due to the domain shift between training on patches and evaluating on full images. To mitigate this issue, we fine-tuned the model by minimizing the average Kullback-Leibler (KL) divergence between the pixel-wise probability distributions predicted by the two models.
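The sketch below illustrates the idea in PyTorch (our own simplified version, assuming a torchvision ResNet18; the actual architecture and fine-tuning details are those described in the paper): a 1×1 convolution yields a coarse per-pixel class map, which is up-sampled to the input resolution, and fine-tuning minimizes the average pixel-wise KL divergence against the patch classifier's distributions.

```python
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class FullyConvClassifier(nn.Module):
    """ResNet18 made fully convolutional: a 1x1 convolution replaces the
    Global Average Pooling + fc classifier, so a single forward pass yields
    a coarse per-pixel class map for the whole image."""

    def __init__(self, num_classes):
        super().__init__()
        backbone = resnet18(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.classifier = nn.Conv2d(512, num_classes, kernel_size=1)

    def forward(self, x):
        logits = self.classifier(self.features(x))        # coarse class map
        # Up-sample the coarse map back to the input resolution.
        return F.interpolate(logits, size=x.shape[-2:],
                             mode="bilinear", align_corners=False)

def kl_finetune_loss(student_logits, teacher_probs):
    """Average pixel-wise KL divergence between the fully convolutional model's
    predicted distributions and the patch classifier's (teacher) distributions."""
    log_p = F.log_softmax(student_logits, dim=1)
    kl = F.kl_div(log_p, teacher_probs, reduction="none").sum(dim=1)  # per-pixel KL
    return kl.mean()
```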


Results

Results obtained by the compared approaches when all classes are considered, for each level of supervision (from the highest to the lowest). The best results in each supervision group are shown in bold.




Results obtained by the compared approaches when adding a pre-training step on similar-context datasets, with all classes considered, for each level of supervision (from the highest to the lowest). The best results in each supervision group are shown in bold.




Results obtained by the compared approaches when a generic "object" category is considered, for each level of supervision (from the highest to the lowest). The best results in each supervision group are shown in bold.

Downloads

  • Data: Frames
  • Data: Annotations
  • Code


Paper

M. Mazzamuto, F. Ragusa, A. Furnari, G. M. Farinella. Learning to Detect Attended Objects in Cultural Sites with Gaze Signals and Weak Object Supervision. 2023. Cite our paper: ArXiv.

COMING SOON...

Visit our page dedicated to First Person Vision Research for other related publications.


People

  • Michele Mazzamuto (FPV@IPLAB, Next Vision s.r.l.)
  • Francesco Ragusa (FPV@IPLAB, Next Vision s.r.l.)
  • Antonino Furnari (FPV@IPLAB, Next Vision s.r.l.)
  • Giovanni Maria Farinella (FPV@IPLAB, Next Vision s.r.l.)