Image-Based Localization with Simulated Egocentric Navigations


Santi Andrea Orlando1,2, Antonino Furnari1, Sebastiano Battiato1 and Giovanni Maria Farinella1

1Department of Mathematics and Computer Science, University of Catania, IT
2DWORD - Xenia Progetti S.r.l., Acicastello, Catania, IT



We present a tool to generate simulated egocentric data starting from a 3D model of a real building, useful for studying the problem of Image-Based Localization (IBL). In particular, our focus is on localizing users wearing an egocentric camera. The contributions of this work are:

  • a general methodology to create synthetic datasets automatically labelled with the 6 Degrees of Freedom (6DoF) pose of the camera, to study the IBL problem;
  • a large dataset of Simulated Egocentric Navigations collected considering a 3D model from the S3DIS Dataset (Area 3);
  • a benchmark study of 3DoF indoor localization to assess the effectiveness of our tool, considering an IBL pipeline based on metric learning with a triplet network and an image-retrieval procedure.

Results and Demo


A video example of our localization system in action is shown below. We performed this localization considering a path from the test set.



Table 1: Results related to the selection of the two thresholds Thxz and Thθ, when 10,000 training samples are considered to build the embedding space with the triplet network. For each considered value of Thθ, we marked in bold the best result. Each cell reports position error followed by orientation error.

Thxz \ Thθ   60°                               45°                               30°
2 m          0.73 m ± 1.79 m; 11.93° ± 22.06°  0.71 m ± 1.73 m; 12.17° ± 22.47°  0.71 m ± 1.82 m; 11.16° ± 21.72°
1 m          0.68 m ± 1.81 m; 11.52° ± 20.54°  0.72 m ± 1.88 m; 11.57° ± 21.93°  0.67 m ± 1.79 m; 10.34° ± 19.88°
0.75 m       0.64 m ± 1.68 m; 11.42° ± 20.97°  0.72 m ± 1.94 m; 11.59° ± 22.33°  0.67 m ± 1.72 m; 10.89° ± 21.21°
0.5 m        0.74 m ± 2.09 m; 11.59° ± 21.62°  0.69 m ± 1.86 m; 11.61° ± 21.98°  0.71 m ± 1.88 m; 11.65° ± 22.41°
0.25 m       0.64 m ± 1.71 m; 11.41° ± 21.62°  0.71 m ± 1.87 m; 11.65° ± 22.41°  0.76 m ± 2.06 m; 11.68° ± 22.65°


Table 2: Results obtained with training sets of different sizes.

Thθ   Thxz     Training Samples   Position Error     Orientation Error
60°   0.75 m   10000              0.64 m ± 1.68 m    11.42° ± 20.97°
               20000              0.50 m ± 1.37 m     8.83° ± 16.52°
               30000              0.46 m ± 1.21 m     8.71° ± 17.45°
               40000              0.42 m ± 1.21 m     7.85° ± 15.35°
45°   0.5 m    10000              0.69 m ± 1.86 m    11.61° ± 21.98°
               20000              0.47 m ± 1.27 m     8.58° ± 17.07°
               30000              0.44 m ± 1.13 m     8.05° ± 15.49°
               40000              0.40 m ± 1.13 m     7.72° ± 15.65°
30°   1 m      10000              0.67 m ± 1.79 m    10.34° ± 19.88°
               20000              0.56 m ± 1.68 m     8.86° ± 17.67°
               30000              0.48 m ± 1.37 m     8.01° ± 16.61°
               40000              0.43 m ± 1.25 m     7.49° ± 16.31°


Table 3: Percentage of test images with position error below 0.5 m and orientation error below 30°, for varying numbers of triplets used to train the network. The last row reports the average training time.

Thxz; Thθ \ Training Samples   10000    20000    30000    40000
0.75 m; 60°                    72.34%   78.92%   80.22%   82.20%
0.5 m; 45°                     72.26%   79.56%   81.03%   83.26%
1 m; 30°                       71.75%   76.70%   80.34%   82.47%
Avg training time (hours)       8.59    16.45    23.98    33.85
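The statistics reported in the tables above can be reproduced with a few lines of NumPy. This is our own sketch of the evaluation protocol, not the released code: position error is the Euclidean distance on the ground plane, and orientation error is the smallest absolute angular difference.

```python
import numpy as np

def angular_error(pred_deg, true_deg):
    """Smallest absolute difference between two angles, in degrees."""
    d = np.abs(pred_deg - true_deg) % 360.0
    return np.minimum(d, 360.0 - d)

def evaluate(pred_xz, true_xz, pred_theta, true_theta,
             pos_thresh=0.5, ang_thresh=30.0):
    """Mean/std of position and orientation errors, plus the percentage
    of samples within the (pos_thresh, ang_thresh) accuracy criterion."""
    pos_err = np.linalg.norm(pred_xz - true_xz, axis=1)   # metres
    ang_err = angular_error(pred_theta, true_theta)       # degrees
    within = np.mean((pos_err < pos_thresh) & (ang_err < ang_thresh))
    return (pos_err.mean(), pos_err.std(),
            ang_err.mean(), ang_err.std(), 100.0 * within)
```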


Dataset







We collected a large dataset of Simulated Egocentric Navigations using the tool we developed in Unity 3D. The dataset comprises 90 paths generated by simulating three virtual agents with different heights (1.5 m, 1.6 m, 1.7 m), 30 paths for each agent. During the navigation, we acquired data at 30 frames per second and, for each frame, we saved the RGB image, the depth map and the 6DoF pose of the camera. In this way, we obtained a large dataset without any manual labelling. We provide a 3DoF version of the dataset comprising labels and RGB images.

Each of the 90 random paths (30 per agent height: 150 cm, 160 cm, 170 cm) is performed by letting the agent reach 21 target points. For more details about our dataset, go to this page.
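As an illustration only, per-frame pose labels could be parsed as follows; the CSV layout (frame, x, y, z, rx, ry, rz) is an assumption of ours, not necessarily the released format.

```python
import csv
from typing import List, NamedTuple

class Pose(NamedTuple):
    frame: int
    x: float; y: float; z: float      # camera position (m)
    rx: float; ry: float; rz: float   # camera rotation (degrees)

def load_poses(path: str) -> List[Pose]:
    """Read one 6DoF camera pose per line from a CSV file."""
    with open(path, newline="") as f:
        return [Pose(int(r[0]), *map(float, r[1:7]))
                for r in csv.reader(f)]
```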


You can download the dataset at this link.


Method





We propose a tool for Unity 3D to generate synthetic datasets to study the IBL problem. The proposed pipeline comprises two main steps:
  1. Virtualization of the environment
  2. Simulation and generation of the dataset
We propose to acquire a 3D model of a real indoor environment and use it to simulate navigations inside the virtual environment.

We benchmarked 3DoF localization considering an image-retrieval pipeline and using a Triplet Network with an Inception V3 backbone to learn useful representations. We trained the model to learn a representation space and performed a K-Nearest Neighbor search to assign a 3DoF label to a new query image. At training time, we fed the model with triplets generated by sampling 10000, 20000, 30000 and 40000 images from the training set.
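As a sketch of how the two thresholds Thxz and Thθ can drive triplet sampling (our reconstruction, assuming a frame counts as a positive of the anchor when its ground-plane position is within Thxz and its orientation within Thθ):

```python
import numpy as np

def mine_triplet(xz, theta, th_xz=0.75, th_theta=60.0, rng=None):
    """Return (anchor, positive, negative) indices, or None when the
    sampled anchor has no valid positive or negative."""
    rng = rng or np.random.default_rng()
    n = len(xz)
    a = rng.integers(n)
    # Distance on the ground plane and smallest angular difference
    d_pos = np.linalg.norm(xz - xz[a], axis=1)
    d_ang = np.abs((theta - theta[a] + 180.0) % 360.0 - 180.0)
    close = (d_pos <= th_xz) & (d_ang <= th_theta)
    pos = np.flatnonzero(close & (np.arange(n) != a))
    neg = np.flatnonzero(~close)
    if len(pos) == 0 or len(neg) == 0:
        return None
    return a, rng.choice(pos), rng.choice(neg)
```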

At test time, we used the images sampled from the test set to extract the representations learned during training and, for each image, we assigned the label of the nearest neighbor image in the training set.
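The retrieval step can be sketched as follows (function and variable names are ours; the embeddings are abstracted as precomputed vectors produced by the trained network):

```python
import numpy as np

def knn_localize(query_emb, train_emb, train_labels):
    """Assign each query the 3DoF label of its nearest training image
    in the learned embedding space (1-NN, Euclidean distance)."""
    # Pairwise distances: (num_queries, num_train)
    d = np.linalg.norm(query_emb[:, None, :] - train_emb[None, :, :], axis=2)
    nearest = d.argmin(axis=1)
    return train_labels[nearest]
```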



Tool

The tool for the generation of the dataset is described and available at this link.


Code

An example of how to use a learned model is available at this link.


Paper

S. A. Orlando, A. Furnari, S. Battiato, G. M. Farinella - Image Based Localization with Simulated Egocentric Navigations. Submitted to the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP), 2019.

Acknowledgement

This research is supported by DWORD - Xenia Progetti s.r.l. and Piano della Ricerca 2016-2018 linea di Intervento 2 of DMI, University of Catania.


Related Work

Please visit our page dedicated to First Person Vision Research @ IPLAB.

People