Egocentric Visitor Localization and Artwork Detection in Cultural Sites Using Synthetic Data


Santi Andrea Orlando1, 2, **, Antonino Furnari1, ** and Giovanni Maria Farinella1, 3, **

1Department of Mathematics and Computer Science, University of Catania, IT
2DWORD - Xenia S.r.l., Acicastello, Catania, IT
3National Research Council, ICAR-CNR, Palermo, IT

**These authors are co-first authors and contributed equally to this work.


Bellomo Dataset


The dataset has been generated using the new tool for Unity 3D proposed in our paper:
S. A. Orlando, A. Furnari, G. M. Farinella - Egocentric Visitor Localization and Artwork Detection in Cultural Sites Using Synthetic Data. Submitted to Pattern Recognition Letters, special issue on Pattern Recognition and Artificial Intelligence Techniques for Cultural Heritage, 2020.

The tool used to generate the dataset is available at this link.

The dataset has been generated using the "Galleria Regionale Palazzo Bellomo" 3D model scanned with Matterport. By using this model, we simulated egocentric navigations of 4 navigations with a first clockwise navigation and a second counterclowise navigation in accord to the room layout of the museum. In each room the virtual agent have to visit 5 observation points situated in front of each artworks of the museum.
We acquired 99,769 images at 5 fps. The tool automatically labels each frame with the 6 Degrees of Freedom (6DoF) of the camera: 1) the camera position (x, y, z) and 2) the camera orientation as a quaternion (w, p, q, r).
Specifically, we converted the 6DoF camera pose to a 3DoF format (see the sketch after the list below) by considering:

  1. the camera position, represented by the coordinates x and z according to the left-handed coordinate system of Unity;
  2. the camera orientation, represented by the vector (u, v), which encodes the rotation angle around the y-axis.
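The conversion from the quaternion to the (u, v) heading vector can be sketched as follows. This is a minimal Python sketch, assuming (p, q, r) are the imaginary parts of a unit quaternion and (u, v) is the ground-plane direction of the camera's forward axis; the exact convention used by the tool may differ.

```python
import numpy as np

def pose_6dof_to_3dof(x, y, z, w, p, q, r):
    """Convert a Unity 6DoF camera pose into the 3DoF format (x, z, u, v).

    Assumption: the heading is the rotation of the camera's forward (+z)
    axis around Unity's y-axis (up); the height y is dropped.
    """
    # Heading angle of the forward axis projected onto the ground plane.
    yaw = np.arctan2(2.0 * (p * r + w * q), 1.0 - 2.0 * (p * p + q * q))
    u, v = np.sin(yaw), np.cos(yaw)  # unit vector encoding the rotation angle
    return x, z, u, v
```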



This figure shows an example of the navigation scheme performed using the second version of the tool.

Navigation     1        2        3        4        Overall
# of Frames    24,525   25,003   26,281   23,960   99,769


To evaluate the quality of the proposed dataset in terms of usefulness for the localization task, for each test sample we perform an optimal nearest neighbor search, associating it to the closest pose in the training set (see the sketch below).
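As a reference, this search can be implemented as follows. This is a minimal sketch, assuming the poses are compared on their (x, z) positions with a k-d tree; the array names are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical arrays: train_xz has shape (N, 2), test_xz has shape (M, 2),
# holding the (x, z) coordinates of the training and test poses.
def nearest_train_pose(train_xz, test_xz):
    """Associate each test sample with its closest pose in the training set."""
    tree = cKDTree(train_xz)          # index the training positions
    dists, idx = tree.query(test_xz)  # exact 1-nearest-neighbor search
    return dists, idx                 # distances and indices of matched poses
```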
We also used the tool to label 16 artworks to experiment with artwork detection techniques in cultural sites. Examples of RGB frames with the respective semantic mask for each artwork are shown below.

This figure reports, for each artwork, an example RGB frame and the respective semantic mask.

List of artworks labeled in the Bellomo dataset, with the related coordinates in the 3D model and the RGB color codes used to produce the semantic masks.


You can download the dataset at this link.

Training

We considered as training set the frames of the 2nd and 3rd navigations. The training set has been divided into subsets containing 25%, 50%, 75% and 100% of the frames. We cast Image Based Localization (IBL) as an image retrieval problem and use a Triplet Network to learn a suitable representation space for images. The network is trained using triplets, each comprising: 1) an anchor frame I; 2) a similar image I+; and 3) a dissimilar image I-. The triplets have been generated using a threshold of 0.5 m on the Euclidean distance and a threshold of 45° on the orientation distance. We trained the network for 100 epochs, one model for each subset. We considered as test set the frames of the 1st navigation. We investigated temporal smoothing techniques by applying three different filters to the predictions: mean, median and 25% trimmed mean (see the sketch below).
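The three temporal filters can be sketched as follows. This is a minimal Python sketch, assuming the predicted (x, z) positions are smoothed over a sliding window; the window size and the trimming convention (25% cut from each tail) are assumptions, as the text does not specify them.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d, median_filter
from scipy.stats import trim_mean

def smooth_positions(xz, win=11):
    """Apply mean, median and 25% trimmed-mean temporal smoothing.

    xz: array of shape (T, 2) with the predicted (x, z) position per frame.
    win: sliding window length (assumed; not specified in the text).
    """
    mean_s = uniform_filter1d(xz, size=win, axis=0)    # mean filter
    median_s = median_filter(xz, size=(win, 1))        # median filter
    trimmed_s = np.stack([                             # 25% trimmed mean
        trim_mean(xz[max(0, t - win // 2): t + win // 2 + 1], 0.25, axis=0)
        for t in range(len(xz))
    ])
    return mean_s, median_s, trimmed_s
```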



Stanford Dataset


The dataset has been generated using the tool for Unity 3D proposed in our paper:
S. A. Orlando, A. Furnari, S. Battiato, G. M. Farinella - Image Based Localization with Simulated Egocentric Navigations. In International Conference on Computer Vision Theory and Applications (VISAPP), 2019.

The tool is available at this link.

The dataset has been generated using the "Area 3" 3D model from the S3DIS Dataset. By using this model, we simulated egocentric navigations of 90 paths, 30 paths for each height of the virtual agents that performed the navigations [150cm, 160cm, 170cm].
We acquired 886,823 images at 30 fps. The tool automatically labels each frame with the 6 Degrees of Freedom (6DoF) of the camera: 1) the camera position (x, y, z) and 2) the camera orientation as a quaternion (w, p, q, r).
We converted the 6DoF camera pose to a 3DoF format, following the same conversion sketched above, by considering:

  1. the camera position, represented by the coordinates x and z according to the left-handed coordinate system of Unity;
  2. the camera orientation, represented by the vector (u, v), which encodes the rotation angle around the y-axis.





Agent height   1.5 m     1.6 m     1.7 m     Overall
# of Frames    301,757   296,164   288,902   886,823


Training

We split our dataset into three parts in order to train a CNN:

  • the Training set contains the images of the three agents from Path0 to Path17;
  • the Validation set contains the images of the three agents from Path18 to Path23;
  • the Test set contains the images of the three agents from Path24 to Path29.


We cast Image Based Localization (IBL) as an image retrieval problem and use a Triplet Network to learn a suitable representation space for images. The network is trained using triplets, each comprising: 1) an anchor frame I; 2) a similar image I+; and 3) a dissimilar image I-. A minimal sketch of the training objective follows.
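This is a sketch of a standard triplet margin loss on the embedded frames; the margin value is an assumption, as the text does not specify it.

```python
import torch
import torch.nn.functional as F

def triplet_loss(emb_anchor, emb_pos, emb_neg, margin=0.2):
    """Pull the anchor towards the similar image I+ and push it away
    from the dissimilar image I- by at least `margin` (assumed value)."""
    d_pos = F.pairwise_distance(emb_anchor, emb_pos)  # distance to I+
    d_neg = F.pairwise_distance(emb_anchor, emb_neg)  # distance to I-
    return F.relu(d_pos - d_neg + margin).mean()
```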
To generate the triplets, for each frame we studied different thresholds for the Euclidean distance (2 m, 1 m, 0.75 m, 0.5 m, 0.25 m) and for the orientation distance (60°, 45°, 30°). We sampled 5000 triplets for the validation set, while for the training set we generate the triplets at training time (at each epoch). We sampled different numbers of images (10000, 20000, 30000, 40000) from the training set in order to study the effect of the training set size. A triplet sampling sketch is shown below.
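This is a minimal sketch of threshold-based triplet sampling, assuming each frame is described by its (x, z) position and its heading in degrees; the function and array names are hypothetical.

```python
import numpy as np

def sample_triplet(positions, yaws, anchor, pos_thr=0.5, ang_thr=45.0, rng=None):
    """Sample (positive, negative) frame indices for a given anchor frame.

    positions: (N, 2) array of (x, z) coordinates; yaws: (N,) headings in
    degrees. A frame is similar to the anchor when both its Euclidean
    distance and its orientation distance fall below the thresholds.
    """
    rng = rng or np.random.default_rng()
    dists = np.linalg.norm(positions - positions[anchor], axis=1)
    # Orientation difference wrapped to the range [0, 180] degrees.
    angs = np.abs((yaws - yaws[anchor] + 180.0) % 360.0 - 180.0)
    similar = (dists < pos_thr) & (angs < ang_thr)
    similar[anchor] = False                        # exclude the anchor itself
    positive = rng.choice(np.flatnonzero(similar))
    negative = rng.choice(np.flatnonzero(~similar))
    return positive, negative
```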

You can download the Stanford dataset at this link.


People