Santi Andrea Orlando1, 2, **, Antonino Furnari1, ** and Giovanni Maria Farinella1, 3, **
The dataset has been generated using the NEW tool for Unity 3D proposed in our paper:
S. A. Orlando, A. Furnari, G. M. Farinella - Egocentric Visitor Localization and Artwork Detection in Cultural Sites Using Synthetic Data.
Submitted to Pattern Recognition Letters, special issue on Pattern Recognition and Artificial Intelligence Techniques for Cultural Heritage, 2020.
The tool used to generate the dataset is available at this link.
The dataset has been generated using the "Galleria Regionale Palazzo Bellomo" 3D model scanned with Matterport.
By using this model, we simulated 4 egocentric navigations, performed alternately in clockwise and counterclockwise order according to the room layout of the museum.
In each room, the virtual agent visits 5 observation points located in front of the artworks of the museum.
We acquired 99,769 images at 5 fps.
The tool automatically labels each frame according to the 6 Degrees of Freedom (6DoF) of the camera: 1) camera position (x, y, z) and 2) camera orientation in quaternions (w, p, q, r).
Specifically, we converted the 6DoF camera pose into a 3DoF format, taking into consideration:
| Navigation | 1 | 2 | 3 | 4 | Overall |
|---|---|---|---|---|---|
| # of Frames | 24,525 | 25,003 | 26,281 | 23,960 | 99,769 |
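As a minimal sketch of the 6DoF-to-3DoF reduction described above, the planar position (x, z) can be kept and a single yaw angle extracted from the quaternion (w, p, q, r). The function name is hypothetical, and the formula assumes y is the vertical axis, as in Unity's left-handed coordinate system.

```python
import math

def pose_6dof_to_3dof(x, y, z, w, p, q, r):
    """Reduce a 6DoF camera pose to 3DoF: planar position (x, z)
    plus a single yaw angle (rotation about the vertical y axis).

    Quaternion components follow the (w, p, q, r) order used in the
    dataset labels; the vertical-axis convention is an assumption.
    """
    # Yaw about the y axis, recovered from the quaternion components.
    yaw = math.atan2(2.0 * (w * q + p * r), 1.0 - 2.0 * (q * q + p * p))
    return x, z, yaw
```

For a pure rotation of 90° about the vertical axis, the function returns the unchanged planar position together with a yaw of π/2.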
We used the frames of the 2nd and 3rd navigations as training set. The training set has been divided into subsets containing 25%, 50%, 75%, and 100% of the frames.

We cast Image Based Localization (IBL) as an image retrieval problem and used a Triplet Network to learn a suitable representation space for images. The network is trained using triplets, each comprising: 1) an anchor frame I; 2) a similar image I+; and 3) a dissimilar image I-. The triplets have been generated using a threshold of 0.5 m on the Euclidean distance between camera positions and a threshold of 45° on the orientation distance. We trained the network for 100 epochs, one model for each subset.

We used the frames of the 1st navigation as test set. We investigated temporal smoothing techniques by applying three different filters: mean, median, and 25% trimmed mean.
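The similarity criterion used to mine triplets can be sketched as follows, using the two thresholds stated above (0.5 m on position, 45° on orientation). The function and constant names are hypothetical, introduced here only for illustration.

```python
import math

# Thresholds from the text: 0.5 m Euclidean distance, 45 deg orientation.
POS_THRESHOLD_M = 0.5
ORI_THRESHOLD_DEG = 45.0

def is_positive_pair(pos_a, pos_b, yaw_a_deg, yaw_b_deg):
    """Return True if frame b is a valid positive (I+) for anchor a,
    i.e. both the position and the orientation distances fall below
    their thresholds. Frames failing either test can serve as I-."""
    dist = math.dist(pos_a, pos_b)  # Euclidean distance in metres
    # Wrapped angular difference in [0, 180] degrees.
    dyaw = abs((yaw_a_deg - yaw_b_deg + 180.0) % 360.0 - 180.0)
    return dist <= POS_THRESHOLD_M and dyaw <= ORI_THRESHOLD_DEG
```

The angle difference is wrapped so that, for instance, 350° and 10° are correctly treated as 20° apart rather than 340°.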
The dataset has been generated using the tool for Unity 3D proposed in our paper:
S. A. Orlando, A. Furnari, S. Battiato, G. M. Farinella - Image Based Localization with Simulated Egocentric Navigations.
In International Conference on Computer Vision Theory and Applications (VISAPP), 2019.
The tool is available at this link.
The dataset has been generated using the "Area 3" 3D model from the S3DIS Dataset. By using this model, we simulated 90 egocentric navigation paths, 30 paths for each height of the virtual agent performing the navigations (150 cm, 160 cm, 170 cm).
We acquired 886,823 images at 30 fps.
The tool automatically labels each frame according to the 6 Degrees of Freedom (6DoF) of the camera: 1) camera position (x, y, z) and 2) camera orientation in quaternions (w, p, q, r).
We converted the 6DoF camera pose into a 3DoF format, taking into consideration:
| Agent height | 1.5 m | 1.6 m | 1.7 m | Overall |
|---|---|---|---|---|
| # of Frames | 301,757 | 296,164 | 288,902 | 886,823 |
We split our dataset into three parts in order to train a CNN: