Virtual to Real Unsupervised Domain Adaptation for Image-Based Localization in Cultural Sites


Santi Andrea Orlando1, 2, **, Antonino Furnari1, ** and Giovanni Maria Farinella1, 3, **

1Department of Mathematics and Computer Science, University of Catania, IT
2DWORD - Xenia S.r.l., Acicastello, Catania, IT
3National Research Council, ICAR-CNR, Palermo, IT

**These authors are co-first authors and contributed equally to this work.



Virtual dataset - Bellomo Dataset


The dataset has been generated using the tool proposed in the paper:

S. A. Orlando, A. Furnari, G. M. Farinella - Egocentric Visitor Localization and Artwork Detection in Cultural Sites Using Synthetic Data. Submitted to Pattern Recognition Letters, special issue on Pattern Recognition and Artificial Intelligence Techniques for Cultural Heritage, 2020.

The dataset has been generated using the "Galleria Regionale Palazzo Bellomo" 3D model scanned with Matterport, and includes 4 simulated egocentric navigations. Each frame has been labeled with the 3DOF camera pose and the room in which the virtual agent is located during the acquisition. The museum is divided into 11 contexts. The figure below shows the map of the museum.

Fig. 2: Map of Palazzo Bellomo with marked contexts.

We acquired the videos at 5 fps for a total of 99,769 images. The 3DOF camera pose is defined in this format:

  1. the camera position, represented by the coordinates x and z according to the left-handed coordinate system of Unity;
  2. the camera orientation vector (u, v), which represents the rotation angle about the y-axis.
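
The pose format above can be illustrated with a small sketch. The mapping (u, v) = (cos θ, sin θ) used here is an assumption made for illustration only; the dataset may follow a different convention (for example swapping the two components or measuring the angle from a different axis). The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose3DoF:
    x: float  # position along the x-axis of the Unity coordinate system
    z: float  # position along the z-axis of the Unity coordinate system
    u: float  # first component of the orientation vector
    v: float  # second component of the orientation vector


def pose_from_angle(x, z, yaw_deg):
    """Build a pose from a position and a rotation angle about the y-axis.

    Assumption: (u, v) = (cos(yaw), sin(yaw)). This convention is not
    confirmed by the dataset description.
    """
    yaw = math.radians(yaw_deg)
    return Pose3DoF(x, z, math.cos(yaw), math.sin(yaw))


def yaw_degrees(pose):
    """Recover the rotation angle in [0, 360) from the orientation vector."""
    return math.degrees(math.atan2(pose.v, pose.u)) % 360.0
```

Encoding the angle as a two-component vector avoids the jump between 359° and 0° that a raw angle would have, which is one plausible reason for choosing this representation.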



Navigation     1        2        3        4        Overall
# of Frames    24,525   25,003   26,281   23,960   99,769

Table 1: Details of the Palazzo Bellomo dataset.
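
As a quick sanity check on Table 1, the frame counts can be converted into approximate navigation durations. This sketch assumes that the frames of each navigation are consecutive samples taken at a constant 5 fps, so that duration equals frames divided by frame rate; the variable names are illustrative.

```python
FPS = 5  # acquisition rate stated in the text

# Frame counts per navigation, copied from Table 1.
frames_per_navigation = {1: 24525, 2: 25003, 3: 26281, 4: 23960}

total_frames = sum(frames_per_navigation.values())

# Assumption: constant frame rate, so duration = frames / fps.
minutes_per_navigation = {
    nav: count / FPS / 60 for nav, count in frames_per_navigation.items()
}
total_hours = total_frames / FPS / 3600

for nav, minutes in minutes_per_navigation.items():
    print(f"navigation {nav}: {minutes:.1f} min")
print(f"total: {total_frames} frames, {total_hours:.2f} h")
```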




You can download the Bellomo Virtual Dataset at this link

Training

We used the frames of the 2nd and 3rd navigations as the training set and the frames of the 1st navigation as the test set. We cast Image-Based Localization (IBL) as an image retrieval problem and use a Triplet Network to learn a suitable representation space for images. The network is trained using triplets, each comprising: 1) the anchor frame I; 2) a similar image I+; and 3) a dissimilar image I-. The triplets have been generated using a threshold of 0.5 m on the Euclidean distance and a threshold of 45° on the orientation distance. We trained one model for each subset, each for 100 epochs. As a domain adaptation technique, we also trained the network on different mid-level representations. The other domain adaptation method we studied is unpaired image-to-image translation using CycleGAN and ToDayGAN.
The dataset used as the real domain is described below, together with examples of the extracted mid-level representations (Fig. 3).
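
The triplet selection rule and the retrieval step can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that an image counts as similar to the anchor only when both thresholds are satisfied, that poses are tuples (x, z, u, v), and that a common margin-based triplet loss on Euclidean distances is used; the margin value and all function names are hypothetical.

```python
import math

POS_THRESHOLD_M = 0.5        # Euclidean distance threshold from the text
ANGLE_THRESHOLD_DEG = 45.0   # orientation distance threshold from the text


def position_distance(p, q):
    """Euclidean distance between the (x, z) positions of two poses."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def orientation_distance(p, q):
    """Angle in degrees between the (u, v) orientation vectors of two poses."""
    dot = p[2] * q[2] + p[3] * q[3]
    norm = math.hypot(p[2], p[3]) * math.hypot(q[2], q[3])
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def is_similar(anchor, other):
    """Assumption: a positive must satisfy both thresholds at once."""
    return (position_distance(anchor, other) <= POS_THRESHOLD_M
            and orientation_distance(anchor, other) <= ANGLE_THRESHOLD_DEG)


def triplet_loss(f_anchor, f_pos, f_neg, margin=0.2):
    """Margin-based triplet loss on embeddings (margin is an assumed value)."""
    d_pos = math.dist(f_anchor, f_pos)
    d_neg = math.dist(f_anchor, f_neg)
    return max(0.0, d_pos - d_neg + margin)


def localize(query_embedding, database):
    """Retrieval step: return the pose of the nearest database embedding.

    `database` is a list of (embedding, pose) pairs from the labeled set.
    """
    _, pose = min(database, key=lambda item: math.dist(query_embedding, item[0]))
    return pose
```

At test time only `localize` is needed: the query image is embedded by the trained network and assigned the pose of its nearest neighbour among the labeled images.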

Fig. 3: Examples of mid-level representations extracted with the corresponding RGB images. We report two pairs of virtual and real images.



Real Dataset - EGO-CH


This dataset has been collected with Microsoft HoloLens devices worn by volunteers during their visit to the museum. The dataset has been published in this work:

F. Ragusa, A. Furnari, S. Battiato, G. Signorello, G. M. Farinella. EGO-CH: Dataset and Fundamental Tasks for Visitors Behavioral Understanding using Egocentric Vision. Pattern Recognition Letters - Special Issue on Pattern Recognition and Artificial Intelligence Techniques for Cultural Heritage, 2020.

A subset of the dataset is used to benchmark virtual-to-real domain adaptation; in particular, we used the 10 test videos. We extracted the frames at 5 fps and excluded those of the ground floor, which is not available in the virtual dataset. The extracted images total 12,008 and are labeled with the 11 rooms of the museum.









Training

We split the dataset into training, validation and test sets as follows:

  • the Training set contains the images of the videos from 1 to 6;
  • the Validation set contains the images of the videos 9 and 10;
  • the Test set contains the images of the videos 7 and 8.
The table below summarizes the number of images in each set.

               Training set   Validation set   Test set   Overall
# of Frames    24,357         10,288           12,008     46,653
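
The split above can be expressed as a lookup from video number to set name. This is a small sketch with hypothetical names; the video numbers and frame counts are those listed above.

```python
# Video-to-split assignment as listed above.
SPLIT_BY_VIDEO = {
    **{video: "training" for video in range(1, 7)},   # videos 1 to 6
    7: "test",
    8: "test",
    9: "validation",
    10: "validation",
}

# Frame counts per split, copied from the table.
FRAMES_PER_SPLIT = {"training": 24357, "validation": 10288, "test": 12008}


def split_of(video_id):
    """Return the split a video belongs to; raises KeyError for unknown ids."""
    return SPLIT_BY_VIDEO[video_id]


# The three splits add up to the overall count reported in the table.
assert sum(FRAMES_PER_SPLIT.values()) == 46653
```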


Each frame is labeled with one of the 11 rooms of the museum.
We used a structure-from-motion algorithm to obtain the 3DOF labels. We used a subset of the rooms, including 1,766 frames for Sala 5, 1,597 frames for Sala 7, 1,570 frames for Sala 9, and 837 frames for Sala 13. With the "geo-registration" function, we aligned the COLMAP 3D reconstruction with the 3D model of the virtual dataset; the process returned a total of 932 labeled images.
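
Aligning a structure-from-motion reconstruction with a reference model amounts to estimating a similarity transform (scale, rotation, translation) between the two coordinate frames. The sketch below illustrates the idea in 2D on the floor plane with a closed-form least-squares fit; it is not COLMAP's implementation, which operates in 3D. It assumes that point correspondences between the two frames are available and that any axis flip between the two coordinate conventions has already been handled. The function names are hypothetical.

```python
def fit_similarity_2d(src, dst):
    """Least-squares fit of dst ~ a * src + b using complex numbers.

    `a` encodes scale and rotation, `b` the translation. `src` and `dst`
    are equally long lists of (x, y) pairs in corresponding order.
    """
    zs = [complex(x, y) for x, y in src]
    ws = [complex(x, y) for x, y in dst]
    n = len(zs)
    z_mean = sum(zs) / n
    w_mean = sum(ws) / n
    num = sum((w - w_mean) * (z - z_mean).conjugate() for z, w in zip(zs, ws))
    den = sum(abs(z - z_mean) ** 2 for z in zs)
    a = num / den
    b = w_mean - a * z_mean
    return a, b


def apply_similarity_2d(a, b, point):
    """Map a point from the source frame into the destination frame."""
    w = a * complex(*point) + b
    return (w.real, w.imag)


# Usage: recover a known transform (scale 2, rotation 90 degrees, shift (3, 4)).
src_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst_points = [(3.0, 4.0), (3.0, 6.0), (1.0, 4.0)]
a, b = fit_similarity_2d(src_points, dst_points)
print(abs(a), apply_similarity_2d(a, b, (1.0, 1.0)))
```

Once the transform is known, every camera position estimated by structure from motion can be mapped into the coordinate frame of the virtual model, which is what makes the real images comparable with the virtual 3DOF labels.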

You can download the EGO-CH Real Dataset at this link

You can download the 3DOF labeled dataset at this link


Paper

S. A. Orlando, A. Furnari, G. M. Farinella - Virtual to Real Unsupervised Domain Adaptation for Image-Based Localization in Cultural Sites - In Fourth IEEE International Conference on Image Processing, Applications and Systems (IPAS), 2020. Download.

Supplementary Material

Download

Acknowledgement

This research is supported by XENIA Progetti - DWORD, by the project VALUE - Visual Analysis for Localization and Understanding of Environments (N. 08CT6209090207, CUP G69J18001060007) granted by PO FESR 2014/2020 - Azione 1.1.5 - "Sostegno all’avanzamento tecnologico delle imprese attraverso il finanziamento di linee pilota e azioni di validazione precoce dei prodotti e di dimostrazioni su larga scala", and by Piano della Ricerca 2016-2018 linea di Intervento 2 of DMI, University of Catania. The authors would like to thank Regione Siciliana Assessorato dei Beni Culturali e dell'Identità Siciliana - Dipartimento dei Beni Culturali e dell'Identità Siciliana and Polo regionale di Siracusa per i siti culturali - Galleria Regionale di Palazzo Bellomo.

People