Virtual to Real Unsupervised Domain Adaptation for Image-Based Localization in Cultural Sites


Santi Andrea Orlando1, 2, **, Antonino Furnari1, ** and Giovanni Maria Farinella1, 3, **

1Department of Mathematics and Computer Science, University of Catania, IT
2DWORD - Xenia S.r.l., Acicastello, Catania, IT
3National Research Council, ICAR-CNR, Palermo, IT

**These authors are co-first authors and contributed equally to this work.



We propose an unpaired domain adaptation benchmark for virtual-to-real image-based localization in cultural sites. We focus on using image-to-image translation and mid-level representations to bridge the domain gap when localizing First Person Vision navigations in Galleria Regionale Palazzo Bellomo, located in Siracusa, Italy. The contributions of this work are the following:

  • we benchmark two methodologies to study unpaired domain adaptation for image-based localization in cultural sites, using virtual and real datasets collected in the same environment;
  • we propose a dataset containing both virtual and real egocentric navigations, collected from a 3D model (virtual) and with HoloLens (real);
  • we present a benchmark study of context-aware and 3DOF indoor localization to assess the effectiveness of the benchmarked domain adaptation methods, which use image-to-image translation and mid-level representations to train a CNN for metric learning with a triplet network and an image-retrieval procedure.

Results

Room-Based Localization


Table 1: Room-based localization results considering different combinations of RGB images and mid-level representations. Results are reported with and without image-to-image translation.


Table 2: Accuracy and F1 scores obtained by the compared methods on each class.
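As a reading aid for the per-class scores, the following minimal sketch shows one common way to compute precision, recall and F1 for each room from predicted and ground-truth labels. The function name `per_class_scores` and the toy labels are illustrative assumptions. The page does not state the exact definitions used in the paper, so treating per-class accuracy as per-class recall is also an assumption.

```python
def per_class_scores(y_true, y_pred):
    """Compute precision, recall and F1 for every class in the labels.

    Assumption: per-class accuracy is read as recall, i.e. the fraction of
    frames of a room that are assigned to that room.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Toy example with hypothetical predictions for two rooms.
y_true = ["Sala 5", "Sala 5", "Sala 7", "Sala 7"]
y_pred = ["Sala 5", "Sala 7", "Sala 7", "Sala 7"]
scores = per_class_scores(y_true, y_pred)
```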


3DOF Camera Pose Localization

Table 3: 3DOF localization results considering different combinations of RGB images and mid-level representations. Results are reported with and without image-to-image translation.


Table 4: Summary of 3DOF localization results.



Datasets

- Virtual dataset - Bellomo Dataset


For the virtual domain, we used a dataset collected from the 3D model of the cultural site Palazzo Bellomo, acquired with a Matterport 3D scanner. The dataset comprises 4 simulated navigations that reproduce visits inside the museum. The videos were acquired at 5 frames per second. We used the 3DOF camera pose and the current room (context) to study image-based localization and classification over the 11 contexts of the museum.

For more details about this dataset go to this page.


You can download the Bellomo dataset at this link.


- Real dataset - EGO-CH


The real images of the same cultural site come from the EGO-CH dataset. These data were collected by visitors wearing a Microsoft HoloLens device. We used 10 video sequences labeled with the room location. Frames were extracted at 5 frames per second, keeping only those acquired on the 1st floor of the museum.
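The extraction at 5 frames per second can be sketched as a simple index subsampling. This is only an illustration: the helper `subsample_indices` is hypothetical, and the native frame rate of the recorded videos is not given on this page, so the 30 fps used in the example is an assumption.

```python
def subsample_indices(num_frames, source_fps, target_fps=5.0):
    """Return the indices of the frames to keep to obtain about target_fps.

    source_fps is the native frame rate of the video (an assumption here,
    since the page does not report it).
    """
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    if target_fps >= source_fps:
        # Nothing to drop: the video is already at or below the target rate.
        return list(range(num_frames))
    step = source_fps / target_fps  # source frames per kept frame, > 1
    indices = []
    k = 0
    while int(k * step) < num_frames:
        indices.append(int(k * step))
        k += 1
    return indices


# One second of a hypothetical 30 fps video reduced to 5 fps.
kept = subsample_indices(num_frames=30, source_fps=30.0, target_fps=5.0)
```

Selecting only the frames of the 1st floor would then be a filter on the room label of each kept frame.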

For more details about this dataset go to this page.


You can download the dataset at this link.


Fig. 1: Examples of extracted mid-level representations with the corresponding RGB images.

Method



We propose a dataset to study virtual-to-real unpaired domain adaptation for image-based localization in cultural sites. The proposed domain adaptation strategies are the following:
  1. Using mid-level representations and combining their information with RGB images;
  2. Performing image-to-image translation with CycleGAN and ToDay;
  3. Combining these methods to further improve the results.
We use two datasets: data generated from the 3D model of a real cultural heritage site serve as the source (virtual) domain, and a subset of the EGO-CH dataset serves as the target (real) domain. The experiments address context-aware and 3DoF localization in an indoor environment.

Our image-based localization approach consists of an image-retrieval pipeline in which a Triplet Network with an Inception V3 backbone learns representations useful for the indoor localization task. We trained a network for each representation (2D/3D edges, 2D/3D keypoints and depth (i.e., Euclidean distance)) extracted with the pretrained Taskonomy models. At inference time, we used a 1-nearest-neighbor search to assign a 3DoF label and a context to each query image of the real domain, using the images of the virtual domain as the search space.
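The two ingredients of this pipeline, a triplet objective for metric learning and a 1-nearest-neighbor lookup at inference time, can be illustrated with the minimal sketch below. It operates on plain lists standing in for the embeddings produced by the network. The margin-based form of the loss, the margin value, the function names, and the toy embeddings and labels are all assumptions made for illustration; the page does not specify these details.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss (assumed formulation).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


def nearest_neighbor_label(query, gallery):
    """Return the label of the gallery embedding closest to the query.

    gallery: list of (embedding, label) pairs computed on virtual images.
    """
    _, label = min(gallery, key=lambda item: euclidean(query, item[0]))
    return label


# Hypothetical 2-D embeddings of virtual images with their labels.
gallery = [
    ([0.0, 0.0], {"room": "Sala 5", "pose": (1.0, 2.0, 90.0)}),
    ([1.0, 1.0], {"room": "Sala 7", "pose": (4.0, 0.5, 180.0)}),
]
query = [0.9, 0.8]  # hypothetical embedding of a real query image
predicted = nearest_neighbor_label(query, gallery)
```

In this sketch the query inherits both the room and the pose of its closest virtual image, which mirrors how a single retrieval step can provide the context and the 3DoF label together.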

To label the EGO-CH dataset with 3DOF camera poses, we ran COLMAP on a subset of the 11 contexts of Palazzo Bellomo. The subset includes the contexts Sala 5, Sala 7, Sala 9 and Sala 13. The COLMAP 3D reconstructions returned 932 labeled images, which were geo-registered with the 3D model of the virtual Bellomo dataset.
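The page does not describe how the geo-registration was computed. As an illustration of the general idea only, the sketch below fits a least-squares 2D similarity transform (scale, rotation, translation) that maps positions in a reconstruction's coordinate frame onto corresponding positions in the 3D model's frame, restricted to the floor plane. The restriction to 2D, the function names and the sample correspondences are all assumptions.

```python
import cmath


def fit_similarity_2d(src, dst):
    """Least-squares fit of complex a, b such that dst ≈ a * src + b.

    Points are (x, y) tuples encoded as complex numbers, so abs(a) is the
    scale and cmath.phase(a) is the rotation angle in radians.
    """
    if len(src) != len(dst) or len(src) < 2:
        raise ValueError("need at least two point correspondences")
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    num = sum((qi - q_mean) * (pi - p_mean).conjugate() for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    if den == 0:
        raise ValueError("source points must not all coincide")
    a = num / den
    b = q_mean - a * p_mean
    return a, b


def apply_similarity_2d(a, b, point):
    """Map a point from the source frame to the destination frame."""
    z = a * complex(*point) + b
    return (z.real, z.imag)


# Hypothetical correspondences: scale 2, rotation of 90 degrees, shift (3, 4).
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [(3.0, 4.0), (3.0, 6.0), (1.0, 4.0)]
a, b = fit_similarity_2d(src, dst)
scale, angle = abs(a), cmath.phase(a)
```

Once fitted, the same transform can be applied to every camera position of the reconstruction, and the rotation angle can be added to each camera heading to express it in the model's frame.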

Fig. 2: Illustration of the proposed pipeline for domain adaptation.

Paper

S. A. Orlando, A. Furnari, G. M. Farinella - Virtual to Real Unsupervised Domain Adaptation for Image-Based Localization in Cultural Sites - In Fourth IEEE International Conference on Image Processing, Applications and Systems (IPAS), 2020. Download.

Supplementary Material

Download

Acknowledgement

This research is supported by XENIA Progetti - DWORD, by the project VALUE - Visual Analysis for Localization and Understanding of Environments (N. 08CT6209090207, CUP G69J18001060007) granted by PO FESR 2014/2020 - Azione 1.1.5 - "Sostegno all’avanzamento tecnologico delle imprese attraverso il finanziamento di linee pilota e azioni di validazione precoce dei prodotti e di dimostrazioni su larga scala", and by Piano della Ricerca 2016-2018 linea di Intervento 2 of DMI, University of Catania. The authors would like to thank Regione Siciliana Assessorato dei Beni Culturali e dell'Identità Siciliana - Dipartimento dei Beni Culturali e dell'Identità Siciliana and Polo regionale di Siracusa per i siti culturali - Galleria Regionale di Palazzo Bellomo.

People