A Multi Camera Unsupervised Domain Adaptation Pipeline for Object Detection in Cultural Sites through Adversarial Learning and Self-Training

FPV@IPLAB - Department of Mathematics and Computer Science, University of Catania, Italy

Giovanni Pasqualino, Antonino Furnari, Giovanni Maria Farinella

Object detection algorithms enable many interesting applications that can be implemented on different devices, such as smartphones and wearable devices. In the context of a cultural site, running these algorithms on a wearable device, such as a pair of smart glasses, enables the use of augmented reality (AR) to show extra information about the artworks and enrich the visitors’ experience during their tour. However, object detection algorithms need to be trained on many well-annotated examples to achieve reasonable results. This is a major limitation, since the annotation process requires human supervision, which makes it expensive in terms of time and cost. A possible way to reduce these costs is to use tools that automatically generate synthetic labeled images from a 3D model of the site. However, models trained on synthetic data do not generalize well to real images acquired in the target scenario in which they are supposed to be used. Furthermore, object detectors should be able to work with different wearable or mobile devices, which makes generalization even harder. In this paper, we present a new dataset collected in a cultural site to study the problem of domain adaptation for object detection in the presence of multiple unlabeled target domains, corresponding to different cameras, and a labeled source domain of synthetic images used for training. We present a new domain adaptation method which outperforms current state-of-the-art approaches by combining the benefits of aligning the domains at the feature and pixel levels with a self-training process.


We propose a dataset of synthetic and real images of 16 artworks located in the "Galleria regionale Palazzo Bellomo" in Siracusa, Italy. The dataset contains two sets of images, synthetic and real, which are divided as follows:

Synthetic Dataset

  • Training set: 51284 images
  • Validation set: 24525 images
  • Test set: 23960 images

Real Hololens Dataset

  • Training set: 1502 images
  • Test set: 688 images

Real GoPro Dataset

  • Training set: 1911 images
  • Test set: 796 images

Instances Distributions

    The average occupied area (last column) is the average percentage of the image occupied by the bounding boxes of the considered object class.
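As a rough illustration of this statistic, the sketch below computes a per-class average occupied area from a list of annotations. It assumes the value is averaged over annotated instances (the page does not spell out the exact averaging), and the function name and the tuple layout of the annotations are hypothetical, not the dataset's actual annotation format.

```python
from collections import defaultdict


def average_occupied_area(annotations):
    """Return {class_name: average % of the image covered by one box}.

    `annotations` is an iterable of tuples
    (class_name, box_width, box_height, image_width, image_height),
    all sizes in pixels. This layout is an assumption for illustration.
    """
    area_sums = defaultdict(float)
    counts = defaultdict(int)
    for cls, box_w, box_h, img_w, img_h in annotations:
        # Percentage of the image area covered by this bounding box.
        area_sums[cls] += 100.0 * (box_w * box_h) / (img_w * img_h)
        counts[cls] += 1
    return {cls: area_sums[cls] / counts[cls] for cls in area_sums}


if __name__ == "__main__":
    # Toy annotations, not taken from the dataset.
    toy = [
        ("statue", 50, 50, 100, 100),
        ("statue", 100, 50, 100, 100),
        ("painting", 10, 10, 100, 100),
    ]
    print(average_occupied_area(toy))
```

A class whose objects are framed from far away gets a small value here, which gives a quick sense of how hard its instances are to detect.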

    You can download the whole dataset and annotations at this link


    We explore the following methods:
    1) baseline approaches without adaptation;
    2) domain adaptation through image-to-image translation;
    3) domain adaptation through feature alignment;
    4) a new multi-target domain adaptation method, STMDA-RetinaNet (see the figure below);
    5) domain adaptation combining feature alignment and image-to-image translation.

    Step 1

    Architecture of the proposed MDA-RetinaNet model.

    Step 2

    Self-training module for MDA-RetinaNet.
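To make the idea of self-training concrete, here is a minimal sketch of one common ingredient: keeping only confident detections on unlabeled target images and reusing them as pseudo-labels in a later training round. This is a generic illustration, not the authors' implementation; the function name, the detection dictionary layout, and the 0.9 threshold are all assumptions.

```python
def select_pseudo_labels(detections, score_threshold=0.9):
    """Group confident detections by image to use as pseudo-labels.

    `detections` is a list of dicts with keys 'image_id', 'box',
    'label' and 'score' (a hypothetical layout). Detections scoring
    below `score_threshold` (an assumed value) are discarded.
    """
    pseudo_labels = {}
    for det in detections:
        if det["score"] >= score_threshold:
            pseudo_labels.setdefault(det["image_id"], []).append(
                {"box": det["box"], "label": det["label"]}
            )
    return pseudo_labels


if __name__ == "__main__":
    # Toy detector outputs on two unlabeled target images.
    toy = [
        {"image_id": "img1", "box": (10, 10, 50, 50), "label": 3, "score": 0.95},
        {"image_id": "img1", "box": (60, 20, 90, 80), "label": 7, "score": 0.40},
        {"image_id": "img2", "box": (5, 5, 30, 30), "label": 1, "score": 0.92},
    ]
    print(select_pseudo_labels(toy))
```

A higher threshold yields fewer but cleaner pseudo-labels, while a lower one adds more target supervision at the cost of more label noise.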


    The STMDA-RetinaNet architecture code is available at this link


    Qualitative Results


    G. Pasqualino, A. Furnari, G. M. Farinella, A Multi Camera Unsupervised Domain Adaptation Pipeline for Object Detection in Cultural Sites through Adversarial Learning and Self-Training, Computer Vision and Image Understanding, 2022 Paper arXiv

     title = {A multi camera unsupervised domain adaptation pipeline for object detection in cultural sites through adversarial learning and self-training},
     journal = {Computer Vision and Image Understanding},
     pages = {103487},
     year = {2022},
     issn = {1077-3142},
     doi = {https://doi.org/10.1016/j.cviu.2022.103487},
     url = {https://www.sciencedirect.com/science/article/pii/S1077314222000911},
     author = {Giovanni Pasqualino and Antonino Furnari and Giovanni Maria Farinella}

    Conference Paper

    G. Pasqualino, A. Furnari, G. M. Farinella, "Unsupervised Multi-camera Domain Adaptation for Object Detection in Cultural Sites", International Conference on Image Analysis and Processing, 2022 Paper.


    This research has been supported by the project VALUE (N. 08CT6209090207 - CUP G69J18001060007) - PO FESR 2014/2020 - Azione 1.1.5., by Research Program Pia.ce.ri. 2020/2022 Linea 2 - University of Catania, and by MIUR AIM - Linea 1 - AIM1893589 - CUP E64118002540007.