SceneAdapt: Semantic Segmentation Adaptation Through Adversarial Learning

Daniele Di Mauro, Antonino Furnari, Giuseppe Patanè, Sebastiano Battiato, Giovanni M. Farinella

Pattern Recognition Letters, Volume 136, Pages 175-182, 2020

ABSTRACT

Semantic segmentation methods have achieved outstanding performance thanks to deep learning. Nevertheless, when such algorithms are deployed to new contexts not seen during training, it is necessary to collect and label scene-specific data in order to adapt them to the new domain using fine-tuning. This process is required whenever an already installed camera is moved or a new camera is introduced in a camera network, due to the different scene layouts induced by the different viewpoints. To limit the amount of additional training data to be collected, it would be ideal to train a semantic segmentation method using labeled data already available and only unlabeled data coming from the new camera. We formalize this problem as a domain adaptation task and introduce a novel dataset of urban scenes with the related semantic labels. As a first approach to address this challenging task, we propose SceneAdapt, a method for scene adaptation of semantic segmentation algorithms based on adversarial learning. Experiments and comparisons with state-of-the-art approaches to domain adaptation highlight that promising performance can be achieved using adversarial learning both when the two domains contain images of the same scene acquired from different points of view, and when they comprise images of completely different scenes.
Daniele Di Mauro, Antonino Furnari, Giuseppe Patanè, Sebastiano Battiato, Giovanni Maria Farinella, SceneAdapt: Scene-based domain adaptation for semantic segmentation using adversarial learning, Pattern Recognition Letters, Volume 136, Pages 175-182, 2020 [Bibtex]
Daniele Di Mauro, Antonino Furnari, Giuseppe Patanè, Sebastiano Battiato, Giovanni Maria Farinella, SceneAdapt: Scene-based domain adaptation for semantic segmentation using adversarial learning, arXiv Pre-Print

VIDEO DEMO

The video shows an example of View Adaptation:

DATASET

We collected 5,000 frames at 1 fps for each episode-view pair, for a total of 30,000 frames. Each image has a resolution of 800x600 pixels. The time of day and the weather have been generated randomly for each episode. Each image of the dataset is associated with a ground truth semantic segmentation mask produced by the simulator. The 13 annotated classes are: buildings, fences, other, pedestrians, poles, road-lines, roads, sidewalks, vegetation, vehicles, walls, traffic signs and none.
The collected dataset allowed us to consider two different kinds of source-target domain pairs:
  1. point of view adaptation pairs, composed of two subsets acquired from the same scene but with different points of view;
  2. scene adaptation pairs, composed of two subsets belonging to different scenes.
Download Dataset (17GB)
Scene 1 - View A (A1)
Scene 1 - View B (B1)
Scene 2 - View A (A2)
Scene 2 - View B (B2)
Scene 3 - View A (A3)
Scene 3 - View B (B3)
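
The sketch below shows one possible way to load a source-target pair (e.g., labeled frames from Scene 1 - View A and unlabeled frames from Scene 1 - View B) for the point of view adaptation setting. The directory layout, file naming, and per-pixel mask encoding are assumptions for illustration and may not match the released archive exactly.

```python
import os
from glob import glob

import numpy as np
from PIL import Image
import torch
from torch.utils.data import Dataset

# The 13 annotated classes; the index order below is an assumption, not an official mapping.
CLASSES = ["buildings", "fences", "other", "pedestrians", "poles", "road-lines", "roads",
           "sidewalks", "vegetation", "vehicles", "walls", "traffic signs", "none"]


class SceneAdaptSplit(Dataset):
    """One view of the dataset: 800x600 RGB frames plus (optionally) per-pixel class masks.

    `root` is assumed to contain `images/` and `labels/` folders with matching
    file names; adapt the paths to the layout of the downloaded archive.
    """

    def __init__(self, root, with_labels=True, transform=None):
        self.image_paths = sorted(glob(os.path.join(root, "images", "*.png")))
        self.with_labels = with_labels
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = Image.open(self.image_paths[idx]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        if not self.with_labels:
            return image
        # Masks are assumed to store one class index (0..12) per pixel.
        label_path = self.image_paths[idx].replace("images", "labels")
        mask = torch.from_numpy(np.array(Image.open(label_path), dtype=np.int64))
        return image, mask


# Point of view adaptation pair: labeled source (Scene 1 - View A), unlabeled target (Scene 1 - View B).
source = SceneAdaptSplit("scene1_viewA", with_labels=True)
target = SceneAdaptSplit("scene1_viewB", with_labels=False)
```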

CODE

Go to repository
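
As a self-contained illustration of the adversarial learning idea before browsing the repository, the snippet below sketches a common formulation of output-space adversarial adaptation for segmentation: a segmentation network is trained with a supervised loss on the labeled source domain, while a domain discriminator pushes its predictions on the unlabeled target domain to be indistinguishable from source predictions. The network definitions, loss weights, and names here are illustrative assumptions and do not reproduce the exact architecture or losses used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 13  # classes annotated in the dataset

# Placeholder models: any segmentation backbone and a small fully convolutional
# discriminator over the softmax output maps would do.
segmenter = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, NUM_CLASSES, 1),
)
discriminator = nn.Sequential(  # classifies each location as "source" (1) or "target" (0)
    nn.Conv2d(NUM_CLASSES, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 4, stride=2, padding=1),
)

opt_seg = torch.optim.Adam(segmenter.parameters(), lr=1e-4)
opt_disc = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()
lambda_adv = 0.001  # weight of the adversarial term (assumed value)


def train_step(src_img, src_mask, tgt_img):
    """One adaptation step on a labeled source batch and an unlabeled target batch."""
    # 1) Supervised segmentation loss on the source domain.
    src_logits = segmenter(src_img)
    seg_loss = F.cross_entropy(src_logits, src_mask)

    # 2) Adversarial loss: make target predictions look like source predictions.
    tgt_logits = segmenter(tgt_img)
    d_tgt = discriminator(F.softmax(tgt_logits, dim=1))
    adv_loss = bce(d_tgt, torch.ones_like(d_tgt))  # try to fool the discriminator

    opt_seg.zero_grad()
    (seg_loss + lambda_adv * adv_loss).backward()
    opt_seg.step()

    # 3) Discriminator update: tell source predictions apart from target ones.
    d_src = discriminator(F.softmax(src_logits.detach(), dim=1))
    d_tgt = discriminator(F.softmax(tgt_logits.detach(), dim=1))
    disc_loss = bce(d_src, torch.ones_like(d_src)) + bce(d_tgt, torch.zeros_like(d_tgt))

    opt_disc.zero_grad()
    disc_loss.backward()
    opt_disc.step()
    return seg_loss.item(), adv_loss.item(), disc_loss.item()
```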