Scene Adaptation for Semantic Segmentation using Adversarial Learning

Daniele Di Mauro, Antonino Furnari, Giuseppe Patanè, Sebastiano Battiato, Giovanni M. Farinella

In Proceedings of the 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), November 2018. IEEE.

ABSTRACT

Semantic segmentation algorithms based on the deep learning paradigm have achieved outstanding performance. However, in order to obtain good results in a new domain, it is generally necessary to fine-tune a pre-trained deep architecture using new labeled data coming from the target application domain. The fine-tuning procedure is also required when the application settings change, e.g., when a camera is moved or a new camera is installed. This implies the collection and pixel-wise labeling of images to be used for training, which slows down the deployment of semantic segmentation systems in real industrial scenarios and increases industrial costs. Taking into account the aforementioned issues, in this paper we propose an approach based on Adversarial Learning to perform scene adaptation for semantic segmentation. We frame scene adaptation as the task of predicting semantic segmentation masks for images belonging to a Target Scene Context given labeled images coming from a Source Scene Context and unlabeled images coming from the Target Scene Context. Experiments highlight that the proposed method achieves promising performance both when the two scenes contain similar content (i.e., they are related to two different points of view of the same scene) and when the observed scenes contain unrelated content (i.e., they correspond to completely different scenes).
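
To make the adversarial framing more concrete, the following is a minimal PyTorch sketch of one training step for output-space adversarial adaptation: a segmentation network is supervised on labeled source images, while a discriminator tries to tell source predictions from target predictions and the segmenter is pushed to fool it. The names Discriminator, adaptation_step and the weight lambda_adv are illustrative assumptions; the actual architectures and losses used in the paper may differ.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Tells whether a softmax segmentation map comes from the source or target scene."""
    def __init__(self, num_classes=13):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(num_classes, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def adaptation_step(segmenter, discriminator, src_img, src_mask, tgt_img,
                    opt_seg, opt_disc, lambda_adv=0.001):
    bce = nn.BCEWithLogitsLoss()
    ce = nn.CrossEntropyLoss()

    # 1) Supervised segmentation loss on labeled source images.
    src_pred = segmenter(src_img)              # (B, C, H, W) logits
    loss_seg = ce(src_pred, src_mask)          # src_mask: (B, H, W) class ids

    # 2) Adversarial loss: make target predictions indistinguishable from source ones.
    tgt_pred = segmenter(tgt_img)
    d_tgt = discriminator(torch.softmax(tgt_pred, dim=1))
    loss_adv = bce(d_tgt, torch.ones_like(d_tgt))  # segmenter tries to fool the discriminator

    opt_seg.zero_grad()
    (loss_seg + lambda_adv * loss_adv).backward()
    opt_seg.step()

    # 3) Discriminator update: source predictions labeled 1, target predictions labeled 0.
    d_src = discriminator(torch.softmax(src_pred.detach(), dim=1))
    d_tgt = discriminator(torch.softmax(tgt_pred.detach(), dim=1))
    loss_d = bce(d_src, torch.ones_like(d_src)) + bce(d_tgt, torch.zeros_like(d_tgt))

    opt_disc.zero_grad()
    loss_d.backward()
    opt_disc.step()
    return loss_seg.item(), loss_adv.item(), loss_d.item()
```

Note that only the source images need pixel-wise labels; the target scene contributes unlabeled images through the adversarial term.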

VIDEO DEMO

The video below shows an example of point of view adaptation.

DATASET

We collected 5,000 frames at 1 fps for each episode-view pair. The dataset contains 30,000 frames in total. Each image has a resolution of 800x600 pixels. The time of day and weather were generated randomly for each episode. Each image of the dataset is associated with a ground truth semantic segmentation mask produced by the simulator. The 13 annotated classes are: buildings, fences, other, pedestrians, poles, road-lines, roads, sidewalks, vegetation, vehicles, walls, traffic signs and none.
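
As an illustration of how the annotations can be handled, the snippet below lists the 13 classes and loads a frame together with its ground truth mask. The class-id ordering, directory layout and file extensions are assumptions made for this example and may not match the files shipped with the dataset.

```python
from pathlib import Path

import numpy as np
from PIL import Image

# The 13 annotated classes; the id-to-name ordering here is an assumption.
CLASSES = [
    "buildings", "fences", "other", "pedestrians", "poles", "road-lines",
    "roads", "sidewalks", "vegetation", "vehicles", "walls",
    "traffic signs", "none",
]

def load_pair(frames_dir, masks_dir, name):
    """Load an 800x600 RGB frame and its per-pixel class-id mask.

    Paths and the .png extension are illustrative, not the dataset's actual layout.
    """
    img = np.array(Image.open(Path(frames_dir) / f"{name}.png").convert("RGB"))
    mask = np.array(Image.open(Path(masks_dir) / f"{name}.png"))  # (H, W) class ids
    assert img.shape[:2] == (600, 800)  # 800x600 frames as described above
    return img, mask
```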
The collected dataset allowed us to consider two different kinds of source-target domain pairs:
  1. point of view adaptation pairs, composed of two subsets from the same scene context but observed from different views,
  2. scene adaptation pairs, composed of two subsets belonging to different scene contexts (an illustrative pairing is sketched below).
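
The following short sketch shows one possible way to group the six subsets listed below into the two kinds of pairs. The subset names (A1, B1, ...) follow the listing; the exact pairs evaluated in the paper may differ from this illustration.

```python
# Six subsets: three scene contexts, each observed from two views (A and B).
SUBSETS = ["A1", "B1", "A2", "B2", "A3", "B3"]

# Point of view adaptation: same scene context, different view (illustrative).
VIEW_PAIRS = [("A1", "B1"), ("A2", "B2"), ("A3", "B3")]

# Scene adaptation: different scene contexts (illustrative).
SCENE_PAIRS = [("A1", "A2"), ("A1", "A3"), ("A2", "A3")]
```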
Download Dataset (17GB)
Scene 1 - View A (A1)
Scene 1 - View B (B1)
Scene 2 - View A (A2)
Scene 2 - View B (B2)
Scene 3 - View A (A3)
Scene 3 - View B (B3)