We collected 5,000 frames at 1 fps for each episode-view pair, so the dataset contains 30,000 frames in total. Each image has a resolution of 800×600 pixels. The time of day and the weather were generated randomly for each episode. Each image in the dataset is associated with a ground-truth semantic segmentation mask produced by the simulator.
The 13 annotated classes are: buildings,
fences, other, pedestrians, poles, road-lines, roads,
sidewalks, vegetation, vehicles, walls, traffic signs, and none.
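For concreteness, this label set can be encoded as a name-to-ID mapping. Below is a minimal Python sketch; the integer IDs are an assumption (simply the order in which the classes are listed above) and should be checked against the simulator's actual mask encoding, which is not specified here.

```python
import numpy as np

# Assumed encoding: IDs follow the order the classes are listed in the text;
# verify against the simulator's actual palette before use.
CLASSES = [
    "buildings", "fences", "other", "pedestrians", "poles", "road-lines",
    "roads", "sidewalks", "vegetation", "vehicles", "walls",
    "traffic signs", "none",
]
NAME_TO_ID = {name: i for i, name in enumerate(CLASSES)}
ID_TO_NAME = dict(enumerate(CLASSES))

def class_histogram(mask: np.ndarray) -> dict:
    """Count pixels per class in a single-channel (H, W) label mask."""
    ids, counts = np.unique(mask, return_counts=True)
    return {ID_TO_NAME.get(int(i), f"unknown({i})"): int(c)
            for i, c in zip(ids, counts)}
```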
The collected dataset allowed us to consider two different kinds of source-target domain pairs (see the sketch after this list):
- point-of-view adaptation pairs, composed of two subsets from the same scene context but with different views;
- scene adaptation pairs, composed of two subsets belonging to different scene contexts.
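To make the two pairing schemes concrete, the following sketch enumerates ordered source-target pairs from subsets indexed by (scene, view). The `Subset` record and the scene/view names are hypothetical; only the pairing rules follow the definitions above.

```python
from dataclasses import dataclass
from itertools import permutations

@dataclass(frozen=True)
class Subset:
    scene: str  # hypothetical scene-context identifier
    view: str   # hypothetical camera-view identifier

def view_adaptation_pairs(subsets):
    """Same scene context, different point of view."""
    return [(s, t) for s, t in permutations(subsets, 2)
            if s.scene == t.scene and s.view != t.view]

def scene_adaptation_pairs(subsets):
    """Different scene contexts."""
    return [(s, t) for s, t in permutations(subsets, 2)
            if s.scene != t.scene]

# Illustration only: a 3-scene x 2-view split yields the six episode-view
# subsets implied by 30,000 / 5,000 frames; the actual split is not stated.
subsets = [Subset(scene, view)
           for scene in ("scene_a", "scene_b", "scene_c")
           for view in ("view_1", "view_2")]
print(len(view_adaptation_pairs(subsets)))   # 6 ordered pairs
print(len(scene_adaptation_pairs(subsets)))  # 24 ordered pairs
```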