We collected 5,000 frames at 1 fps for each episode-view pair, so the dataset contains 30,000 frames in total. Each image has a resolution of 800×600 pixels. The time of day and the weather were randomly generated for each episode. Each image in the dataset is associated with a ground-truth semantic segmentation mask produced by the simulator. The 13 annotated classes are: buildings, fences, other, pedestrians, poles, road-lines, roads, sidewalks, vegetation, vehicles, walls, traffic signs, and none.
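As an illustration, here is a minimal Python sketch of how a per-image class histogram could be computed from such a mask. It assumes each pixel stores an integer class ID in the range 0-12; the ordering in `CLASSES` is hypothetical, not the simulator's actual encoding.

```python
import numpy as np
from PIL import Image

# Hypothetical ID ordering; the simulator's actual integer encoding may differ.
CLASSES = [
    "none", "buildings", "fences", "other", "pedestrians", "poles",
    "road-lines", "roads", "sidewalks", "vegetation", "vehicles",
    "walls", "traffic signs",
]

def class_histogram(mask_path):
    """Count the pixels of each class in one 800x600 ground-truth mask."""
    mask = np.array(Image.open(mask_path))
    # If masks are RGB-encoded, select the channel carrying the ID first,
    # e.g. mask = mask[..., 0].
    counts = np.bincount(mask.ravel(), minlength=len(CLASSES))
    return dict(zip(CLASSES, counts[:len(CLASSES)].tolist()))
```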
  
  The collected dataset allowed us to consider two different kinds of source-target domain pairs:
  
- point-of-view adaptation pairs, composed of two subsets from the same scene context but with different views;

- scene adaptation pairs, composed of two subsets belonging to different scene contexts (see the sketch after this list).
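Concretely, a minimal sketch of how the two kinds of pairs could be enumerated, assuming each subset is tagged with a hypothetical (scene, view) identifier; the tags below are placeholders, not the dataset's actual names.

```python
from itertools import combinations

# Placeholder identifiers for the episode-view subsets.
subsets = [("scene_a", "view_1"), ("scene_a", "view_2"),
           ("scene_b", "view_1"), ("scene_b", "view_2")]

def domain_pairs(subsets):
    """Split all subset pairs into view-adaptation and scene-adaptation pairs."""
    view_pairs, scene_pairs = [], []
    for (s1, v1), (s2, v2) in combinations(subsets, 2):
        if s1 == s2 and v1 != v2:
            view_pairs.append(((s1, v1), (s2, v2)))   # same scene, different views
        elif s1 != s2:
            scene_pairs.append(((s1, v1), (s2, v2)))  # different scene contexts
    return view_pairs, scene_pairs
```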