C. Quattrocchi1, D. Di Mauro1, A. Furnari1,2, G. M. Farinella1,2
Being able to understand the relations between a user and the surrounding environment is instrumental for assisting workers on a worksite. For instance, understanding which objects a user is interacting with, from images and videos collected through a wearable device, can be useful to inform the worker about the usage of specific objects in order to improve productivity and prevent accidents. Although modern vision systems can rely on advanced algorithms for object detection and for semantic and panoptic segmentation, these methods still require large quantities of domain-specific labeled data, which can be difficult to obtain in industrial scenarios. Motivated by this observation, we propose a pipeline to generate synthetic images from 3D models of real environments and real objects. The generated images are automatically labeled and hence effortless to obtain. Exploiting the proposed pipeline, we generate a dataset of synthetic images automatically labeled for panoptic segmentation, complemented by a small number of manually labeled real images for fine-tuning. Experiments show that the use of synthetic images drastically reduces the number of real images needed to achieve reasonable panoptic segmentation performance.
Dataset generation pipeline
To study the considered problem, we created a dataset composed of two parts: real images with manually annotated segmentation masks, and synthetic images with automatically generated annotations.
Red box: generation of the real dataset: (1) acquisition of real images using a HoloLens 2; (2) extraction of frames and the related camera poses; (3) annotation of the segmentation masks.
Blue box: generation of the synthetic dataset: (4) acquisition of 3D scans of the environment using a Matterport3D scanner; (5) generation of the 3D model; (6) semantic labelling of the 3D model using Blender; (7) generation of a random camera tour inside the 3D model; (8) generation of synthetic frames and semantic labels: (rendering through Blender) the 3D model and the camera positions are processed by a script which renders frames and the associated semantic labels; (conversion to COCO format) the semantic labels are processed by a script which extracts JSON annotations in COCO format.
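To illustrate the conversion stage of step (8), the sketch below turns a rendered per-pixel semantic label mask into COCO-style annotation dicts. This is a hypothetical sketch, not the authors' actual script: the function name, the choice of uncompressed column-major RLE for the segmentation field, and the per-category handling are all assumptions made for illustration.

```python
import numpy as np

def mask_to_coco(label_mask, image_id, category_ids):
    """Convert a semantic label mask (H x W array of class IDs) into a list
    of COCO-style annotation dicts. Illustrative sketch only: the paper's
    real conversion script is not reproduced here."""
    annotations = []
    for cat_id in category_ids:
        binary = (label_mask == cat_id)
        if not binary.any():
            continue  # this category does not appear in the frame
        ys, xs = np.nonzero(binary)
        x0, y0 = int(xs.min()), int(ys.min())
        w = int(xs.max()) - x0 + 1
        h = int(ys.max()) - y0 + 1
        # Uncompressed RLE as in the COCO spec: column-major scan,
        # counts alternate runs of background (first) and foreground.
        flat = binary.flatten(order="F")
        counts, prev, run = [], 0, 0
        for v in flat:
            if v == prev:
                run += 1
            else:
                counts.append(run)
                prev, run = v, 1
        counts.append(run)
        annotations.append({
            "image_id": image_id,
            "category_id": int(cat_id),
            "bbox": [x0, y0, w, h],          # [x, y, width, height]
            "area": int(binary.sum()),
            "segmentation": {"size": list(binary.shape), "counts": counts},
            "iscrowd": 0,
        })
    return annotations
```

In a full pipeline, the resulting dicts would be collected over all rendered frames and serialized with `json.dump`, together with the usual COCO `images` and `categories` sections.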
C. Quattrocchi, D. Di Mauro, A. Furnari, G. M. Farinella. Panoptic Segmentation in Industrial Environments using Synthetic and Real Data. International Conference on Image Analysis and Processing (ICIAP) 2021.