Image-based Navigation in Real-world Environments via Multiple Mid-level Representations: Fusion Models, Benchmark and Efficient Evaluation

Marco Rosano1,3
Antonino Furnari1,5
Luigi Gulino3
Corrado Santoro2
Giovanni Maria Farinella1,4,5

1 FPV@IPLAB - Department of Mathematics and Computer Science, University of Catania, Catania, Italy
2 Robotics Laboratory - Department of Mathematics and Computer Science, University of Catania, Catania, Italy
3 OrangeDev s.r.l., Firenze, Italy
4 Cognitive Robotics and Social Sensing Laboratory, ICAR-CNR, Palermo, Italy
5 Next Vision s.r.l., Catania, Italy

Navigating complex indoor environments requires a deep understanding of the space in which the robotic agent is acting, so that the navigation process can be correctly guided towards the goal location. In recent learning-based navigation approaches, the scene understanding and navigation abilities of the agent are acquired simultaneously by collecting the required experience in simulation. Unfortunately, even though simulators are an efficient tool for training navigation policies, the resulting models often fail when transferred to the real world, mainly because they are not able to capture and generalize the key properties of the scene, which gives rise to a domain gap. One possible solution is to provide the navigation model with mid-level visual representations containing important domain-invariant properties of the scene. But which representations best facilitate the transfer of a model to the real world? How can representations be combined to provide the most useful information to the navigation model? In this work we address these questions by proposing a benchmark of Deep Learning architectures that combine a range of mid-level visual representations to perform a PointGoal navigation task in a Reinforcement Learning setup. All the proposed navigation models were trained with the Habitat simulator on a synthetic office environment and tested in the same environment in the real world using a real robotic platform. Moreover, to efficiently assess their performance in a real context, we propose a validation tool that generates realistic navigation episodes inside the simulator, avoiding the deployment of the navigation models in the real world. Our experiments show that navigation models can benefit from the multi-modal input and that our validation tool can provide a good estimate of the expected navigation performance in the real world, while saving time and resources.
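
To make the idea of combining mid-level representations more concrete, the following is a minimal sketch of two generic fusion strategies: concatenating per-representation feature vectors, and averaging them with softmax-normalized weights. It is not the architecture benchmarked in the paper. The representation names ("depth", "normals"), the function names and the fixed scores are hypothetical and chosen only for illustration; in a trained model the features would come from visual encoders and the weights would be learned.

```python
import math
from typing import Dict, List


def softmax(scores: List[float]) -> List[float]:
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_concat(features: Dict[str, List[float]]) -> List[float]:
    """Concatenate the feature vectors in a fixed (alphabetical) order."""
    fused: List[float] = []
    for name in sorted(features):
        fused.extend(features[name])
    return fused


def fuse_weighted(features: Dict[str, List[float]],
                  scores: Dict[str, float]) -> List[float]:
    """Weighted average of equally sized feature vectors.

    `scores` holds one (hypothetical) relevance score per representation;
    the scores are normalized with a softmax before averaging.
    """
    names = sorted(features)
    dims = {len(features[n]) for n in names}
    if len(dims) != 1:
        raise ValueError("all representations must have the same dimension")
    weights = softmax([scores[n] for n in names])
    fused = [0.0] * dims.pop()
    for w, n in zip(weights, names):
        for i, v in enumerate(features[n]):
            fused[i] += w * v
    return fused


# Toy feature vectors standing in for the outputs of two visual encoders.
toy_features = {"depth": [1.0, 0.0], "normals": [0.0, 1.0]}
toy_scores = {"depth": 0.0, "normals": 0.0}  # equal scores, equal weights

concat_vector = fuse_concat(toy_features)
weighted_vector = fuse_weighted(toy_features, toy_scores)
```

Concatenation keeps all the information but grows the input of the navigation policy with every added representation, while the weighted average keeps the size fixed and lets the weights express how much each representation contributes.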

Navigation episodes demo (conference paper)

Source code

We have released the code to train visual navigation models following the proposed approach. We hope it proves useful for training navigation models that can be successfully deployed in the real world.

Code of the extended journal paper: [GitHub]
Code of the conference paper: [GitHub]


We have released the 3D model, the real-world images of the proposed office environment (the "OrangeDev" environment) and the trajectories used to train and test the navigation models.

Dataset of the extended journal paper
In the updated work we used a new set of real-world images and we proposed a range of new visual navigation models, to combine multiple mid-level representations that capture different visual properties of the scene. The new dataset includes the updated checkpoints of the DA models and the checkpoints of the best performing modality fusion models, together with new real-world observations and with additional updates to improve the models' training and testing efficiency.

[3D + New Images + Trajectories]
[Checkpoints of the navigation models]

Dataset of the conference paper
This dataset includes the pre-trained model for Domain Adaptation (DA) and the pre-trained model with CycleGAN.
For more information on how to use the data with the Habitat Simulator, please take a look at the user guide on the GitHub project page.

[3D + Images + Trajectories + Pre-trained model for DA]
[CycleGAN Sim2Real pre-trained checkpoint]

Papers and Bibtex

Journal paper
Rosano, M., Furnari, A., Gulino, L., Santoro, C., and Farinella G.M., 2022.
Image-based Navigation in Real-World Environments via Multiple Mid-level Representations: Fusion Models, Benchmark and Efficient Evaluation. [LINK]

  title={Image-based Navigation in Real-World Environments via Multiple Mid-level Representations: Fusion Models, Benchmark and Efficient Evaluation},
  author={Marco Rosano and Antonino Furnari and Luigi Gulino and Corrado Santoro and Giovanni Maria Farinella},

Conference paper
Rosano, M., Furnari, A., Gulino, L., and Farinella G.M., 2020.
On Embodied Visual Navigation in Real Environments Through Habitat.

In International Conference on Pattern Recognition (ICPR).

  title={On Embodied Visual Navigation in Real Environments Through Habitat},
  author={Rosano, Marco and Furnari, Antonino and
            Gulino, Luigi and Farinella, Giovanni Maria},
  booktitle={International Conference on Pattern Recognition (ICPR)},


This research is supported by OrangeDev s.r.l., by Next Vision s.r.l., by the project MEGABIT - PIAno di inCEntivi per la RIcerca di Ateneo 2020/2022 (PIACERI) – linea di intervento 2, DMI - University of Catania, and by the grant MIUR AIM - Attrazione e Mobilità Internazionale Linea 1 - AIM1893589 - CUP E64118002540007.
