Egocentric Visitors Localization in Natural Sites

Filippo Luigi Maria Milotta¹,*, Antonino Furnari¹,*, Sebastiano Battiato¹,
Giovanni Signorello², Giovanni Maria Farinella¹,²

¹IPLab, Department of Mathematics and Computer Science - University of Catania, IT
²CUTGANA - University of Catania, IT
*These authors contributed equally to this work.

Localizing visitors in natural environments is challenging due to the unavailability of pre-installed cameras or other infrastructure such as WiFi networks. We propose to perform localization using egocentric images collected from the visitor's point of view with a wearable camera. Localization can be useful to provide services both to the visitors (e.g., showing where they are or what to see next) and to the site manager (e.g., understanding what the visitors pay more attention to and what they miss during their visits). We collected and publicly released a dataset of egocentric videos by asking 12 subjects to freely visit a natural site. Along with the videos, we collected GPS locations by means of a smartphone. Experiments comparing localization methods based on GPS and images highlight that image-based localization is much more reliable in the considered domain, and that small improvements can be achieved by combining GPS- and image-based predictions using late fusion.

In sum, the contributions of this work are as follows:

We propose a dataset of egocentric videos collected in a natural site for the purpose of visitor localization. The dataset has been collected by 12 subjects, contains about 6 hours of recording, and is labeled for the visitor localization task. To our knowledge this dataset is the first of its kind and we hope that it can be valuable to the research community;
We compare methods based on GPS and vision on the visitor localization task in a natural site. Our experiments show that image-based approaches are accurate enough to address the considered task, while GPS-based approaches are markedly less accurate, especially at finer localization granularities;
We investigate the benefits of fusing GPS and vision to improve localization accuracy. Specifically, our experiments suggest that better results can be obtained by fusing the two modalities, which encourages further research in this direction.

Results

First, we report the results of the methods based on visual and GPS data when used separately. Results are shown in terms of accuracy (%) for the three defined levels of localization granularity: 9 Contexts, 9 Subcontexts, and 17 Contexts. All results are computed independently on the three folds defined in the EgoNature dataset, and then averaged. The best average results among the image-based methods are reported in bold, whereas the best average results among the GPS-based methods are underlined. As shown by the table, KNN is on average the best performing method among the ones based on GPS, followed by SVM and DCT. Analogously, among the image-based classifiers, VGG16 is the best performing method, followed by SqueezeNet and AlexNet. KNN and VGG16 remain the best methods of their respective groups in every fold and at every localization granularity.

Granularity	Method	Fold 1	Fold 2	Fold 3	AVG
9 Contexts	DCT	78.76%	46.83%	76.53%	67.37%
	SVM	78.89%	52.59%	78.50%	69.99%
	KNN	80.38%	54.34%	81.05%	71.92%
	AlexNet	90.99%	90.72%	89.63%	90.45%
	SqueezeNet	91.24%	93.08%	91.40%	91.91%
	VGG16	94.26%	95.59%	94.08%	94.64%
9 Subcontexts	DCT	55.26%	40.47%	38.63%	44.78%
	SVM	52.60%	43.03%	49.85%	48.49%
	KNN	59.71%	43.58%	51.61%	51.63%
	AlexNet	82.66%	84.68%	81.94%	83.09%
	SqueezeNet	83.96%	88.06%	84.21%	85.41%
	VGG16	90.68%	91.47%	87.18%	89.78%
17 Contexts	DCT	65.03%	37.25%	64.95%	55.74%
	SVM	66.71%	42.10%	64.62%	57.81%
	KNN	67.50%	43.84%	65.75%	59.03%
	AlexNet	84.73%	87.71%	84.77%	85.74%
	SqueezeNet	87.94%	91.07%	87.42%	88.81%
	VGG16	91.35%	94.03%	91.10%	92.16%
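
The AVG column is the arithmetic mean of the three per-fold accuracies. As a minimal sketch (the helper name `mean_accuracy` is illustrative and does not come from the paper), the 9 Contexts averages for KNN and VGG16 can be recomputed from the fold values in the table above:

```python
# Per-fold accuracies (%) copied from the 9 Contexts rows of the table above.
fold_accuracy = {
    "KNN": [80.38, 54.34, 81.05],
    "VGG16": [94.26, 95.59, 94.08],
}


def mean_accuracy(per_fold):
    """Arithmetic mean of the per-fold accuracies, rounded to two decimals."""
    return round(sum(per_fold) / len(per_fold), 2)


# Hypothetical helper applied to each method; matches the AVG column.
averages = {name: mean_accuracy(values) for name, values in fold_accuracy.items()}
print(averages)
```

The same computation applies to every other row of the table.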

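We summarize the improvements obtained by the late fusion approach with respect to the image-based approaches. The comparisons are performed considering the classifiers which fuse CNNs with DCT, SVM and KNN with a late fusion weight w=4 (i.e., "4*AlexNet + DCT+SVM+KNN", "4*SqueezeNet + DCT+SVM+KNN", "4*VGG16 + DCT+SVM+KNN"). In the table, accuracies are in percent, LF denotes the late fusion classifier, and Imp. is the improvement in percentage points. It should be noted that improvements are larger for the 9 Contexts and 17 Contexts granularities than for the finer 9 Subcontexts granularity, and that they shrink as the image-based classifier becomes more accurate. Even inaccurate GPS positions can be useful when the areas to be distinguished are coarse.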

Granularity	AlexNet	AlexNet LF	Imp.	SqueezeNet	SqueezeNet LF	Imp.	VGG16	VGG16 LF	Imp.
9 Contexts	90.45	91.76	1.31	91.91	93.07	1.16	94.64	95.14	0.50
9 Subcontexts	83.09	84.12	1.03	85.41	86.33	0.92	89.78	90.12	0.34
17 Contexts	85.74	87.15	1.41	88.81	89.82	1.01	92.16	92.75	0.59
Avg Imp.			1.25			1.03			0.48

We have also performed experiments to understand the computational effort needed when employing the different approaches. For each method we report the time needed to process a single sample in milliseconds and the memory required by the model in megabytes. The table shows that, despite their low accuracy, the methods based on GPS are extremely efficient in terms of both required memory and time. Among these methods, DCT and KNN are particularly efficient in terms of memory. In contrast, methods based on CNNs require more memory and processing time. Specifically, among the CNNs, VGG16 is the slowest and heaviest method, while SqueezeNet is the fastest and most compact one. Note that, when late fusion is considered, the overhead required to obtain the GPS-based predictions is negligible due to the high computational efficiency of GPS-based methods.

Method	Time (ms)	Mem (MB)
DCT	0.01	0.07
SVM	0.01	0.33
KNN	0.01	0.12
AlexNet	20.25	217.60
SqueezeNet	18.30	2.78
VGG16	434.86	512.32
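
This page does not describe how the timings above were measured. One common way to estimate per-sample latency is to average wall-clock time over many calls. The sketch below does this with `time.perf_counter` for any callable. The names `time_per_sample_ms` and `predict`, and the sample values, are hypothetical and used only for illustration.

```python
import time


def time_per_sample_ms(predict, samples, repeats=5):
    """Average wall-clock time (ms) that predict() spends on one sample."""
    # Warm-up call so that one-off setup costs are not counted.
    predict(samples[0])
    start = time.perf_counter()
    for _ in range(repeats):
        for sample in samples:
            predict(sample)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(samples))


# Toy stand-in for a classifier: any callable taking one sample works.
def predict(sample):
    return sum(sample) > 0


# Made-up inputs; a real measurement would use actual frames or GPS readings.
samples = [(37.5, 15.08)] * 1000
latency_ms = time_per_sample_ms(predict, samples)
print(f"{latency_ms:.4f} ms per sample")
```

Averaging over repeated passes reduces the effect of timer resolution, which matters for methods that take only a few microseconds per sample.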

EgoNature Dataset

The dataset used in this work was collected by asking 12 volunteers to visit the Botanical Garden of the University of Catania. The garden is about 170 m long and about 130 m wide. In agreement with experts, we defined 9 Contexts and 9 Subcontexts of interest, which are relevant for collecting behavioral information from the visitors of the site. We asked each volunteer to explore the natural site wearing a camera and carrying a smartphone during their visit. The wearable camera was used to collect egocentric videos of the visits, while the smartphone was used to collect GPS locations. The GPS data was later synchronized with the collected video frames.

We collected and labeled almost 6 hours of recording, from which we sampled a total of 63,581 frames for our experiments. The selected frames have been resized to 128x128 pixels to decrease the computational load.

For more details about our dataset go to this page.

You can download the dataset annotated with GPS coordinates at this link.

Methods

We treat localizing the visitor of the natural site as a classification problem. In particular, we investigate classification approaches based on GPS and images, as well as methods jointly exploiting both modalities. Each of the considered methods is trained and evaluated on the proposed EgoNature dataset according to the three defined levels of localization granularity: 9 Contexts, 9 Subcontexts, and 17 Contexts.

Localization Methods Based on GPS: when GPS coordinates are available, localization can be performed directly by analyzing such data. We employed different popular classifiers: Decision Classification Trees (DCT), Support Vector Machines (SVM) with linear and RBF kernels, and k-nearest neighbors (KNN). Each of the considered approaches takes the raw GPS coordinates as input and produces a probability distribution over the considered classes as output. We tune the hyperparameters of the classifiers by performing a grid search with cross validation. We remove duplicate GPS measurements from the training set when training GPS-based methods, as we noted that this improved performance. Please note that, for a fair comparison, duplicate measurements are not removed at test time.
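
To make the GPS-based pipeline concrete, the following is a minimal, standard-library sketch of a k-nearest-neighbor classifier that maps a raw GPS reading to a probability distribution over classes, with duplicates removed from the training set only. It is not the authors' implementation. The function names, the class labels and the coordinates are made up, k is fixed rather than tuned by grid search, duplicates are assumed to mean identical (coordinates, label) pairs, and plain Euclidean distance on latitude and longitude is used as an approximation.

```python
from collections import Counter
import math


def remove_duplicates(points, labels):
    """Keep the first occurrence of each (coordinates, label) pair."""
    seen, kept_points, kept_labels = set(), [], []
    for point, label in zip(points, labels):
        key = (point, label)
        if key not in seen:
            seen.add(key)
            kept_points.append(point)
            kept_labels.append(label)
    return kept_points, kept_labels


def knn_predict_proba(train_points, train_labels, query, classes, k=3):
    """Class distribution among the k training points nearest to query."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean, in degrees
    )[:k]
    votes = Counter(label for _, label in nearest)
    return [votes[c] / len(nearest) for c in classes]


# Made-up (latitude, longitude) readings with made-up class labels.
points = [(37.5150, 15.0830), (37.5150, 15.0830), (37.5151, 15.0831),
          (37.5160, 15.0840), (37.5161, 15.0841), (37.5162, 15.0842)]
labels = ["entrance", "entrance", "entrance",
          "greenhouse", "greenhouse", "greenhouse"]
classes = ["entrance", "greenhouse"]

# Duplicates are dropped from the training data only, never from test queries.
train_points, train_labels = remove_duplicates(points, labels)
probs = knn_predict_proba(train_points, train_labels,
                          (37.5161, 15.0840), classes, k=3)
print(len(train_points), probs)
```

In this toy example the repeated first reading is dropped, leaving five training points, and the three nearest neighbors of the query all belong to the second class.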

Localization Methods Based on Images: we consider three different CNN architectures: AlexNet, SqueezeNet and VGG16. The three architectures achieve different performances on the ImageNet classification task and require different computational budgets. In particular, AlexNet and VGG16 require the use of a GPU at inference time, whereas SqueezeNet is extremely lightweight and has been designed to run on CPU, so it can be exploited on mobile and wearable devices. We initialize each network with ImageNet-pretrained weights and fine-tune each architecture for the considered classification tasks.

Localization Methods Jointly Exploiting Images and GPS: images and GPS are deemed to carry complementary information. Specifically, even inaccurate GPS measurements can be useful as a rough location prior, while images can be used to distinguish neighboring areas characterized by different visual appearance and geometry. We investigate how the complementary nature of these two signals can be leveraged using late fusion during classification. Late fusion can be seen as a kind of weighted average between the vectors of class scores produced by image- and GPS-based methods.
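
The weighted average can be sketched as follows, assuming the fused score is w times the CNN score vector plus the unweighted score vectors of the GPS-based classifiers, as suggested by labels such as "4*VGG16 + DCT+SVM+KNN" in the Results section. The function name `late_fusion`, the division by the total weight (which does not change the predicted class) and all score values are assumptions made for illustration, not details taken from the paper.

```python
def late_fusion(image_scores, gps_scores_list, w=4.0):
    """Weighted average of one image-based and several GPS-based score vectors."""
    n_classes = len(image_scores)
    fused = [w * s for s in image_scores]
    for gps_scores in gps_scores_list:
        if len(gps_scores) != n_classes:
            raise ValueError("all score vectors must cover the same classes")
        fused = [f + g for f, g in zip(fused, gps_scores)]
    # Normalizing by the total weight keeps the result a distribution
    # when every input vector sums to one (an assumption of this sketch).
    total_weight = w + len(gps_scores_list)
    return [f / total_weight for f in fused]


# Made-up class scores over three classes.
cnn = [0.40, 0.50, 0.10]
dct = [1.00, 0.00, 0.00]
svm = [0.70, 0.20, 0.10]
knn = [0.80, 0.20, 0.00]

fused = late_fusion(cnn, [dct, svm, knn], w=4.0)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

In this toy example the CNN alone would pick the second class, while the fused scores favor the first one because all three GPS-based classifiers agree on it. This illustrates how a rough location prior can correct a visually ambiguous prediction.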

Paper

F. L. M. Milotta, A. Furnari, S. Battiato, G. Signorello, G. M. Farinella. Egocentric visitors localization in natural sites. Journal of Visual Communication and Image Representation, 2019. Download the paper here.

Patent

G. M. Farinella, G. Signorello, A. Furnari, S. Battiato, E. Scuderi, A. Lopes, L. Santo, M. Samarotto, G. P. A. Distefano, D. G. Marano, "Integrated Method With Wearable Kit For Behavioural Analysis And Augmented Vision" (Italian Version: "Metodo Integrato con Kit Indossabile per Analisi Comportamentale e Visione Aumentata"), Patent Application number 102018000009545, filing date: 17/10/2018, Università degli Studi di Catania, Xenia Gestione Documentale S.R.L., IMC Service S.R.L.

Acknowledgement

This research is supported by PON MISE – Horizon 2020, Project VEDI - Vision Exploitation for Data Interpretation, Prog. n. F/050457/02/X32 - CUP: B68I17000800008 - COR: 128032, and Piano della Ricerca 2016-2018 linea di Intervento 2 of DMI of the University of Catania.