Egocentric Point of Interest Recognition in Cultural Sites


Francesco Ragusa1,2, Antonino Furnari1, Sebastiano Battiato1, Giovanni Signorello3, Giovanni Maria Farinella1,3

1IPLab, Department of Mathematics and Computer Science - University of Catania, IT
2XGD - Xenia Progetti s.r.l., Acicastello, Catania, IT
3CUTGANA - University of Catania, IT



We consider the problem of detecting and recognizing points of interest in cultural sites. We observe that a "point of interest" in a cultural site may be either an object or an environment, and highlight that using an object detector is beneficial for recognizing points of interest which occupy a small part of the frame. The contributions of this work are the following:

  • The observation of the dual nature of points of interest in a cultural site, which include both objects and environments;
  • The extension of the UNICT-VEDI dataset with bounding box annotations;
  • A comparison of approaches based on whole-scene processing with approaches based on object detection for recognizing points of interest in cultural sites.


Results



A video example of our object-based approach in action is shown below. Quantitative results are summarized in the following table (see the paper for more details).



Comparison of the three temporal approaches and the object-based method. A slash (/) denotes entries for which no score is reported.
Class 57-POI 57-POI-N 9-Classifiers Object-based
1.1 Ingresso 0.70 0.68 0.68 0.18
2.1 RampaS.Nicola 0.58 0.57 0.64 0.29
2.2 RampaS.Benedetto 0.29 0.28 0.55 0.17
3.1 SimboloTreBiglie 0.00 0.00 0.00 0.00
3.2 ChiostroLevante / / / /
3.3 Plastico / / / /
3.4 Affresco 0.48 0.49 0.50 0.61
3.5 Finestra_ChiostroLevante 0.00 0.00 0.00 0.56
3.6 PortaCorodiNotte 0.73 0.70 0.76 0.28
3.7 TracciaPortone 0.00 0.00 0.93 0.73
3.8 StanzaAbate / / / /
3.9 CorridoioDiLevante 0.60 0.49 0.81 0.04
3.10 CorridoioCorodiNotte 0.76 0.88 0.92 0.61
3.11 CorridoioOrologio 0.67 0.67 0.81 0.25
4.1 Quadro 0.91 0.92 0.79 0.90
4.2 PavimentoOriginaleAltare 0.44 0.64 0.46 0.70
4.3 BalconeChiesa 0.87 0.82 0.86 0.65
5.1 PortaAulaS.Mazzarino 0.46 0.59 0.48 0.72
5.2 PortaIngressoMuseoFabbrica 0.37 0.42 0.91 0.55
5.3 PortaAntirefettorio 0.00 0.00 0.40 0.79
5.4 PortaIngressoRef.Piccolo 0.00 0.00 0.00 0.85
5.5 Cupola 0.91 0.49 0.87 0.99
5.6 AperturaPavimento 0.95 0.94 0.94 0.96
5.7 S.Agata 0.97 0.97 0.97 1.00
5.8 S.Scolastica 0.96 0.99 0.85 0.93
5.9 ArcoconFirma 0.72 0.83 0.77 0.83
5.10 BustoVaccarini 0.87 0.94 0.88 0.90
6.1 QuadroSantoMazzarino 0.96 0.81 0.68 0.82
6.2 Affresco 0.89 0.89 0.96 0.97
6.3 PavimentoOriginale 0.92 0.89 0.96 0.97
6.4 PavimentoRestaurato 0.48 0.60 0.74 0.37
6.5 BassorilieviMancanti 0.77 0.61 0.88 0.76
6.6 LavamaniSx 0.82 0.81 0.99 0.97
6.7 LavamaniDx 0.00 0.00 0.98 0.94
6.8 TavoloRelatori 0.88 0.69 / 0.80
6.9 Poltrone 0.56 0.87 0.47 0.29
7.1 Edicola 0.70 0.77 0.86 0.84
7.2 PavimentoA 0.00 0.00 0.42 0.23
7.3 PavimentoB 0.00 0.00 0.00 0.07
7.4 PassavivandePavimentoOriginale 0.57 0.58 0.68 0.80
7.5 AperturaPavimento 0.83 0.82 0.80 0.71
7.6 Scala 0.59 0.68 0.86 0.88
7.7 SalaMetereologica 0.76 0.75 0.98 0.44
8.1 Doccione 0.79 0.80 0.86 0.74
8.2 VanoRaccoltaCenere 0.35 0.40 0.47 0.43
8.3 SalaRossa 0.73 0.81 0.84 0.52
8.4 ScalaCucina 0.68 0.72 0.60 0.71
8.5 CucinaProvv. 0.66 0.62 0.81 0.83
8.6 Ghiacciaia 0.43 0.95 0.69 0.45
8.7 Latrina 0.98 0.98 0.99 0.89
8.8 OssaeScarti 0.64 0.77 0.72 0.70
8.9 Pozzo 0.41 0.90 0.94 0.91
8.10 Cisterna 0.13 0.00 0.00 0.48
8.11 BustoPietroTacchini 0.95 0.97 0.99 0.85
9.1 NicchiaePavimento 0.73 0.75 0.95 0.75
9.2 TraccePalestra 0.79 0.91 0.28 0.86
9.3 PergolatoNovizi 0.75 0.69 / 0.73
Negative 0.46 0.62 0.60 0.52
mF1 0.59 0.62 0.66 0.65
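The last row of the table reports a mean of the per-class scores. As a minimal sketch of how such per-class and mean F1 scores can be computed from frame-level predictions (assuming the reported values are F1 scores; the label lists below are hypothetical, not taken from the dataset):

```python
def per_class_f1(y_true, y_pred, classes):
    """Compute per-class F1 scores from frame-level labels.

    y_true, y_pred: lists of class labels, one per frame.
    classes: the set of classes to score.
    """
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return scores

def mean_f1(scores):
    """Average the per-class F1 scores (macro average)."""
    return sum(scores.values()) / len(scores)
```

For example, `mean_f1(per_class_f1(truth, preds, classes))` yields a single summary number comparable across the four methods.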




Dataset







We extended the UNICT-VEDI dataset proposed in Ragusa et al. [1] by annotating the presence of 57 different points of interest with bounding boxes. We only considered data acquired using the head-mounted Microsoft HoloLens device. The UNICT-VEDI dataset comprises a set of training videos (at least one per point of interest), plus 7 test videos acquired by subjects visiting a cultural site. Each video of the dataset has been temporally labeled to indicate the environment in which the visitor is moving (9 different environments are labeled) and the point of interest observed by the visitor (57 points of interest have been labeled). For each of the 57 points of interest included in the UNICT-VEDI dataset, we annotated approximately 1,000 frames from the provided training videos, for a total of 54,248 frames.

We considered a total of 9 environments and 57 points of interest. For more details about our dataset, see this page.

You can download the dataset annotated with bounding boxes at this link.


Methods



Recognizing the points of interest observed by visitors in a cultural site is the natural next step after visitor localization (Ragusa et al. [1]). We consider three different variants of the pipeline presented in Ragusa et al. [1] and an approach based on an object detector.

57-POI: the discrimination component of the method in Ragusa et al. [1] is trained to discriminate between the 57 points of interest. No "negative" frames are used for training. The rejection of negatives is performed by the rejection component of Ragusa et al. [1];

57-POI-N: the discrimination component of the method in Ragusa et al. [1] is trained to discriminate between the 57 points of interest plus the "negative" class. In this case, negative frames are explicitly used for training. The rejection component of Ragusa et al. [1] is further used to detect more negatives;

9-Classifiers: nine context-specific instances of the method in Ragusa et al. [1] are trained to recognize the points of interest related to the nine different contexts of the UNICT-VEDI dataset (i.e., one classifier per context). Similarly to 57-POI, no negatives are used for training;
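The 9-Classifiers variant can be summarized as a two-stage routing scheme: first localize the environment, then dispatch the frame to that environment's classifier. A minimal sketch, where `localize_environment` and the per-context classifiers are hypothetical placeholders standing in for the trained models:

```python
def classify_frame(frame, localize_environment, context_classifiers):
    """Route a frame to the classifier of its predicted environment.

    localize_environment: callable mapping a frame to one of the 9 environments.
    context_classifiers: dict mapping each environment to its POI classifier.
    """
    env = localize_environment(frame)       # stage 1: which of the 9 contexts?
    classifier = context_classifiers[env]   # pick the context-specific model
    return classifier(frame)                # stage 2: predicted point of interest
```

This keeps each classifier's label space small (only the points of interest of one environment), at the cost of depending on correct localization.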

Object-based: a YOLOv3 object detector is used to perform the detection and recognition of each of the 57 points of interest. At test time, YOLOv3 returns the coordinates of a set of bounding boxes with the related class scores for each frame. If no bounding box has been predicted in a given frame, we reject the frame and assign it to the "negative" class. If multiple bounding boxes are found in a specific frame, we choose the bounding box with the highest class score and assign its class to the frame.
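The frame-labeling rule used by the object-based method can be sketched as follows (the `(class_name, score, bbox)` tuple is a hypothetical representation of the detector's per-frame output, not YOLOv3's actual API):

```python
def frame_label(detections):
    """Assign a frame-level label from a list of detections.

    detections: list of (class_name, score, bbox) tuples for one frame.
    Returns the class of the highest-scoring box, or "negative" when
    no box was predicted (the frame is rejected).
    """
    if not detections:
        return "negative"
    best = max(detections, key=lambda d: d[1])  # highest class score wins
    return best[0]
```

Note that rejection here falls out of the detector itself: an empty detection list is treated as the "negative" class, with no separate rejection component.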



Paper

F. Ragusa, A. Furnari, S. Battiato, G. Signorello, G. M. Farinella. Egocentric Point of Interest Recognition in Cultural Sites. In International Conference on Computer Vision Theory and Applications, 2019. Download the paper here.



Acknowledgement

This research is supported by PON MISE – Horizon 2020, Project VEDI - Vision Exploitation for Data Interpretation, Prog. n. F/050457/02/X32 - CUP: B68I17000800008 - COR: 128032, and Piano della Ricerca 2016-2018 linea di Intervento 2 of DMI of the University of Catania.




People