The dataset used in this challenge is an extension of the Social Popularity Image Dynamics dataset (SPID 2018) used in [1] and [2].
The new dataset consists of ~30K Flickr images labelled with their engagement scores (i.e., views, comments and favorites) over a period of 30 days from their upload to the platform. For each image, the dataset also includes user and photo social features that have been shown to influence image popularity on Flickr (e.g., number of the user's contacts, number of the user's groups, mean views of the user's images, photo tags, etc.).
Sketch of the crawling procedure.
The information related to the selected images has been crawled within 2 hours of their upload. Then, at least two samples per day have been collected for the subsequent 30 days.
The dataset contains the timestamps (in seconds) of the upload and crawling instants, as well as the timestamp of each sampling of the engagement scores. The ground truth sequences have been defined by interpolating the samples at regular intervals of 24 hours (i.e., 24 * 3600 = 86400 seconds), starting 24 hours after the image upload, as shown in the following plot.
Sampling example of the number of views at regular intervals of 24 hours.
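The resampling onto a 24-hour grid can be sketched as follows. This is only an illustration under stated assumptions: the interpolation is taken to be linear (the text does not name the method), and the function name `daily_sequence` and its parameters are hypothetical, not part of the dataset tooling.

```python
import numpy as np

DAY_SECONDS = 24 * 3600  # 86400 seconds


def daily_sequence(upload_ts, sample_ts, sample_values, n_days=30):
    """Interpolate irregular engagement samples onto a 24-hour grid.

    Assumption: linear interpolation between consecutive samples.
    The grid starts 24 hours after upload and has n_days points.
    """
    # Express sample times as seconds elapsed since the upload.
    elapsed = np.asarray(sample_ts, dtype=float) - float(upload_ts)
    values = np.asarray(sample_values, dtype=float)

    # np.interp expects increasing x coordinates, so sort by time first.
    order = np.argsort(elapsed)

    # Grid points at 1, 2, ..., n_days days after upload.
    grid = DAY_SECONDS * np.arange(1, n_days + 1)

    # Grid points outside the sampled range take the nearest edge value.
    return np.interp(grid, elapsed[order], values[order])
```

For example, with samples of 10, 30 and 50 views taken 12, 36 and 60 hours after upload, `daily_sequence(0, [43200, 129600, 216000], [10, 30, 50], n_days=2)` gives values close to 20 and 40 for days 1 and 2.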
Crawling details
In SPID 2018 [1,2], about 20,000 Flickr photos and related metadata (i.e., photo, user and group statistics) have been crawled and monitored for a period of 30 days. During the 30-day monitoring, some photos have been removed by their authors or are no longer publicly available. As a consequence, only ~17,000 photos have been tracked for 30 consecutive days.
Then, the dataset has been augmented by crawling a new set of ~15,000 photos, with ~13,000 photos monitored for 30 days.
Therefore, the new dataset contains ~30,000 photos that have been split into training and test sets.
The training data will be released as soon as the challenge starts. The test data will be available only a few days before the result submission deadline.
The data will be provided in .csv format. Only textual/numeric data will be provided. Participants who wish to use the images need to download them directly from Flickr. To this end, the images' URLs are provided.
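A minimal sketch of how the image URLs might be collected from the .csv data and fetched is given below. It assumes that the `URL` column resolves directly to an image file and that a `.jpg` extension is appropriate; neither is confirmed here. The function names, the output directory and the file naming scheme are hypothetical.

```python
import csv
import io
import urllib.request
from pathlib import Path


def image_targets(csv_text, out_dir="images"):
    """Return (url, local_path) pairs, one per row that has a URL."""
    reader = csv.DictReader(io.StringIO(csv_text))
    targets = []
    for row in reader:
        url = (row.get("URL") or "").strip()
        if url:
            # Assumption: name each file after its FlickrId, with a .jpg suffix.
            targets.append((url, Path(out_dir) / f"{row['FlickrId']}.jpg"))
    return targets


def download_all(targets):
    """Fetch each URL, skipping photos that can no longer be retrieved."""
    for url, path in targets:
        path.parent.mkdir(parents=True, exist_ok=True)
        try:
            urllib.request.urlretrieve(url, path)
        except OSError as exc:  # URLError and HTTPError derive from OSError
            print(f"skipped {url}: {exc}")
```

Keeping the URL extraction separate from the download step makes it possible to inspect the list of targets before any network request is made. Since some photos may have been removed or made private after crawling, failed downloads are reported and skipped rather than aborting the run.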
Specifically, the data contain the following information about photos, users and groups:
FlickrId INT, -- Id of the image on Flickr
UserId TEXT, -- Id of the user on Flickr
URL TEXT, -- URL of the image on Flickr
Path TEXT,
DatePosted TEXT, -- Timestamp of the date of the image post on Flickr
DateTaken TEXT, -- Timestamp of the claimed date of the image creation
DateCrawl TEXT, -- Timestamp of the crawling time
Camera TEXT, -- Camera model (if available)
Size INT, -- Total number of pixels of the original image (if available)
Title TEXT, -- Title of the post
Description TEXT, -- Description of the post
NumSets INT, -- Number of albums the photo is shared in
NumGroups INT, -- Number of groups the photo is shared in
AvgGroupsMemb REAL, -- Avg number of members of the groups in which the photo is shared
AvgGroupPhotos REAL, -- Avg number of photos of the groups in which the photo is shared
Tags TEXT, -- Social tags of the post
Latitude TEXT, -- (if available)
Longitude TEXT, -- (if available)
Country TEXT, -- (if available)
UserId TEXT, -- Id of the user on Flickr
Username TEXT,
Ispro INT, -- If the user's account is registered as professional
HasStats INT,
Contacts INT, -- Number of contacts of the user on Flickr
PhotoCount INT, -- Number of photos of the user
MeanViews REAL, -- Mean number of views of the user's photos
GroupsCount INT, -- Number of groups the user is enrolled in
GroupsAvgMembers REAL, -- Avg number of members of the groups in which the user is enrolled
GroupsAvgPictures REAL -- Avg number of photos of the groups in which the user is enrolled
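Since several fields are marked "(if available)", rows read from the .csv files may contain empty values. The sketch below shows one possible way to convert a row to typed values according to the listing above. It assumes that missing values appear as empty strings, which is not stated in the description; `CONVERTERS` and `parse_row` are hypothetical names.

```python
import csv
import io

# Converters for the INT and REAL fields of the listing.
# Columns not named here (TEXT fields) are kept as strings.
CONVERTERS = {
    "FlickrId": int, "Size": int, "NumSets": int, "NumGroups": int,
    "AvgGroupsMemb": float, "AvgGroupPhotos": float,
    "Ispro": int, "HasStats": int, "Contacts": int, "PhotoCount": int,
    "MeanViews": float, "GroupsCount": int,
    "GroupsAvgMembers": float, "GroupsAvgPictures": float,
}


def parse_row(row):
    """Convert one csv.DictReader row; empty or malformed values become None."""
    parsed = {}
    for name, value in row.items():
        if name is None:  # extra cells beyond the header are ignored
            continue
        text = (value or "").strip()
        convert = CONVERTERS.get(name)
        if text == "":
            parsed[name] = None  # assumption: empty string means "not available"
        elif convert is None:
            parsed[name] = text
        else:
            try:
                parsed[name] = convert(text)
            except ValueError:
                parsed[name] = None
    return parsed


# Usage on a small in-memory example with a missing Camera value.
example = "FlickrId,Camera,Size,MeanViews\n42,,1200,3.5\n"
rows = [parse_row(r) for r in csv.DictReader(io.StringIO(example))]
```

Mapping unavailable values to `None` keeps them distinguishable from a genuine zero, which matters for counts such as `NumGroups` or `Contacts`.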
[1] A. Ortis, G. M. Farinella and S. Battiato, "Prediction of Social Image Popularity Dynamics" in International Conference on Image Analysis and Processing. Springer, Cham, 2019.
[2] A. Ortis, G. M. Farinella and S. Battiato, "Predicting Social Image Popularity Dynamics at Time Zero" in IEEE Access. doi: 10.1109/ACCESS.2019.2953856.