Multi-label video classification with Gated Recurrent Units

Abstract

Several deep neural network architectures are proposed and evaluated for multi-label video classification on a new dataset called Trailers15k. The classification is performed with two different approaches in order to analyze the impact of spatio-temporal features on the task. First, only the spatial features of video frames are considered; for this, two models based on convolutional neural networks using transfer learning are presented. Subsequently, the classification incorporates temporal features; in this second approach, two architectures combining a convolutional network and gated recurrent units (GRUs) are presented.

Models

For all the models, a convolutional neural network is used. Its architecture is based on the Inception-v3 model pre-trained on ImageNet, with certain changes to allow multi-label classification.
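As a rough illustration of those changes, the following minimal sketch (assuming Keras/TensorFlow; the exact layers, sizes, and training setup in [1] may differ) replaces the single-label softmax head of Inception-v3 with a sigmoid layer, so that the 10 labels are predicted independently:

    from tensorflow.keras.applications import InceptionV3
    from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
    from tensorflow.keras.models import Model

    NUM_CLASSES = 10  # the 10 Trailers15k classes

    # Inception-v3 pre-trained on ImageNet, without its single-label softmax head.
    base = InceptionV3(weights="imagenet", include_top=False,
                       input_shape=(299, 299, 3))
    x = GlobalAveragePooling2D()(base.output)
    # Sigmoid instead of softmax: each label is predicted independently.
    outputs = Dense(NUM_CLASSES, activation="sigmoid")(x)
    model = Model(base.input, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")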

Trailers15k is used for this project. It is a multi-label dataset consisting of 15,000 movie trailers associated with 10 different classes. A trailer is characterized by summarizing the most important scenes of the entire film in a short video that is generally out of chronological order. This setting makes it difficult to obtain a representation suitable for multi-label classification.

The two models that exploit only the spatial features of the video frames are the following.

  • Single-Frame-CNN: this model classifies the whole video using only the middle frame.

  • Multi-Frame-CNN-Average: this model classifies using one frame per second. A simple algorithm is used to select the most representative frames. Each frame is classified separately and the predictions are then averaged to produce a classification for the whole video (a sketch of this averaging step follows the list).
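The averaging step can be sketched as follows (assumptions: model is the multi-label CNN above, frames holds one preprocessed frame per second, and the 0.5 decision threshold is illustrative, not taken from [1]):

    import numpy as np

    def classify_video_average(model, frames, threshold=0.5):
        """Average per-frame sigmoid scores into one video-level label set."""
        scores = model.predict(np.stack(frames))       # (n_frames, NUM_CLASSES)
        mean_scores = scores.mean(axis=0)              # one score per label
        return (mean_scores >= threshold).astype(int)  # binary label vector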

The models that exploit the spatio-temporal features of video frames are the following.

  • Multi-Frame-CNN-CS-GRU: this model classifies using one frame per second. Its particularity is that it employs a compact spatial representation of the frames. The architecture combines a convolutional neural network and a recurrent neural network, specifically a gated recurrent unit (GRU). Each frame is fed to the CNN and classified to obtain its confidence scores, and the sequence of per-frame confidence scores is used as the input to the GRU.

  • Multi-Frame-CNN-GRU: this model follows almost the same approach as Multi-Frame-CNN-CS-GRU, with the difference that it uses a richer spatial representation. In this case, the inputs to the GRU are the spatial features obtained directly from the convolutional neural network (a sketch of the recurrent part follows the list).
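A minimal sketch of the recurrent part shared by both models follows (assuming Keras; the layer sizes are illustrative, not taken from [1]). The only difference between the two models is the per-frame input dimensionality: 10 confidence scores for Multi-Frame-CNN-CS-GRU versus the 2048-dimensional Inception-v3 feature vector for Multi-Frame-CNN-GRU:

    from tensorflow.keras.layers import GRU, Dense, Input, Masking
    from tensorflow.keras.models import Model

    MAX_FRAMES = 120     # illustrative: ~2 minutes at one frame per second
    FEATURE_DIM = 2048   # Inception-v3 features; would be 10 for confidence scores
    NUM_CLASSES = 10

    inputs = Input(shape=(MAX_FRAMES, FEATURE_DIM))
    x = Masking(mask_value=0.0)(inputs)   # ignore zero padding of short trailers
    x = GRU(256)(x)                       # summarize the frame sequence
    outputs = Dense(NUM_CLASSES, activation="sigmoid")(x)
    gru_model = Model(inputs, outputs)
    gru_model.compile(optimizer="adam", loss="binary_crossentropy")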

The classification models are described in more detail in [1].

Results

The aim of the project was to analyze the impact of spatio-temporal features on multi-label video classification. Four models were presented to explore two types of feature representations: spatial and spatio-temporal. The results obtained are summarized in the following table.

Model                                      Accuracy   Precision   Recall
Single-Frame-CNN (spatial)                   14.90%      25.94%   16.18%
Multi-Frame-CNN-Avg (spatial)                24.27%      37.28%   16.18%
Multi-Frame-CNN-CS-GRU (spatio-temporal)     32.62%      52.95%   35.03%
Multi-Frame-CNN-GRU (spatio-temporal)        44.22%      62.76%   53.48%

These results have a straightforward interpretation. The models that employ a recurrent network in their architecture, motivated by the idea of incorporating temporal information, yield better classification. The best results, obtained by the last model, Multi-Frame-CNN-GRU, are due to its richer spatial feature representation of the frames.
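The exact metric definitions are those of [1] and are not restated here; purely as an assumption, a common example-based choice for multi-label accuracy, precision, and recall (which would be consistent with accuracy being lower than precision in the table) can be computed as follows:

    import numpy as np

    def multilabel_metrics(y_true, y_pred):
        """Example-based metrics; y_true, y_pred: binary (n_videos, n_labels)."""
        inter = np.logical_and(y_true, y_pred).sum(axis=1)
        union = np.logical_or(y_true, y_pred).sum(axis=1)
        accuracy = np.mean(inter / np.maximum(union, 1))
        precision = np.mean(inter / np.maximum(y_pred.sum(axis=1), 1))
        recall = np.mean(inter / np.maximum(y_true.sum(axis=1), 1))
        return accuracy, precision, recall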

Code

The code of the models is available in a Bitbucket repository.

Dataset

The dataset created for this project is described and available on the Trailers15k page.

Talk

A talk on this work was given as part of the Women in Data Science event and of the dissemination activities of the deep learning group led by Prof. Gibran Fuentes Pineda.

Acknowledgements

This research was carried out thanks to the Program of Support for Research Projects and Technological Innovation (PAPIIT) of UNAM, project IA104016, "End-to-end generation of video summaries based on deep neural networks". I thank DGAPA-UNAM for the scholarship received.

References

[1] B. Montalvo. Clasificación multi-etiqueta de videos cortos usando unidades recurrentes reguladas (Multi-label classification of short videos using gated recurrent units). Master's thesis, Universidad Nacional Autónoma de México, Mexico, 2018 (in Spanish). [PDF]
