Violent Scene Detection Using Trajectory-based Features

Fudan University and City University of Hong Kong @ MediaEval2012

[See more details in our notebook paper]

Overview

Automatically detecting violent scenes in videos has great potential in several applications, such as movie selection or recommendation for children. The annual MediaEval benchmark introduced this task in 2011 to promote research on the topic.

In MediaEval 2012, we participated in this challenging task and explored several interesting issues, with a particular focus on novel features. Our approach achieved the top performance in mAP@20 (mean average precision over the top 20 detected shots) and was runner-up in mAP@100 among all 35 submissions worldwide.

Technical Components

Our system framework is shown below, where the circled numbers indicate the submitted runs. We extract a diverse set of audio-visual features and use chi-square kernel SVMs for violent scene detection.
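As a concrete illustration, here is a minimal sketch of the classification step with scikit-learn, whose chi2_kernel implements the exponentiated chi-square kernel; the gamma value and matrix names are illustrative placeholders rather than the exact settings of our runs.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_chi2_svm(X_train, y_train, gamma=1.0):
    """Fit an SVM on a precomputed exponentiated chi-square kernel."""
    K_train = chi2_kernel(X_train, gamma=gamma)  # exp(-gamma * chi2 distance)
    clf = SVC(kernel="precomputed")
    clf.fit(K_train, y_train)
    return clf

def score_shots(clf, X_train, X_test, gamma=1.0):
    """Score test shots; kernel rows are computed against the training set."""
    K_test = chi2_kernel(X_test, X_train, gamma=gamma)
    return clf.decision_function(K_test)
```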

Features

Trajectory-based Features: We first extract dense local patch trajectories, from which we generate motion representations, called TrajMF, that encode the relative locations and motions between trajectory pairs. These features serve as a strong baseline in our system. Please refer to our ECCV 2012 paper for more details.
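The toy sketch below only illustrates the relative-location/motion idea behind pairwise trajectory features; it is not the exact TrajMF construction, which is described in the ECCV 2012 paper, and the summary statistics chosen here are placeholders.

```python
import numpy as np

def pairwise_relative_motion(trajectories):
    """Toy pairwise relative-motion statistics over equal-length trajectories.

    Each trajectory is an (L, 2) array of (x, y) points. For every pair we
    record the mean relative location and the mean relative motion per step.
    This is NOT the exact TrajMF construction (see the ECCV 2012 paper).
    """
    feats = []
    for i in range(len(trajectories)):
        for j in range(i + 1, len(trajectories)):
            rel_loc = trajectories[i] - trajectories[j]   # relative locations
            rel_motion = np.diff(rel_loc, axis=0)         # relative motion per step
            feats.append(np.concatenate([rel_loc.mean(axis=0),
                                         rel_motion.mean(axis=0)]))
    return np.asarray(feats)
```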

SIFT: This feature is based on standard SIFT local descriptors and the well-known bag-of-visual-words (BoW) method.
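Since SIFT, STIP and MFCC (below) all rely on the same BoW pipeline, here is a minimal quantization sketch; the MiniBatchKMeans clustering and L1 normalization are illustrative choices, and the 4000-word default follows the STIP/MFCC settings below (the SIFT vocabulary size is not specified here).

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptors, n_words=4000):
    """Cluster local descriptors (e.g., SIFT) into a visual vocabulary."""
    return MiniBatchKMeans(n_clusters=n_words, random_state=0).fit(descriptors)

def bow_histogram(vocab, shot_descriptors):
    """Quantize a shot's descriptors and build an L1-normalized histogram."""
    words = vocab.predict(shot_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```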

Spatial-Temporal Interest Points (STIP): STIP captures a space-time volume in which video pixel values have large variations in both space and time. This feature is also converted to BoW histograms, using a vocabulary of 4000 codewords.

MFCC: MFCCs are the only audio feature in our framework; they are computed for every 32ms time window with 50% overlap. Again, BoW is used to convert the set of MFCCs from each video shot into a fixed-dimensional vector, using a vocabulary of 4000 audio codewords.
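A minimal extraction sketch with librosa is shown below; the 32 ms window and 50% overlap follow the description above, while the 16 kHz sampling rate and 13 coefficients are assumptions for illustration.

```python
import librosa

def shot_mfccs(audio_path, n_mfcc=13):
    """Compute MFCCs over 32 ms windows with 50% overlap (16 ms hop)."""
    y, sr = librosa.load(audio_path, sr=16000)  # assumed sampling rate
    n_fft = int(0.032 * sr)                     # 32 ms -> 512 samples at 16 kHz
    hop = n_fft // 2                            # 50% overlap
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    return mfcc.T  # one n_mfcc-dim vector per frame, ready for BoW quantization
```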

Concept-based Feature: Different from the low-level features described above, the concept-based feature consists of mid-level indicators, where each dimension is the prediction output of a semantic concept detector. Ten concepts are provided in MediaEval 2012, covering violence-related topics such as "presence of blood", "fights", "presence of fire", and "gunshots". We use the above low-level features to train an SVM detector for each concept and concatenate their outputs into a 10-dimensional concept-based representation.
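The sketch below shows how such a 10-dimensional descriptor can be assembled from per-concept SVM outputs; concept_classifiers is a hypothetical list of the ten detectors, each trained on a precomputed kernel as in the earlier sketch.

```python
import numpy as np

def concept_feature(concept_classifiers, K_test):
    """Stack per-concept SVM outputs into a mid-level descriptor.

    concept_classifiers: list of 10 SVMs (one per MediaEval concept), each
    fit with kernel="precomputed"; K_test is the test-vs-train kernel matrix.
    """
    scores = [clf.decision_function(K_test) for clf in concept_classifiers]
    return np.stack(scores, axis=1)  # shape: (n_shots, 10)
```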

Temporal Smoothing

It is well known that temporal structure is useful for video content analysis. Complex methods exist for modeling temporal information, such as graphical models. In this task, we opt for a simple but very efficient temporal smoothing method, which takes cues from the shots before and after a target shot into account. Two smoothing choices are adopted, illustrated in the sketch after their descriptions below.

Feature Smoothing: This uses the averaged feature over a three-shot window to represent the shot in the middle of the window. Classification/Detection is performed on the smoothed features.

Score Smoothing: Different from feature smoothing, score smoothing uses the features of each single shot for classification, and then smooths (averages) the prediction scores over three-shot windows.
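Both choices reduce to a simple sliding-window average, as in the sketch below; the edge padding at the boundaries of a shot sequence is an illustrative choice.

```python
import numpy as np

def smooth_windows(values, window=3):
    """Average each entry with its neighbors over a sliding window (edges padded)."""
    pad = window // 2
    padded = np.pad(values, [(pad, pad)] + [(0, 0)] * (values.ndim - 1), mode="edge")
    return np.stack([padded[i:i + len(values)] for i in range(window)]).mean(axis=0)

# Feature smoothing: average the features, then classify the smoothed features.
#   X_smoothed = smooth_windows(X)                       # (n_shots, dim)
# Score smoothing: classify per-shot features, then average the scores.
#   scores_smoothed = smooth_windows(per_shot_scores)    # (n_shots,)
```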

Submissions and Results

As indicated in the framework above, we submitted five runs based on different feature combinations and/or smoothing choices. Our baseline run 5 uses the seven trajectory-based features, and run 4 adds three more features (SIFT, STIP and MFCC). Run 3 further combines the concept-based feature. Run 2 and run 1 apply feature smoothing and score smoothing respectively, using the same feature set as run 4. In all the submitted runs, kernel-level early fusion (the mean of the individual-feature kernels) is used to combine multiple features.
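The sketch below illustrates this kernel-level fusion; the feature-matrix names are hypothetical, and sharing a single gamma across features is a simplification for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def fused_kernel(feature_mats, gamma=1.0):
    """Average the chi-square kernels of several feature types (early fusion)."""
    return np.mean([chi2_kernel(X, gamma=gamma) for X in feature_mats], axis=0)

# Example (hypothetical feature matrices over the same training shots):
# K = fused_kernel([X_traj, X_sift, X_stip, X_mfcc])
# clf = SVC(kernel="precomputed").fit(K, y_train)
```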

The figure on the right shows the performance of all 35 official submissions, where our run 1 achieves the highest mAP@20 (0.736). Our run 5, which uses only the seven trajectory-based features, is already very competitive (mAP@20 = 0.656, mAP@100 = 0.539). The three additional features used in run 4 are not very helpful, and the concept-based scores (run 3) do not improve the results either. Comparing the two temporal smoothing choices, score smoothing (run 1) is significantly better.

[See more details in our notebook paper]