With the massive growth of video data in modern society, video content understanding has become an important research topic, and detecting violent incidents in surveillance videos is of great significance to maintaining public safety. Violent incident detection technology can automatically screen and identify violent incidents in video. On the one hand, violent incidents can be discovered in time; on the other hand, behaviors in large-scale video data that may endanger public safety can be screened effectively offline. However, detecting violent incidents in video is technically difficult, for reasons including the following:
(I) violent incidents are highly polymorphic, so it is difficult to extract universal feature descriptions from them;
(II) too little positive sample data is available for model training;
(III) the resolution of surveillance video is low.
Most existing mainstream methods for recognizing and detecting behaviors in video take deep learning as the core technology and use a deep learning model to automatically extract and identify features of the video content. However, because violent incidents are polymorphic and training data is scarce, deep learning models, which require massive amounts of data, are difficult to apply to this problem. Therefore, methods based on local spatial-temporal feature descriptors remain popular for violent incident detection. Their main idea is to reflect behavior features by modeling the relationships between local feature descriptors (e.g., spatial-temporal interest points).
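As an illustration only, one common way to aggregate local descriptors into a clip-level behavior feature is a bag-of-words histogram. The sketch below is not the method of any particular system discussed here; the function name `bag_of_words_histogram`, the two-dimensional toy descriptors, and the two-word codebook are all hypothetical, and real spatial-temporal descriptors would have far higher dimension and a codebook learned from training clips.

```python
import numpy as np

def bag_of_words_histogram(descriptors, codebook):
    """Quantize local descriptors against a codebook and return a
    normalized histogram of codeword counts (hypothetical helper)."""
    descriptors = np.asarray(descriptors, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    n_words = codebook.shape[0]
    # A clip with no detected interest points yields an all-zero feature.
    if descriptors.size == 0:
        return np.zeros(n_words)
    # Squared Euclidean distance from every descriptor to every codeword,
    # giving an array of shape (n_descriptors, n_words).
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    # Assign each descriptor to its nearest codeword.
    nearest = np.argmin(dists, axis=1)
    # Count assignments per codeword and normalize to sum to 1.
    hist = np.bincount(nearest, minlength=n_words).astype(float)
    return hist / hist.sum()

# Toy data (assumed for illustration): four descriptors, two codewords.
toy_codebook = [[0.0, 0.0], [10.0, 10.0]]
toy_descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0], [1.0, 1.0]]
toy_feature = bag_of_words_histogram(toy_descriptors, toy_codebook)
# toy_feature is [0.75, 0.25]: three descriptors fall on the first codeword.
```

The resulting fixed-length vector could then be passed to a conventional classifier. A histogram of this kind discards the spatial and temporal arrangement of the interest points, so it captures only part of the relationships between descriptors.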