It is known to use video input to detect types of action. Action type detection may be used in video surveillance systems and for retrieving video segments from stored video data using search queries. Video search queries may be expressed in terms of action type, for example to search for a video fragment where somebody exchanges an item with another person. To facilitate a wide range of applications, detection of dozens of different types of action is desirable. State-of-the-art detection of human actions from such videos is not very reliable yet.
Action detectors based on the bag-of-features model (hereafter referred to as bag-of-features action detectors) have been demonstrated to be very effective for the task of action detection. Bag-of-features action detectors are constructed by combining spatiotemporal feature detections, a codebook or quantizer to transform the feature detections into histograms as a means to represent (a part of) a video as a feature vector, and a classifier to detect the action. An example of a bag-of-features process for detecting the presence of a predetermined type of action in video data comprises detecting spatiotemporal points of interest in the video data; extracting a feature descriptor from the video data in spatiotemporal areas at the detected spatiotemporal points of interest; assigning the detected spatiotemporal points of interest to bins in a feature vector, also called a bag-of-features histogram, based on the extracted feature data; computing, for each respective bin of the feature histogram, a bin count of the spatiotemporal points of interest that have been assigned to the bin; computing match scores between the feature histogram and each of a plurality of reference histograms for the predetermined type of action; and summing products of the match scores with factors for the reference histograms. The method may be applied to a plurality of predetermined types of action respectively, using different sets of reference histograms for the respective different predetermined types of action. The match score may be a sum over the bins of the smallest of the bin value of the feature histogram for the bin and the bin value of the reference histogram for the bin. However, other known measures for histogram intersection may be used. A detection score may be computed from the result of summing the products by applying a predetermined detection function to a sum of the result and a bias value.
A yes/no detection may be obtained by comparing the result of summing the products with a threshold.
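By way of illustration only, the bin counting, match score, weighted sum, detection score and yes/no detection described above can be sketched in Python as follows. The function names, the choice of a logistic function as the detection function, and all numeric values are assumptions made for this example and are not taken from any particular detector.

```python
import math

def bin_counts(assignments, n_bins):
    """Count, for each bin, the points of interest assigned to that bin."""
    hist = [0] * n_bins
    for b in assignments:
        hist[b] += 1
    return hist

def intersection(hist, ref):
    """Match score: sum over bins of the smaller of the two bin values."""
    return sum(min(h, r) for h, r in zip(hist, ref))

def weighted_sum(hist, refs, factors):
    """Sum of products of the match scores with the factors of the reference histograms."""
    return sum(f * intersection(hist, ref) for ref, f in zip(refs, factors))

def detection_score(hist, refs, factors, bias):
    """Detection function (assumed logistic here) applied to weighted sum plus bias."""
    return 1.0 / (1.0 + math.exp(-(weighted_sum(hist, refs, factors) + bias)))

def detect(hist, refs, factors, threshold):
    """Yes/no detection: compare the weighted sum with a threshold."""
    return weighted_sum(hist, refs, factors) > threshold

# Hypothetical data: six points of interest assigned to four bins,
# and two reference histograms with their factors.
hist = bin_counts([0, 0, 2, 3, 0, 2], n_bins=4)
refs = [[2, 1, 2, 0], [0, 3, 0, 4]]
factors = [0.5, -0.25]
```

With these hypothetical values the feature histogram shares most of its mass with the first reference histogram and little with the second, so the weighted sum is positive. For a different predetermined type of action, the same feature histogram would be scored against a different set of reference histograms and factors.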
Bag-of-features action detectors are attractive because the prior art provides for an automatic training process to determine the reference histograms using bin values obtained from training video segments. Typically this involves the use of training video segments with associated designation codes that indicate the type (or types) of action shown in the video segment. From these video segments a positive and a negative training set for a predetermined type of action can be derived, consisting of the video segments that are associated with the designation code of the predetermined type of action and the video segments that are not associated with that designation code, respectively. The training process is used to make a selection of reference histograms that maximizes the correlation between detection results and membership of the positive and negative training sets. Usually the training process is also used to select a bias value, factors for the reference histograms, as well as parameters for the assignment of the detected spatiotemporal points of interest to bins of the feature histogram.
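The derivation of the positive and negative training sets from designation codes may, for example, look like the following sketch. The segment identifiers, the designation codes and the function name `split_training_set` are hypothetical and serve only to illustrate the idea.

```python
def split_training_set(segments, action_code):
    """Split labelled segments into positive and negative sets for one action type.

    `segments` is a list of (segment_id, set_of_designation_codes) pairs.
    """
    positive = [seg for seg, codes in segments if action_code in codes]
    negative = [seg for seg, codes in segments if action_code not in codes]
    return positive, negative

# Hypothetical training segments; a segment may show more than one type of action.
segments = [
    ("clip_a", {"dig", "walk"}),
    ("clip_b", {"exchange"}),
    ("clip_c", {"walk"}),
]
positive, negative = split_training_set(segments, "walk")
```

The same set of labelled segments can be split once per predetermined type of action, so that each type obtains its own positive and negative training set.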
The advantages of such bag-of-features action detectors are simplicity, straightforward implementation, and computational efficiency. Such bag-of-features detectors have proven to be effective for a range of actions, including quite complex actions such as digging in the ground, falling onto the ground, and chasing somebody.
Yet, for the detection of more complex actions, such as the exchange of an item, or burying or hauling something, the standard bag-of-features action detectors did not suffice. One of the reasons that the detection of exchange, bury or haul is hard is that these actions involve detailed motion patterns and their duration is short. A large part of the total set of features is triggered by irrelevant actions that precede or follow the detailed action (e.g. walking) or by background clutter (e.g. a person moving in the background). The relevant subset of features is likely to be a small fraction of the total set.
Codebook creation for visual recognition is described in an article by Jurie et al. published in the proceedings of the 10th International Conference on Computer Vision 2005 in Beijing (ICCV 2005), Vol. 1, pages 604-610 (EPO reference XP010854841). Jurie et al. use a codebook algorithm that selects patches from an image (e.g. 11×11 pixel gray level patches), computes a descriptor value from the image content in the patch, and quantizes the descriptor value. Quantization is based on clustering, that is, the selection of descriptor values that form the centers of clusters of descriptor values obtained from patches in training images. Thus, clustering produces many clusters in the region of descriptor values where the maximum density of descriptor values occurs. In addition, Jurie et al. propose to obtain centers of clusters in lower density regions by repeating the clustering after eliminating patches in the maximum density region.
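The idea of repeating the selection of a cluster center after eliminating descriptors from the densest region can be illustrated with the following simplified sketch. It is a greedy stand-in written for this illustration, not the procedure of Jurie et al.; the fixed radius, the two-dimensional descriptors and the function names are assumptions.

```python
import math

def dist(a, b):
    """Euclidean distance between two descriptor values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_codebook(descriptors, radius, max_centers):
    """Greedily pick centers in the densest region, then eliminate that region and repeat."""
    remaining = list(descriptors)
    centers = []
    while remaining and len(centers) < max_centers:
        # Densest point: the descriptor with the most neighbours within `radius`.
        best = max(remaining,
                   key=lambda c: sum(dist(c, d) <= radius for d in remaining))
        centers.append(best)
        # Eliminate the descriptors covered by the new center before repeating.
        remaining = [d for d in remaining if dist(best, d) > radius]
    return centers

def quantize(descriptor, centers):
    """Quantize a descriptor to the index of the nearest center."""
    return min(range(len(centers)), key=lambda i: dist(descriptor, centers[i]))

# Hypothetical descriptors: a dense group, a smaller group and an isolated value.
descriptors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
               (5.0, 5.0), (5.1, 5.0), (9.0, 0.0)]
centers = build_codebook(descriptors, radius=0.5, max_centers=3)
```

Because the dense group is removed after its center has been chosen, later iterations also place centers in the sparser regions, which plain clustering on all descriptors would tend to neglect.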
Jurie et al. note that in this case the codebook will become larger than the number of useful features, so that subsequent feature selection becomes necessary. Jurie et al. name mutual information, odds ratio and linear SVM weights as feature selection methods, but disclose no detail.
Feature selection using linear support vector machines is disclosed by Brank et al. in a Microsoft Research technical report (MSR-TR-2002-63) titled “Feature selection using linear support vector machines” (EPO reference XP055055892). Brank et al. consider the problem that the training set is too large to perform complete SVM training. Instead, Brank et al. propose initial SVM training using a reduced training set, followed by the elimination of features based on feature scoring, and SVM retraining using only the retained features.
Brank et al. disclose information gain, odds ratio and linear SVM weights as feature scoring methods for the elimination. Information gain expresses the entropy increase resulting from elimination of a feature. Odds ratio uses the ratio between the probabilities of the feature in positive and negative training examples. Use of linear SVM weights considers the case that SVM detection is based on comparing a linear SVM kernel value with a threshold. The linear SVM kernel is a sum of products of feature counts with weight values. In this case, features are retained based on the size of the corresponding weight in the linear SVM kernel.
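As a hedged illustration of feature scoring of this kind, the following sketch computes a linear score as a sum of products of feature counts with weights, retains the features with the largest weights, and computes an odds ratio for a single feature. Ranking by absolute weight, the odds-based form of the ratio, and all numeric values are assumptions of this example rather than details taken from Brank et al.

```python
def linear_score(counts, weights):
    """Sum of products of feature counts with weight values."""
    return sum(w * c for w, c in zip(weights, counts))

def select_features(weights, keep):
    """Indices of the `keep` features with the largest absolute weight (assumed criterion)."""
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return sorted(ranked[:keep])

def odds_ratio(p_pos, p_neg):
    """Ratio of the odds of a feature in positive and in negative examples.

    Assumes both probabilities lie strictly between 0 and 1.
    """
    return (p_pos / (1.0 - p_pos)) / (p_neg / (1.0 - p_neg))

# Hypothetical weights from an initial training round, and feature counts of one example.
weights = [0.9, -0.05, 0.02, -0.7]
counts = [2, 10, 4, 1]

kept = select_features(weights, keep=2)
reduced_score = linear_score([counts[i] for i in kept], [weights[i] for i in kept])
```

After the selection, retraining would use only the retained features; here the score on the retained features is computed with the old weights merely to show which terms remain.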
Wang et al. disclose action recognition using spatio-temporal interest points in an article titled “Action recognition with multi-scale spatio-temporal contexts”, published in the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pages 3185-3192 (EPO reference XP032037995). Wang et al. propose to capture contextual information about each interest point based on the density of features near the interest point. Training is used to select an SVM kernel that uses this density.