Human activity analysis is required for a variety of applications including video surveillance systems, human-computer interaction, security monitoring, threat assessment, sports interpretation, and video retrieval for content-based search engines [A1, A2]. Moreover, given the tremendous amount of video data currently available online, there is a great demand for automated systems that analyze and understand the contents of these videos. Recognizing and localizing human actions in a video is the primary component of such a system, and is also typically considered to be the most important, as it can affect the performance of the whole system significantly. Although there are many methods to determine human actions in highly controlled environments, this task remains a challenge in real world environments due to camera motion, cluttered background, occlusion, and scale/viewpoint/perspective variations [A3-A6]. Moreover, the same action performed by two persons can appear to be very different. In addition, clothing, illumination and background changes can increase this dissimilarity [A7-A9].
To date, in the computer vision community, “action” has largely been taken to be a human motion performed by a single person, taking up to a few seconds, and containing one or more events. Walking, jogging, jumping, running, hand waving, picking up something from the ground, and swimming are some examples of such human actions [A1, A2, A6]. Accordingly, it would be beneficial for a solution to the problem of event recognition and localization in real environments to be provided. It would be further beneficial for such a solution to offer a fast data-driven approach, which describes the content of a video.
Similarly, in a range of applications it would be beneficial to provide an automated video surveillance system capable of determining/detecting unusual or suspicious activities, uncommon behaviors, or irregular events in a scene. Accordingly, it would be beneficial to provide a system whose primary objective in respect of automated video surveillance systems is anomaly detection, because the sought-after situations are not observed frequently. Although the term anomaly is typically not defined explicitly, such systems are based upon the implicit assumption that events that occur occasionally are potentially suspicious, and thus may be considered as being anomalous [B3-B12]. It would also be beneficial if the system were self-starting, requiring no human training or input, such that the system establishes anomalies with respect to the context and regularly observed patterns.
Within the prior art, spatio-temporal volumetric representations of human activity have been used to eliminate some pre-processing steps, such as background subtraction and tracking, but have been shown to suffer major drawbacks such as requiring salient point detection in activity detection implementations and ignoring geometrical and temporal structures of the visual volumes due to the non-ordered manner of storage. Further, they are unable to handle scale variations (spatial, temporal, or spatio-temporal) because they are too local, in the sense that they consider just a few neighboring video volumes (e.g., five nearest neighbors in [A11] or just one neighbor in [A4]). Accordingly, it would be beneficial to have a multi-scale, hierarchical solution which incorporates spatio-temporal compositions and their uncertainties, allowing statistical techniques to be applied to recognize activities or anomalies.
As noted above, event understanding in videos is an important element of all computer vision systems either in the context of visual surveillance or action recognition. Therefore, an event or activity should be represented in such a way that it retains all of the important visual information in a compact structure.
In the context of human behavior analysis, many studies have focused on the action recognition problem by invoking human body models, tracking-based methods, and local descriptors [A1]. The early work often depended on tracking [A16-A19], in which humans, body parts, or some interest points were tracked between consecutive frames to obtain the overall appearance and motion trajectory. It is recognized that the performance of these algorithms is highly dependent on tracking, which sometimes fails for real world video data [A20].
Alternatively, shape template matching has been employed for activity recognition; e.g., two-dimensional (2D) shape matching [A23] or its three-dimensional (3D) extensions, as well as exploiting optical flow matching [A13, A24, A25]. In these prior art approaches, action templates are constructed to model the actions and these are then used to locate similar motion patterns. Other studies have combined both shape and motion features to achieve more robust results [A26, A27], claiming that this representation offers improved robustness to object appearance [A26].
In a recent study [A27], shape and motion descriptors were employed to construct a shape motion prototype for human activities within a hierarchical tree structure and action recognition was performed in the joint shape and motion feature space. Although it may appear that these prior art approaches are well suited to action localization, they require a priori high-level representations of the actions to be identified. Further, they depend on such image pre-processing stages as segmentation, object tracking, and background subtraction [A28], which can be extremely challenging when it is considered that in real-world deployments, one typically has unconstrained environments.
Normal events observed in a scene will be referred to herein as the “dominant” behaviors. These are events that have a higher probability of occurrence than others in the video and hence generally do not attract much attention. One can further categorize dominant behaviors into two classes. In the literature on human attention processes, the first usually deals with foreground activities in space and time while the second describes the scene background (by definition, the background includes pixels in the video frames whose photometric properties, such as luminance and color, are either static or stationary with respect to time).
Typically, the detection of the latter is more restrictively referred to as background subtraction, which is the building block of many computer vision algorithms. However, dominant behavior detection is more general and more complicated than background subtraction, since it includes the scene background while not being limited to it. Thus the manner in which these two human attention processes differ is the way that they use the scene information. Most background subtraction methods are based on the principle that the photometric properties of the scene in the video, such as luminance and color, are stationary. In contrast, dominant behavior understanding can be seen as a generalization of the classical background subtraction method in which all of the dynamic contents of the video come into play as well.
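The stationarity assumption behind classical background subtraction can be illustrated with a minimal sketch. It is not taken from any cited reference: the running-average model, the function names `update_background` and `foreground_mask`, and the `alpha` and `threshold` values are all assumptions chosen for illustration.

```python
def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the per-pixel running average (assumed model)."""
    return [
        [(1.0 - alpha) * b + alpha * f for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


def foreground_mask(background, frame, threshold=25.0):
    """Mark pixels whose luminance departs from the background model."""
    return [
        [abs(f - b) > threshold for b, f in zip(b_row, f_row)]
        for b_row, f_row in zip(background, frame)
    ]


# Toy 2x3 luminance frames (hypothetical values).
empty_scene = [[10.0, 10.0, 10.0], [10.0, 10.0, 10.0]]
scene_with_object = [[10.0, 200.0, 10.0], [10.0, 10.0, 10.0]]

# Learn the background from a few frames of the unchanged scene.
background = empty_scene
for _ in range(5):
    background = update_background(background, empty_scene)

# Only the pixel whose luminance changed is flagged as foreground.
mask = foreground_mask(background, scene_with_object)
print(mask)
```

Such a model only captures static or stationary photometric properties. Regularly recurring dynamic content, which dominant behavior detection is also meant to cover, would be flagged as foreground on every occurrence.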
In the context of abnormality detection, approaches that focus on local spatio-temporal abnormal patterns are very popular. These rely mainly on extracting and analyzing local low-level visual features, such as motion and texture, either by constructing a pixel-level background model and behavior template [B29, B30, B31, B32] or by employing spatio-temporal video volumes (STVs), obtained by dense sampling or interest point selection [B4, B33, B34, B35, B36, B37, B38, B39, B40, B41, B42, B43, B68, B31]. In large part, the former relies on an analysis of the activity pattern (busy-idle rates) of each pixel in each frame as a function of time. These are employed to construct a background model, either by analyzing simple color features at each pixel [B29] or more complex motion descriptors [B8, B32].
More advanced approaches also incorporate the spatio-temporal compositions of the motion-informative regions to build background and behavior templates [B31, B43, B44] that are subtracted from newly observed behaviors in order to detect an anomaly. In [B8], dynamic behaviors are modeled using spatio-temporal oriented energy filters to construct an activity pattern for each pixel in a video frame. Generally, the main drawback associated with these methods is their locality. Since the activity pattern of a pixel cannot be used for behavioral understanding, their applicability in surveillance systems is restricted to the detection of local temporal phenomena [B8, B30].
In order to eliminate the requirement for such pre-processing, Derpanis et al. [A10] proposed so-called “action templates”. These are calculated as oriented local spatio-temporal energy features that are computed as the response of a set of tuned 3D Gaussian third-order derivative filters applied to the data. Sadanand et al. [A29] introduced action banks in order to make these template-based recognition approaches more robust to viewpoint and scale variations. Recently, tracking and template-based approaches have been combined to improve the action detection accuracy [A18, A30].
In a completely different vein within the prior art, models based on exploiting so-called bags of local visual features have recently been studied extensively and shown promising results for action recognition [A3, A7, A11, A26, A8, A31, A32, A33, A34, A49]. The idea behind the Bag of Visual Words (BOW) comes from text understanding problems. The understanding of a text document relies on the interpretation of its words. Therefore, high-level document understanding requires low-level word interpretation. Analogously, computers can accomplish the task of visual recognition in a similar way.
In general, visual event understanding approaches based on BOW extract and quantize the video data to produce a set of video volumes that form a “visual vocabulary”. These are then employed to form a “visual dictionary”. Herein this visual dictionary is referred to as a “codebook”. Using the codebook, visual information is converted into an intermediate representation, upon which sophisticated models can be designed for recognition. Codebooks are constructed by applying “coding” rules to the extracted visual vocabularies. The coding rules are essentially clustering algorithms which form a group of visual words based on their similarity [B43]. Each video sequence is then represented as a histogram of codeword occurrences and the obtained representation is fed to an inference mechanism, usually a classifier.
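The codebook pipeline described above can be sketched as follows. This is an illustrative example only: k-means is assumed as the clustering ("coding") rule, the descriptors are hypothetical two-dimensional toy values, and the function names `build_codebook` and `bow_histogram` are not drawn from any cited work.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_codeword(descriptor, codebook):
    """Index of the codeword closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(descriptor, codebook[i]))


def build_codebook(descriptors, initial_centers, iterations=10):
    """Plain k-means (Lloyd's algorithm) from the given initial centers."""
    codebook = [list(center) for center in initial_centers]
    for _ in range(iterations):
        groups = [[] for _ in codebook]
        for d in descriptors:
            groups[nearest_codeword(d, codebook)].append(d)
        for i, members in enumerate(groups):
            if members:  # keep the old center if a cluster becomes empty
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def bow_histogram(descriptors, codebook):
    """Normalized histogram of codeword occurrences for one video."""
    if not descriptors:
        return [0.0] * len(codebook)
    counts = Counter(nearest_codeword(d, codebook) for d in descriptors)
    return [counts[i] / len(descriptors) for i in range(len(codebook))]


# Hypothetical descriptors of video volumes used to build the codebook.
training = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.9], [0.9, 1.0]]
codebook = build_codebook(training, initial_centers=[training[0], training[2]])

# Descriptors extracted from one video; the histogram is what a classifier sees.
video = [[0.0, 0.0], [1.0, 1.0], [0.9, 0.9], [0.8, 1.0]]
histogram = bow_histogram(video, codebook)
print(codebook, histogram)
```

Note that the histogram records only how often each codeword occurs. Where and when the corresponding volumes occurred in the video is discarded at this step.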
A major advantage of using volumetric representations of videos is that it permits the localization and classification of actions using data driven non-parametric approaches instead of requiring the training of sophisticated parametric models. In the literature, action inference is usually determined by using a wide range of classification approaches, ranging from sub-volume matching [A24], nearest neighbor classifiers [A40] and their extensions [A37], support [A32] and relevance vector machines [A11], as well as even more complicated classifiers employing probabilistic Latent Semantic Analysis (pLSA) [A3].
In contrast, Boiman et al. [A40] have shown that a rather simple nearest neighbor image classifier in the space of the local image descriptors is equally as efficient as these more sophisticated classifiers. This also implies that the particular classification method chosen is not as critical as originally thought, and that the main challenge for action representation is therefore using appropriate features.
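A nearest neighbor classifier operating directly in the space of local descriptors can be sketched as below. This is a simplified illustration in the spirit of such classifiers, not the exact method of [A40]: the class labels, the toy descriptors, and the function name `classify` are assumptions. Each query descriptor is matched to its nearest descriptor in every class, and the class with the smallest total distance is selected.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(query_descriptors, labelled_descriptors):
    """Pick the class whose descriptors lie closest to the query, in total."""
    totals = {}
    for label, class_descriptors in labelled_descriptors.items():
        totals[label] = sum(
            min(squared_distance(q, d) for d in class_descriptors)
            for q in query_descriptors
        )
    return min(totals, key=totals.get)


# Hypothetical local descriptors stored per action class, without quantization.
labelled = {
    "walking": [[0.0, 0.0], [0.1, 0.1]],
    "waving": [[1.0, 1.0], [0.9, 1.0]],
}

query = [[0.05, 0.0], [0.2, 0.1]]
label = classify(query, labelled)
print(label)
```

No model parameters are trained here. All of the discriminative power comes from the descriptors themselves, which is consistent with the observation that feature choice matters more than the classifier.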
However, it may be noted that classical bag of video words (BOW) approaches suffer from a significant challenge. That is, the video volumes are grouped solely based on their similarity, in order to reduce the vocabulary size. Unfortunately, this is detrimental to the compositional information concerning the relationships between volumes [A3, A41]. Accordingly, the likelihood of each video volume is calculated as its similarity to the other volumes in the dataset, without considering the spatio-temporal properties of the neighboring contextual volumes. This makes the classical BOW approach excessively dependent on very local data and unable to capture significant spatio-temporal relationships. In addition, it has been shown recently that detecting actions using an “order-less” BOW does not produce acceptable recognition results [A7, A31, A33, A38, A41-A43].
What makes the BOW approaches interesting is that they code the video as a compact set of local visual features and do not require object segmentation, tracking or background subtraction. Although an initial spatio-temporal volumetric representation of human activity might eliminate these pre-processing steps, it suffers from a major drawback, namely it ignores the contextual information. In other words, different activities can be represented by the same visual vocabularies, even though they are completely different.
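The loss of contextual information can be made concrete with a toy example. The codeword labels and the two activities below are hypothetical and chosen purely for illustration. Two different activities built from the same codewords in a different temporal order yield identical order-less histograms, whereas even a minimal ordered representation separates them.

```python
from collections import Counter

# Hypothetical codeword sequences for two different activities.
sit_down = ["standing", "bending", "sitting"]
stand_up = ["sitting", "bending", "standing"]


def orderless_histogram(codewords):
    """Classical BOW representation: codeword counts only."""
    return Counter(codewords)


def ordered_pairs(codewords):
    """Consecutive codeword pairs, a minimal form of temporal context."""
    return list(zip(codewords, codewords[1:]))


same_histogram = orderless_histogram(sit_down) == orderless_histogram(stand_up)
same_ordering = ordered_pairs(sit_down) == ordered_pairs(stand_up)
print(same_histogram, same_ordering)
```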
To overcome this challenge, contextual information should be included in the original BOW framework. One solution is to employ visual phrases instead of visual words as proposed in [A43] where a visual phrase is defined as a set of spatio-temporal video volumes with a specific pre-ordained spatial and temporal structure. However, a significant drawback of this approach is that it cannot localize different activities within a video frame. Alternatively, the solution presented by Boiman and Irani [A7] is to densely sample the video and store all video volumes for a video frame, along with their relative locations in space and time. Consequently, the likelihood of a query in an arbitrary space-time contextual volume can be computed and thereby used to determine an accurate label for an action using just simple nearest neighbor classifiers [A40]. However, the significant issue with this approach is that it requires excessive computational time and a considerable amount of memory to store all of the volumes as well as their spatio-temporal relationships. The inventors within embodiments of the invention have established an alternative to this approach as described below.
In addition to Boiman and Irani [A7], several other methods have been proposed to incorporate spatio-temporal structure in the context of BOW [A61]. These are often based on co-occurrence matrices that are employed to describe contextual information. For example, the well-known correlogram exploits spatio-temporal co-occurrence patterns [A4]. However, only the relationship between the two nearest volumes was considered. This makes the approach too local and unable to capture complex relationships between different volumes. Another approach is to use a coarse grid and construct a histogram to subdivide the space-time volumes [A35]. Similarly, in [A36], contextual information is added to the BOW by employing a coarse grid at different spatio-temporal scales. An alternative that does incorporate contextual information within a BOW framework is presented in [A42], in which three-dimensional spatio-temporal pyramid matching is employed. While not actually comparing the compositional graphs of image fragments, this technique is based on the original two-dimensional spatial pyramid matching of multi-resolution histograms of patch features [A41]. Likewise in [A44], temporal relationships between clustered patches are modeled using ordinal criteria, e.g., equals, before, overlaps, during, after, etc., and expressed by a set of histograms for all patches in the whole video sequence. Similar to [A44], in [A45] ordinal criteria are employed to model spatio-temporal compositions of clustered patches in the whole video frame during very short temporal intervals.
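The use of ordinal criteria to describe temporal relationships can be sketched as follows. This is a simplified illustration rather than the formulation of [A44] or [A45]: the relation set is reduced, touching intervals are treated as "before" or "after", the extra label "contains" is introduced as an assumption, and the patch intervals are hypothetical frame ranges.

```python
from collections import Counter
from itertools import combinations


def temporal_relation(a, b):
    """Ordinal relation of interval a to interval b, each given as (start, end)."""
    a_start, a_end = a
    b_start, b_end = b
    if (a_start, a_end) == (b_start, b_end):
        return "equals"
    if a_end <= b_start:
        return "before"
    if b_end <= a_start:
        return "after"
    if b_start <= a_start and a_end <= b_end:
        return "during"
    if a_start <= b_start and b_end <= a_end:
        return "contains"  # assumed label for the inverse of "during"
    return "overlaps"


def relation_histogram(intervals):
    """Histogram of ordinal relations over all unordered pairs of patches."""
    return Counter(temporal_relation(a, b) for a, b in combinations(intervals, 2))


# Hypothetical frame ranges during which three clustered patches are observed.
patch_intervals = [(0, 2), (1, 4), (5, 6)]
histogram = relation_histogram(patch_intervals)
print(histogram)
```

The number of pairs grows quadratically with the number of patches, and each additional relation type adds another histogram bin, which hints at why such representations become large.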
However, as with Boiman and Irani [A7] the main problems associated with this are the large size of the spatio-temporal relationship histograms and the many parameters associated with the spatio-temporal ordinal criteria. Accordingly [A46] exploits spatial information which is coded through the concatenation of video words detected in different spatial regions as well as data mining techniques, which are used to find frequently occurring combinations of features. Similarly, [A47] addresses the complexity and processing overhead by using the spatial configuration of the 2D patches through incorporating their weighted sum. In [A38], these patches were represented using 3D Gaussian distributions of the spatio-temporal gradient and the temporal relationship between these Gaussian distributions was modeled using hidden Markov models (HMMs). An interesting alternative is to incorporate mutual contextual information of objects and human body parts by using a random tree structure [A28, A34] in order to partition the input space. The likelihood of each spatio-temporal region in the video is then calculated. The primary issue with this approach [A34], however, is that it requires several pre-processing stages including background subtraction, interest point tracking and detection of regions of interest.
Accordingly, within the prior art, hierarchical clustering has been presented as an attractive way of incorporating the contextual structure of video volumes, as well as preserving the compactness of their description [A33, A11]. For example, a modified version of [A7] was presented in [A11] with a hierarchical approach in which a two-level clustering method is employed. At the first level, all similar volumes are categorized. Then clustering is performed on randomly selected groups of spatio-temporal volumes while considering the relationships in space and time between the five nearest spatio-temporal volumes. However, the small number of spatio-temporal volumes involved again makes this method inherently local in nature. Another hierarchical approach is presented in [A33], attempting to capture the compositional information of a subset of the most discriminative video volumes. However, within these prior art solutions presented to date, although a higher level of quantization in the action space produces a compact subset of video volumes, it also significantly reduces the discriminative power of the descriptors, an issue which is addressed in [A40].
Generally, the prior art approaches described above for modeling the mutual relationships between video volumes have one or more limitations including, but not limited to, considering relationships between only a pair of local video volumes [A42, A4]; being too local and unable to capture interactions of different body parts [A33, A48]; and considering either the spatial or the temporal order of volumes, but not both [A4].