Analysis of visual objects in images is a very important component in computer vision systems, which perform object recognition, image retrieval, image registration, and more. Areas where such systems are deployed are diverse and include such applications as surveillance (security), video forensics, and medical image analysis for computer-aided diagnosis, to mention just a few. In particular, the object recognition problem has attracted much attention recently due to the increasing demand for developing real-world systems.
Recognition is mainly divided into two parts: category recognition (classification) and detection/localization. The goal of object category recognition is to classify a given object into one of several pre-specified categories, while the goal of object detection is to separate objects of interest from the background in a target image. In the current literature, a popular object recognition paradigm is probabilistic constellation or parts-and-shape models that represent not only the statistics of individual parts, but also their spatial layout. These are based on learning-based classifiers that require an intensive learning/training phase of the classifier parameters and thus are called parametric methods. Object detection is also a critical part of many applications such as image retrieval, scene understanding, and surveillance systems; however, it is still an open problem because intra-class variation makes generic detection very complicated, requiring various types of pre-processing steps. The sliding window scheme is usually used, taking the peak confidence values as an indication of the presence of an object in a given region. Most recent successful localization methods rely on this technique, but these too still require a training phase. Recently, the recognition task with only one query (training-free) has received increasing attention for important applications such as automatic passport control at airports, where a single photo in the passport is the only example available. Another application is image retrieval from the web. In the retrieval task, a single probe or query image is provided by users and every gallery image in the database is compared with the single probe, posing an image-to-image matching problem. Recently, the face image retrieval task led to intensive activity in this area, culminating in the Face Recognition Grand Challenge (FRGC).
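As an illustration only, the sliding window scheme mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions: the confidence measure is normalized cross-correlation against a single template, chosen here purely as a stand-in for whatever classifier confidence a real detector would produce, and the function name `sliding_window_peak` and its parameters are hypothetical.

```python
import numpy as np


def sliding_window_peak(image, template, stride=1):
    """Slide `template` over a 2-D `image` and return the peak confidence.

    Assumption: confidence is normalized cross-correlation, used here only
    as a placeholder for a detector's confidence score.
    Returns (best_score, (row, col)) for the top-left corner of the best window.
    """
    H, W = image.shape
    h, w = template.shape
    t = template - template.mean()          # zero-mean template
    t_norm = np.linalg.norm(t)
    best_score, best_pos = -np.inf, None
    for r in range(0, H - h + 1, stride):
        for c in range(0, W - w + 1, stride):
            patch = image[r:r + h, c:c + w]
            p = patch - patch.mean()        # zero-mean window
            denom = np.linalg.norm(p) * t_norm
            # Flat windows have zero norm; give them zero confidence.
            score = float((p * t).sum() / denom) if denom > 0 else 0.0
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_score, best_pos
```

Calling `sliding_window_peak(image, template)` returns the highest confidence and the window location where it occurs; a detector would typically compare that peak against a threshold before declaring an object present.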
More generally, by taking into account a set of images that represents intra-class variations, more robust object recognition can be achieved. Such sets may consist of observations acquired from a video sequence or from multiple still shots. In other words, classifying an unknown set of images into one of the training classes can be achieved through set-to-image or set-to-set matching without an intensive training phase. As a successful example of set-to-image matching, it has very recently been shown that a trivial nearest-neighbor (NN) based image classifier operating in the space of local image descriptors, such as SIFT and local self-similarity, is extremely simple and efficient, and even outperforms the leading learning-based image classifiers.
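One way to picture a nearest-neighbor classifier in descriptor space is sketched below. It is not a reproduction of any particular published method: it assumes that each image has already been converted to an array of local descriptors (for example SIFT vectors, here just generic rows of numbers), that each class is represented by a pooled array of descriptors, and that the decision rule is the smallest sum of squared nearest-neighbor distances. The name `nn_image_to_class` is hypothetical.

```python
import numpy as np


def nn_image_to_class(query_desc, class_desc):
    """Classify a query by nearest-neighbor distances in descriptor space.

    query_desc: (n, d) array of local descriptors from the query image.
    class_desc: dict mapping a class label to an (m, d) array of descriptors
                pooled from that class's example images (assumed layout).
    Returns (best_label, totals), where totals[label] is the sum over query
    descriptors of the squared distance to the nearest descriptor in that class.
    """
    totals = {}
    for label, pool in class_desc.items():
        # Pairwise squared Euclidean distances, shape (n, m), via broadcasting.
        diff = query_desc[:, None, :] - pool[None, :, :]
        d2 = (diff ** 2).sum(axis=2)
        # Keep only the nearest neighbor for each query descriptor.
        totals[label] = float(d2.min(axis=1).sum())
    best_label = min(totals, key=totals.get)
    return best_label, totals
```

No parameters are learned here; the only "training" is storing the descriptors, which is what makes such a scheme attractive when few examples are available.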
A huge number of videos are available online today and the number is rapidly growing. Human actions constitute one of the most important parts in movies, TV shows, and consumer-generated videos. Analysis of human actions in videos is considered a very important problem in computer vision because of such applications as human-computer interaction, content-based video retrieval, visual surveillance, analysis of sports events and more. The term “action” refers to a simple motion pattern as performed by a single subject, and in general lasts only for a short period of time, namely just a few seconds. Action is often distinguished from activity in the sense that action is an individual atomic unit of activity. In particular, human action refers to physical body motion. Recognizing human actions from video is a very challenging problem due to the fact that physical body motion can look very different depending on the context: for instance, similar actions with different clothes, or in different illumination and background can result in a large appearance variation; or, the same action performed by two different people may look quite dissimilar in many ways.
The goal of action classification is to classify a given action query into one of several pre-specified categories (for instance, the six categories of the KTH action dataset: boxing, hand clapping, hand waving, jogging, running, and walking). Meanwhile, action detection is meant to separate an action of interest from the background in a target video (for instance, spatiotemporal localization of a walking person). The disadvantages of learning-based methods are that they require a large number of training examples and explicit motion estimation.
In general, the target video may contain actions similar to the query, but these will typically appear in a completely different context, as shown in FIG. 1 and FIGS. 2(a)-2(b). FIG. 1 shows a hand-waving action and possibly similar actions. FIG. 2(a) shows the action detection problem: given a query video Q, it is desired to detect/localize actions of interest in a target video T, with T divided into a set of overlapping cubes. FIG. 2(b) shows space-time local steering kernels (3-D LSKs) capturing the geometric structure of the underlying data. Such differences can range from rather simple optical or geometric differences (such as different clothes, lighting, action speed and scale changes) to more complex inherent structural differences, such as a hand-drawn action video clip (e.g., animation) rather than a real human action.
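The division of a target video T into overlapping cubes can be sketched as below. The cube size and stride are assumptions made for illustration (the text does not specify them); a natural choice would be to make each cube the same size as the query Q. The video is assumed to be a grayscale array indexed as (time, height, width), and `overlapping_cubes` is a hypothetical helper name.

```python
import numpy as np


def overlapping_cubes(video, cube_shape, stride):
    """Yield ((t, y, x), cube) for each space-time cube in `video`.

    video:      array of shape (T, H, W), assumed grayscale frames.
    cube_shape: (ct, ch, cw) size of each cube (e.g., the query's shape).
    stride:     (st, sy, sx) step along each axis; any stride smaller than
                the cube size along an axis makes neighboring cubes overlap.
    """
    T, H, W = video.shape
    ct, ch, cw = cube_shape
    st, sy, sx = stride
    for t in range(0, T - ct + 1, st):
        for y in range(0, H - ch + 1, sy):
            for x in range(0, W - cw + 1, sx):
                # Slicing returns a view, so no frame data is copied.
                yield (t, y, x), video[t:t + ct, y:y + ch, x:x + cw]
```

Each yielded cube could then be compared against the query, with the cube's (t, y, x) offset giving the spatiotemporal location of any match.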
Over the last two decades, many studies have attempted to tackle this problem and made impressive progress. Approaches can be categorized on the basis of action representation; namely, appearance-based representation, shape-based representation, optical-flow-based representation, interest-point-based representation, and volume-based representation.
As examples of the interest-point-based approach, which has gained a lot of interest, one approach treats videos as spatiotemporal bags-of-words by extracting space-time interest points and clustering the features, and then uses a probabilistic Latent Semantic Analysis (pLSA) model to localize and categorize human actions. Another approach also used spatiotemporal features, extending the naive Bayes nearest neighbor classifier, which was developed for object recognition, to action recognition. By modifying the efficient branch-and-bound search method for the 3-D case, its authors provided a very fast action detection method. However, the performance of these methods can degrade due to 1) the lack of enough training samples; and 2) misdetections and occlusions of the interest points, since they ignore global space-time information.
Another approach recently employed a three-dimensional correlation scheme for action detection. It focused on sub-volume matching in order to find similar motion between two space-time volumes, which can be computationally heavy. A further approach uses boosting on 3-D Haar-type features inspired by similar features in 2-D object detection. While these features are very efficient to compute, many examples are required to train an action detector in order to achieve good performance. The same authors further proposed a part-based shape and flow matching framework and showed good action detection performance in crowded videos.
One approach generalized canonical correlation analysis to tensors and showed very good accuracy on the KTH action dataset, but this method requires a manual alignment process for camera motion compensation. A further approach proposed a system to search for human actions using a coarse-to-fine approach with a five-layer hierarchical space-time model. These volumetric methods do not require background subtraction, motion estimation, or complex models of body configuration and kinematics. They tolerate variations in appearance, scale, rotation, and movement to some extent. Methods that aim at recognizing actions based solely on one query are very useful for applications such as video retrieval from the web (e.g., viewdle, videosurf). In these methods, a single query video is provided by users and every gallery video in the database is compared with the given query, posing a video-to-video matching problem.
Accordingly, there is a need to develop an approach to the problem of human action recognition as a video-to-video matching problem, where recognition is generally divided into two parts: category classification and detection/localization. There is a further need for addressing detection and category classification problems simultaneously by searching for an action of interest within other “target” videos with only a single “query” video.