Video-based object and activity recognition and retrieval systems use a variety of descriptors to represent visual attributes. In systems that handle broad categories of objects and activities, no single descriptor type will normally suffice to represent all of the necessary attributes. For example, in activity recognition, activities normally fall into two broad categories. First, there are articulated activities, which are characterized by the motion of a person's arms and legs or of the parts of a vehicle (e.g., doors) or other object. Second, there are non-articulated activities, which are characterized by whole-body or whole-object motion, such as a vehicle turning or simply moving forward. No single descriptor has the power to represent both articulated and non-articulated activities, so systems use multiple descriptor types to handle both categories. Object recognition provides another example: it relies on additional descriptor types, including descriptors based on color, texture, rigidity, and shape.
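To make the problem concrete, the following is a minimal Python sketch, not taken from this application, of two hypothetical descriptor functions: a color histogram and a gradient-orientation (texture-like) histogram. The function names, bin counts, and synthetic image are assumptions chosen only for illustration; the point is that the two descriptor types produce vectors of different dimensionality, so they cannot be compared directly.

```python
import numpy as np

def color_histogram(img, bins=8):
    """Descriptor type 1: per-channel color histogram (3 * bins dimensions)."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def gradient_orientation_histogram(img, bins=9):
    """Descriptor type 2: histogram of gradient orientations (bins dimensions)."""
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)
    angles = np.arctan2(gy, gx)  # orientations in [-pi, pi]
    h, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    h = h.astype(float)
    return h / max(h.sum(), 1.0)

# A synthetic 32x32 RGB image stands in for a video frame.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3))

color = color_histogram(img)                    # 24-dimensional vector
texture = gradient_orientation_histogram(img)   # 9-dimensional vector
# The two descriptors live in different spaces (24-D vs. 9-D), so a
# direct distance between them is undefined.
```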
The use of multiple descriptor types, however, complicates many of the steps used in recognition and retrieval systems. For example, descriptors of different types cannot be compared directly. Two high-level classes of approaches address this inability to compare descriptors of different types. First, there are approaches that perform retrieval on each descriptor type individually and then combine the results. Second, there are approaches that attempt retrieval based on heterogeneous sets of descriptors.
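The first class of approaches, per-descriptor retrieval followed by result combination, can be sketched as follows. This is a hypothetical illustration, not the method of this application: each descriptor type ranks the gallery independently, and the per-type ranks are summed into a fused score (a simple form of rank-based late fusion). The descriptor dimensions, gallery size, and scoring rule are all assumptions made for the example.

```python
import numpy as np

def retrieve(query, gallery):
    """Rank gallery items by Euclidean distance to the query descriptor."""
    d = np.linalg.norm(gallery - query, axis=1)
    return np.argsort(d)

def late_fusion(queries, galleries):
    """Retrieve per descriptor type, then combine the per-type rankings.

    queries:   one query vector per descriptor type
    galleries: one (n_items x dim) matrix per descriptor type
    """
    n = galleries[0].shape[0]
    fused = np.zeros(n)
    for q, g in zip(queries, galleries):
        order = retrieve(q, g)
        ranks = np.empty(n)
        ranks[order] = np.arange(n)  # position of each item in this ranking
        fused += ranks               # lower summed rank = better match
    return np.argsort(fused)

# Toy gallery of 5 items with two incompatible descriptor types
# (4-D "color" and 6-D "shape"); item 2 is made close to the query
# under both types, so fusion should rank it first.
rng = np.random.default_rng(1)
q_color, q_shape = np.zeros(4), np.zeros(6)
g_color = rng.normal(size=(5, 4)) + 1.0
g_shape = rng.normal(size=(5, 6)) + 1.0
g_color[2] = 0.01
g_shape[2] = 0.01
best = late_fusion([q_color, q_shape], [g_color, g_shape])[0]
```

Note that each descriptor type is only ever compared against descriptors of its own type; the fusion step operates on rankings, not on the raw vectors, which is what makes the combination well-defined despite the differing dimensionalities.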
The subject matter described in this background section could be pursued, but it has not necessarily been previously conceived or pursued. Therefore, unless otherwise indicated herein, the subject matter described in this background section is not prior art to the claims in this application and is not admitted to be prior art by inclusion in this background section.