The present invention relates to a method and apparatus for classifying and querying temporal and spatial information in media content; for representing objects, regions, events, actions, scenes, and visual features using symbols; for constructing symbol string descriptors that order the symbols in time or space to represent temporal or spatial information; for computing precedence template descriptor matrices from the symbol strings to describe the temporal and spatial ordering of the symbols statistically; for comparing precedence template descriptor matrices in order to compare the temporal or spatial information in the video or images; for classifying video and images using symbol strings and precedence template descriptors; and for querying video and images using symbol strings and precedence template descriptors.
The growing proliferation of digital photographs and video is increasing the need for more sophisticated methods for automatically analyzing, cataloging, and searching digital representations of media content. The recent development of content-based query systems has advanced the capabilities for searching for images and video by color, texture, shape, motion and other features. Content-based methods are effective in allowing searching by feature similarity but are limited in their ability to automatically derive a higher, semantic-level understanding of the content. Content-based methods can improve searching and attain a better understanding of the content by capturing spatial and temporal information in addition to these features. In particular, effective methods are needed for describing scene and temporal information represented by the composition of regions, objects or events in space or time.
Since humans perceive images by breaking the scenes into surfaces, regions, and objects, the spatial, temporal and feature attributes of the objects and their relationships to one another are important characteristics of visual information. Content-based retrieval systems that use global descriptors, such as color histograms, miss this important spatial information. Furthermore, in a large collection of photographs, many regions recur, such as those that correspond to blue skies, oceans, grassy regions, orange horizons, mountains, building facades, and so forth. For photographic images, therefore, the detection and description of these regions and their spatial relationships are essential for truly characterizing the images for searching, classification and filtering purposes. Similarly, in video, the temporal relationships of events and objects are important features of the video content. Effectively describing the temporal composition features is essential for searching and filtering video by content.
There are many ways to capture the region information in images. Some recent approaches include 2-D strings, θ-R representations, local histograms, co-occurrence matrices, and region or event tables. However, composition descriptors such as the 2-D string and its variants are brittle in the sense that minor changes in region locations can greatly affect the comparison of two images. Descriptors such as θ-R and co-occurrence matrices are not widely applicable due to sensitivity to scale, which is problematic when comparing images of different resolutions or video segments of different temporal durations. Region or event tables are general in the sense that they capture the spatial and temporal locations; however, they do not provide a solution for measuring spatial or temporal similarity.
What is needed, therefore, and what is an object of the present invention, is to provide a system and method for characterizing media content by spatial and/or temporal ordering of regions, objects, or events.
It is another object of the invention to provide a system and method for classifying media content by the representations of spatial and/or temporal ordering.
Yet another object of the invention is to provide a system and method for searching stored media content by use of the classification by representations of spatial and/or temporal ordering.
In accordance with the aforementioned needs and objects, the present invention is directed towards an apparatus and method for classifying and querying temporal and spatial information in video; for representing objects, regions, events, actions, scenes, and visual features using symbols; for constructing symbol string descriptors that order the symbols in time or space to represent temporal or spatial information; for computing precedence template descriptor matrices from the symbol strings to describe the temporal and spatial ordering of the symbols statistically; for comparing precedence template descriptor matrices in order to compare the temporal or spatial information in the video or images; for classifying video and images using symbol strings and precedence template descriptors; and for querying video and images using symbol strings and precedence template descriptors.
Precedence Template (PT) descriptors of the present invention can be used for classifying or annotating image and video content by assigning each class of event, action, region or object a unique symbol and then building symbol strings to represent sequences in space or time. The symbol strings can be decoded using a library of annotated PT descriptors to automatically label the image and video content. Furthermore, the PT descriptors can be used for searching by sketch or searching by example, in which the search query input is converted to symbol strings that are efficiently compared based on the presence and relative counts of PT descriptors. The PT descriptor can be used for classifying and querying video based on the spatial and temporal orderings of regions, objects, actions or events. Applied to video, the PT descriptors provide a way to compare the temporal order of events, actions, or objects such as those represented in a scene transition graph, key-frame list, or event string. Applied to images, the PT descriptors provide a way to compare the spatial arrangement of image regions or objects. By capturing the spatial and temporal relationships statistically, the PT descriptors provide a robust way to measure similarity in the presence of insertions, deletions, substitutions, replications and relocations of events, actions, regions or objects.
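By way of illustration only, the following Python sketch shows one way the flow described above could be realized: a symbol string is converted into a precedence matrix, two matrices are compared, and a query is labeled using a small annotated library. The exact computation is not fixed by the foregoing description, so every detail here is an assumption: the function names (precedence_matrix, pt_distance, classify), the choice of counting every ordered pair of positions and normalizing by the total, the L1 distance, and the example symbols (S for sky, M for mountain, G for grass, W for water, N for sand, read top to bottom in an image).

```python
from collections import Counter
from itertools import combinations


def precedence_matrix(symbols):
    """Sparse precedence matrix for a symbol string.

    Maps each ordered pair (a, b) to the fraction of position pairs
    in which symbol a occurs before symbol b. Assumed formulation.
    """
    counts = Counter()
    for i, j in combinations(range(len(symbols)), 2):
        counts[(symbols[i], symbols[j])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: n / total for pair, n in counts.items()}


def pt_distance(p, q):
    """L1 distance between two sparse precedence matrices (assumed metric)."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def classify(symbols, library):
    """Return the label of the library entry nearest to the query string."""
    query = precedence_matrix(symbols)
    return min(library, key=lambda label: pt_distance(query, library[label]))


# Hypothetical annotated library of PT descriptors.
library = {
    "mountain landscape": precedence_matrix("SSMG"),
    "beach": precedence_matrix("SWN"),
}

# A query in which one sky region is replaced by a second mountain region.
label = classify("SMMG", library)
print(label)
```

Because the matrix records only relative order and relative counts, replicating or substituting a region shifts a few entries rather than invalidating the whole comparison, whereas reversing the order of the regions changes most entries. Other normalizations or distance measures could be substituted without changing the overall flow.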