The invention relates to image processing for indexing and retrieval of image sequences, e.g., video. More particularly, the invention relates to an efficient framework for context-based indexing and retrieval of image sequences with emphasis on motion description.
With the explosion of available multimedia content, e.g., audiovisual content, the organization and management of this ever-growing and complex information becomes increasingly important. Specifically, as libraries of multimedia content continue to grow, indexing this highly complex information to facilitate efficient retrieval at a later time becomes unwieldy.
By standardizing a minimum set of descriptors that describe multimedia content, content present in a wide variety of databases can be located, thereby making the search and retrieval more efficient and powerful. International standards bodies such as the Moving Picture Experts Group (MPEG) have embarked on standardizing such an interface that can be used by indexing engines, search engines, and filtering agents. This new member of the MPEG standards is named multimedia content description interface and has been code-named "MPEG-7".
For example, a typical content description of a video sequence can be obtained by dividing the sequence into "shots". A "shot" can be defined as a sequence of frames in a video clip that depicts an event and is preceded and followed by an abrupt scene change or a special effect scene change such as a blend, dissolve, wipe or fade. Detection of shot boundaries enables event-wise random access into a video clip and thus constitutes the first step towards content search and selective browsing. Once a shot is detected, representative frames called "key frames" are extracted to capture the evolution of the event, e.g., key frames can be identified to represent an explosion scene, an action chase scene, a romantic scene and so on. This reduces the complex problem of processing many video frames of an image sequence to processing only a few key frames. The existing body of knowledge in low-level abstraction of scene content such as color, shape, and texture from still images can then be applied to extract the meta-data for the key frames.
While offering a simple solution to extract meta-data, the above description has no motion-related information. Motion information can considerably expand the scope of queries that can be made about content (e.g., queries can have "verbs" in addition to "nouns"). Namely, it is advantageous to correlate known information based on color, shape, and texture descriptors with motion information, so as to convey a more intelligent description of the dynamics of the scene that can be used by a search engine. Instead of analyzing a scene from a single perspective and storing only the corresponding meta-data, it is advantageous to capture relative object motion information as a descriptor that will ultimately support fast analysis of scenes on the fly from different perspectives, thereby supporting a wider range of unexpected queries. For example, this can be very important in application areas such as security and surveillance, where it is not always possible to anticipate the queries.
Therefore, there is a need in the art for an apparatus and method for extracting and describing motion information in an image sequence, thereby improving image processing functions such as content-based indexing and retrieval, and various encoding functions.
One embodiment of the present invention is an apparatus and method for implementing object motion segmentation and object trajectory segmentation for an image sequence, thereby improving or offering other image processing functions such as context-based indexing of the input image sequence by using motion-based information. More specifically, block-based motion vectors are used to derive optical flow motion parameters, e.g., affine motion parameters.
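As a non-limiting illustration, the derivation of affine motion parameters from block-based motion vectors can be sketched as a least-squares fit. The sketch below assumes a six-parameter model, u = a1 + a2*x + a3*y and v = a4 + a5*x + a6*y, evaluated at block centers (x, y) with block motion vector (u, v). The function names `fit_affine` and `solve3`, the tuple layout of the input, and the choice of an ordinary least-squares fit are assumptions made for illustration and are not taken from the disclosed embodiment.

```python
def solve3(A, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented copy, inputs untouched
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("block centers are degenerate (e.g., collinear)")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            f = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= f * M[col][c]
    x = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        s = M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))
        x[i] = s / M[i][i]
    return x


def fit_affine(blocks):
    """Fit (a1..a6) to block motion vectors.

    blocks: iterable of (x, y, u, v), where (x, y) is a block center and
    (u, v) is that block's motion vector (hypothetical input layout).
    """
    S = [[0.0] * 3 for _ in range(3)]  # normal-equation matrix, shared by u and v
    bu = [0.0] * 3
    bv = [0.0] * 3
    for x, y, u, v in blocks:
        p = (1.0, float(x), float(y))
        for i in range(3):
            for j in range(3):
                S[i][j] += p[i] * p[j]
            bu[i] += p[i] * u
            bv[i] += p[i] * v
    return solve3(S, bu) + solve3(S, bv)  # [a1, a2, a3, a4, a5, a6]
```

In such a sketch, at least three non-collinear block centers are needed for the fit to be determined; blocks whose motion vectors fit a common parameter set poorly could be treated as belonging to a different object.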
Specifically, optical flow (e.g., affine) object motion segmentation is initially performed for a pair of adjacent frames. The affine motion parameters are then used to determine or identify key objects within each frame. These key objects are then monitored over some intervals of the image sequence (also known as a "shot" having a number of frames of the input image sequence) and their motion information is extracted and tracked over those intervals.
Next, optical flow (e.g., affine) trajectory segmentation is performed on the image sequence. Specifically, the affine motion parameters generated for each identified key object for each adjacent pair of frames are processed over an interval of the image sequence to effect object trajectory segmentation. Namely, motion trajectory information such as direction, velocity and acceleration can be deduced for each key object over some frame interval, thereby providing another aspect of motion information that can be exploited by a query.
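By way of a hedged example, trajectory quantities of this kind could be deduced from per-frame-pair motion parameters by finite differences. The sketch below assumes that the translational terms of a key object's affine parameters are available as a (dx, dy) displacement for each adjacent pair of frames; the function name `trajectory_descriptors`, the `frame_interval` parameter, and the use of simple finite differences are illustrative assumptions rather than the disclosed method.

```python
import math


def trajectory_descriptors(translations, frame_interval=1.0):
    """Derive speed, direction and acceleration for one key object.

    translations: list of (dx, dy) displacements, one per adjacent frame pair
    (hypothetical input; e.g., the translational affine terms).
    frame_interval: time between adjacent frames, in arbitrary units.
    """
    # Speed: displacement magnitude per unit time.
    speeds = [math.hypot(dx, dy) / frame_interval for dx, dy in translations]
    # Direction: angle of the displacement in degrees, in the range (-180, 180].
    directions = [math.degrees(math.atan2(dy, dx)) for dx, dy in translations]
    # Acceleration: change in speed between consecutive frame pairs.
    accelerations = [
        (s1 - s0) / frame_interval for s0, s1 in zip(speeds, speeds[1:])
    ]
    return speeds, directions, accelerations
```

A frame interval could then be split into trajectory segments wherever, for instance, the direction or acceleration changes by more than a chosen threshold; the threshold itself would be an application-dependent choice.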