The invention relates to image processing for describing the motion of objects in image sequences, e.g., video. More particularly, the invention relates to an efficient framework for object trajectory segmentation, which in turn can be employed to improve image processing functions such as content-based indexing and retrieval of image sequences, with emphasis on motion description.
With the explosion of available multimedia content, e.g., audiovisual content, the need to organize and manage this ever-growing and complex information has become important. Specifically, as libraries of multimedia content continue to grow, indexing this highly complex information in a way that supports efficient retrieval at a later time becomes unwieldy.
By standardizing a minimum set of descriptors that describe multimedia content, content present in a wide variety of databases can be located, making search and retrieval more efficient and powerful. International standards bodies such as the Moving Picture Experts Group (MPEG) have embarked on standardizing such an interface for use by indexing engines, search engines, and filtering agents. This new member of the MPEG family of standards is named the Multimedia Content Description Interface and has been code-named "MPEG-7".
For example, a typical content description of a video sequence can be obtained by dividing the sequence into "shots". A "shot" can be defined as a sequence of frames in a video clip that depicts an event and is preceded and followed by an abrupt scene change or a special-effect scene change such as a blend, dissolve, wipe, or fade. Detecting shot boundaries enables event-wise random access into a video clip and thus constitutes the first step towards content search and selective browsing. Once a shot is detected, representative frames called "key frames" are extracted to capture the evolution of the event; e.g., key frames can be identified to represent an explosion scene, an action chase scene, a romantic scene, and so on. This reduces the complex problem of processing the many frames of an image sequence to processing only a few key frames. The existing body of knowledge in low-level abstraction of scene content from still images, such as color, shape, and texture, can then be applied to extract meta-data for the key frames.
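By way of illustration only, abrupt shot boundaries are often detected by comparing the intensity histograms of consecutive frames. The Python sketch below assumes that frames are lists of rows of 8-bit grayscale values and that a cut occurs wherever the L1 distance between normalized histograms exceeds a hypothetical threshold of 0.5. It detects abrupt cuts only, not gradual transitions such as dissolves or wipes, and it takes the middle frame of each shot as that shot's key frame, which is one simple choice among many. The names `gray_histogram`, `detect_cuts`, `split_into_shots`, and `key_frames` are illustrative and do not come from the source.

```python
def gray_histogram(frame, bins=16):
    """Normalized intensity histogram of a frame given as rows of 0-255 ints."""
    hist = [0] * bins
    count = 0
    for row in frame:
        for pixel in row:
            hist[pixel * bins // 256] += 1
            count += 1
    return [h / count for h in hist]


def detect_cuts(frames, threshold=0.5):
    """Indices i where an abrupt cut occurs between frames[i-1] and frames[i]."""
    cuts = []
    prev = gray_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = gray_histogram(frames[i])
        # L1 distance between normalized histograms lies in [0, 2].
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        if diff > threshold:
            cuts.append(i)
        prev = cur
    return cuts


def split_into_shots(frames, threshold=0.5):
    """Split the sequence into (start, end) shot ranges, end exclusive."""
    bounds = [0] + detect_cuts(frames, threshold) + [len(frames)]
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]


def key_frames(shots):
    """Pick the middle frame index of each shot as a simple key frame."""
    return [(start + end) // 2 for start, end in shots]
```

For two dark frames followed by three bright ones, `split_into_shots` yields the shots `(0, 2)` and `(2, 5)`, and `key_frames` selects frames 1 and 3.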
While offering a simple solution for extracting meta-data, the above description contains no motion-related information. Motion information can considerably expand the scope of queries that can be made about content (e.g., queries can contain "verbs" in addition to "nouns"). Namely, it is advantageous to correlate the known information from color, shape, and texture descriptors with motion information, so as to convey a more intelligent description of the scene dynamics that a search engine can use. Instead of analyzing a scene from a single perspective and storing only the corresponding meta-data, it is advantageous to capture relative object motion information as a descriptor that supports fast, on-the-fly analysis of scenes from different perspectives, thereby supporting a wider range of unanticipated queries. This can be very important in application areas such as security and surveillance, where it is not always possible to anticipate the queries.
Therefore, there is a need in the art for an apparatus and method for extracting and describing motion information in an image sequence, thereby improving image processing functions such as content-based indexing and retrieval, and various encoding functions.
One embodiment of the present invention is an apparatus and method for implementing object trajectory segmentation for an image sequence, thereby improving or enabling other image processing functions, such as content-based indexing of the input image sequence using motion-based information. More specifically, block-based motion vectors are used to derive optical flow motion parameters, e.g., affine motion parameters. These optical flow motion parameters are employed to develop a prediction that is used to perform object trajectory segmentation for the image sequence.
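The source does not specify how affine parameters are derived from block motion vectors. One common approach is an ordinary least-squares fit of a six-parameter affine model, u = a0 + a1*x + a2*y and v = a3 + a4*x + a5*y, to the block centers (x, y) and their motion vectors (u, v). The Python sketch below assumes this model and solves the two 3x3 normal-equation systems directly. The function names `solve3` and `fit_affine` are hypothetical, and practical systems typically also weight or reject unreliable blocks, which is omitted here.

```python
def solve3(m, r):
    """Solve the 3x3 system m @ p = r by Gaussian elimination with partial pivoting."""
    a = [row[:] + [r[i]] for i, row in enumerate(m)]  # augmented copy; m is untouched
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(a[k][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("degenerate layout: need 3 non-collinear blocks")
        a[col], a[pivot] = a[pivot], a[col]
        for k in range(col + 1, 3):
            f = a[k][col] / a[col][col]
            for j in range(col, 4):
                a[k][j] -= f * a[col][j]
    p = [0.0] * 3
    for i in range(2, -1, -1):  # back substitution
        p[i] = (a[i][3] - sum(a[i][j] * p[j] for j in range(i + 1, 3))) / a[i][i]
    return p


def fit_affine(blocks):
    """blocks: list of (x, y, u, v) giving a block center and its motion vector.

    Returns (a0, a1, a2, a3, a4, a5) minimizing the squared error of
    u = a0 + a1*x + a2*y and v = a3 + a4*x + a5*y over all blocks.
    """
    ata = [[0.0] * 3 for _ in range(3)]
    atu = [0.0] * 3
    atv = [0.0] * 3
    for x, y, u, v in blocks:
        row = (1.0, x, y)
        for i in range(3):
            for j in range(3):
                ata[i][j] += row[i] * row[j]
            atu[i] += row[i] * u
            atv[i] += row[i] * v
    return tuple(solve3(ata, atu) + solve3(ata, atv))
```

Because the horizontal and vertical components share the same design matrix, the two fits reuse one normal matrix, and at least three non-collinear blocks are needed for a unique solution.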
Specifically, optical flow (e.g., affine) object motion segmentation is initially performed for a pair of adjacent frames. Namely, optical flow motion parameters between adjacent frames that describe the position of each point on a region at each time instant are made available to the present object trajectory segmenter. The present invention is not limited by the method or model that is employed to provide the initial optical flow motion parameters between adjacent frames.
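Because the source leaves the pairwise motion model open, the following Python sketch assumes the six-parameter affine displacement model u = a0 + a1*x + a2*y, v = a3 + a4*x + a5*y between frame t and frame t+1. Chaining these pairwise maps gives the position of each point of a region at every time instant. The names `apply_affine` and `track_region` are hypothetical.

```python
def apply_affine(params, point):
    """Move a point by the affine displacement u = a0+a1*x+a2*y, v = a3+a4*x+a5*y."""
    a0, a1, a2, a3, a4, a5 = params
    x, y = point
    return (x + a0 + a1 * x + a2 * y, y + a3 + a4 * x + a5 * y)


def track_region(points, pairwise_params):
    """Return trajectories[t][k], the position of region point k at frame t.

    pairwise_params[t] holds the affine parameters mapping frame t to frame t+1,
    so the result has len(pairwise_params) + 1 frames.
    """
    trajectories = [list(points)]
    for params in pairwise_params:
        trajectories.append([apply_affine(params, p) for p in trajectories[-1]])
    return trajectories
```

With a pure translation of one pixel per frame, for example, the point `(0.0, 0.0)` is tracked to `(3.0, 0.0)` after three frame pairs.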
The object trajectory segmenter applies the optical flow motion parameters to form a new prediction, i.e., a method for predicting the positions of all the points on an object over time within an interval. For example, the optical flow motion parameters are curve fitted to form the new prediction. The new prediction is then applied, and the result is evaluated with an error metric. For example, the error metric measures the sum of the distance deviations, at each point on the region and at each time instant, between the new prediction and the original predictions. The results of this evaluation dictate the proper intervals (temporal boundaries) of the image sequence over which the motion parameters are valid for various key objects. In other words, it is important to detect the motion segments, i.e., the temporal boundaries, of each key object. In doing so, the present object trajectory segmenter obtains two sets of important information: the motion parameter values that accurately describe the object's motion, and the frames for which those parameters are valid.
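The following Python sketch shows one way to realize this step under stated assumptions. `traj[t][k]` holds the tracked position of region point k at frame t, for example as obtained from the pairwise affine maps. As a simplification, a low-order polynomial in time is fitted to each point's trajectory directly rather than to the affine parameters themselves. The default is quadratic, i.e., constant acceleration. The error metric is the sum of Euclidean deviations between tracked and fitted positions. Each interval is grown greedily, one frame at a time, until that error exceeds a hypothetical tolerance `max_error`, at which point a temporal boundary is declared. All function names and the greedy strategy are illustrative assumptions.

```python
import math


def solve(m, r):
    """Solve a small dense linear system m @ x = r (Gaussian elimination, partial pivoting)."""
    n = len(r)
    a = [row[:] + [r[i]] for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda k: abs(a[k][c]))
        a[c], a[p] = a[p], a[c]
        for k in range(c + 1, n):
            f = a[k][c] / a[c][c]
            for j in range(c, n + 1):
                a[k][j] -= f * a[c][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (a[i][n] - sum(a[i][j] * x[j] for j in range(i + 1, n))) / a[i][i]
    return x


def polyfit(ts, vals, degree):
    """Least-squares coefficients c, lowest order first: c[0] + c[1]*t + ..."""
    d = degree + 1
    m = [[sum(t ** (i + j) for t in ts) for j in range(d)] for i in range(d)]
    r = [sum(v * t ** i for t, v in zip(ts, vals)) for i in range(d)]
    return solve(m, r)


def fit_error(traj, start, end, degree=2):
    """Sum of distances between tracked and fitted positions over frames [start, end)."""
    ts = list(range(end - start))       # local time keeps the normal equations well scaled
    deg = min(degree, len(ts) - 1)      # avoid an underdetermined fit on short intervals
    total = 0.0
    for k in range(len(traj[start])):
        xs = [traj[start + t][k][0] for t in ts]
        ys = [traj[start + t][k][1] for t in ts]
        cx, cy = polyfit(ts, xs, deg), polyfit(ts, ys, deg)
        for t, x, y in zip(ts, xs, ys):
            px = sum(c * t ** i for i, c in enumerate(cx))
            py = sum(c * t ** i for i, c in enumerate(cy))
            total += math.hypot(x - px, y - py)
    return total


def segment_trajectory(traj, max_error, degree=2):
    """Greedily split frames into (start, end) segments, end exclusive."""
    segments, start, n = [], 0, len(traj)
    while start < n:
        end = start + 1
        while end < n and fit_error(traj, start, end + 1, degree) <= max_error:
            end += 1
        segments.append((start, end))
        start = end
    return segments
```

Greedy growth is simple and makes a single pass over the sequence, but it may place a boundary later than a globally optimal split would. Searching over candidate boundaries, e.g., by dynamic programming, is a possible alternative at higher cost.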
Namely, the optical flow (e.g., affine) motion parameters generated for each identified key object for each adjacent pair of frames are processed over an interval of the image sequence to perform object trajectory segmentation. From this, motion trajectory attributes such as direction, velocity, and acceleration can be deduced for each key object over a frame interval, thereby providing another aspect of motion information that queries can exploit.
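As an illustration, assume that each key object is summarized by the centroid of its region in every frame of a segment. Per-frame velocity, speed, direction, and acceleration can then be estimated with central differences, as in the Python sketch below. Velocity and speed are in pixels per frame, and acceleration is in pixels per frame squared. Direction is `atan2(vy, vx)` in degrees in the trajectory's own coordinate frame; in image coordinates, where y grows downward, the sense of rotation is reversed. For a trajectory that is exactly quadratic in time, as within a segment described by a constant-acceleration model, these differences recover the true velocity and acceleration. The names `centroid` and `motion_descriptors` are hypothetical.

```python
import math


def centroid(points):
    """Mean position of a region's points."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def motion_descriptors(centroids):
    """Motion attributes for each interior frame of a per-frame centroid trajectory."""
    out = []
    for t in range(1, len(centroids) - 1):
        (x0, y0), (x1, y1), (x2, y2) = centroids[t - 1], centroids[t], centroids[t + 1]
        vx, vy = (x2 - x0) / 2, (y2 - y0) / 2        # central first difference
        ax, ay = x2 - 2 * x1 + x0, y2 - 2 * y1 + y0  # second difference
        out.append({
            "frame": t,
            "velocity": (vx, vy),
            "speed": math.hypot(vx, vy),
            "direction_deg": math.degrees(math.atan2(vy, vx)),
            "acceleration": (ax, ay),
        })
    return out
```

For the trajectory x = t^2, y = 3t, for example, frame 2 has velocity (4, 3), speed 5, and acceleration (2, 0).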