The present invention relates to the field of image processing and, more particularly, to a system and related methods for analyzing compressed media content.
With recent improvements in processing and storage technologies, many personal computing systems have the capacity to receive, process and render multimedia objects (e.g., audio, graphical and video content). The Moving Picture Experts Group (MPEG), under the auspices of the International Organization for Standardization (ISO) and the International Telecommunication Union (ITU), has developed a number of standards for digitizing and compressing multimedia to accommodate high-bandwidth (MPEG-1 and MPEG-2) and low-bandwidth (MPEG-4) applications.
Those skilled in the art will appreciate that accurately representing a multimedia object in the digital domain requires a vast amount of data. Accordingly, it is common to “compress” the digitized multimedia object to satisfy storage media and/or communication bandwidth requirements. The MPEG coding includes information in the bit stream to provide synchronization of audio and video signals, initial and continuous management of coded data buffers to prevent overflow and underflow, random access start-up, and absolute time identification. The coding layer specifies a multiplex data format that allows multiplexing of multiple simultaneous audio and video streams as well as privately defined data streams.
The basic principle of MPEG system coding is the use of time stamps that specify the decoding and display time of audio and video and the time of reception of the multiplexed coded data at the decoder. The basic scheme of MPEG is to predict motion from frame to frame in the temporal direction and then to use Discrete Cosine Transform (DCT) coefficients to organize the redundancy in the spatial directions.
In general, MPEG encodes video data into three types of frames, each frame comprising blocks of data and groups of blocks or macroblocks. A first frame type is an intra (I) frame. The I frame is a frame coded as a still image, not using any information from previous frames or successive frames. It is the first frame in the bit stream and contains texture and color information for the video sequence. A second frame type is a predicted (P) frame. The P frames in the bit stream are predicted from the most recently reconstructed I or P frame. Each block in the P frame can either have a vector and difference DCT coefficients or it can be intra coded, like the I frame. A third frame type is a bidirectional (B) frame. The B frame is predicted from the closest two I or P frames, one previous frame and one successive frame, in the bit stream.
When coding a compressed video sequence, the compression engine uses a forward vector and a rearward vector. The compression engine interpolates between two blocks of two frames to code the block currently being coded. If this is not possible, the block is intra-coded. The sequence of frames in a bit stream typically appears as IBBPBBPBBPBBIBBPBBPB . . . .
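The interpolation step described above can be illustrated with a minimal sketch. It assumes that the interpolated prediction is a rounded average of the forward and rearward reference blocks, and that the fallback to intra coding is decided by a sum-of-absolute-differences threshold. The function names and the max_sad parameter are hypothetical and are not taken from any standard or from the system described here.

```python
def interpolate_blocks(forward_block, backward_block):
    """Rounded average of two reference blocks (assumed interpolation rule)."""
    return [[(f + b + 1) // 2 for f, b in zip(f_row, b_row)]
            for f_row, b_row in zip(forward_block, backward_block)]


def residual(current, prediction):
    """Difference that would be transformed and coded for a predicted block."""
    return [[c - p for c, p in zip(c_row, p_row)]
            for c_row, p_row in zip(current, prediction)]


def choose_mode(current, prediction, max_sad):
    """Hypothetical rule: intra-code the block when the prediction is poor."""
    sad = sum(abs(v) for row in residual(current, prediction) for v in row)
    return "intra" if sad > max_sad else "interpolated"


# Tiny 2x2 blocks stand in for real 8x8 blocks to keep the example short.
forward = [[100, 102], [104, 106]]    # block from the previous I or P frame
backward = [[110, 112], [114, 116]]   # block from the successive I or P frame
current = [[106, 107], [108, 111]]    # block being coded in the B frame

prediction = interpolate_blocks(forward, backward)
difference = residual(current, prediction)
mode = choose_mode(current, prediction, max_sad=8)
```

A real encoder would also search for the motion vectors that select the two reference blocks; here they are assumed to be known already.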
There are typically twelve frames from I to I. This is based on a random access requirement; typically, a starting point is needed at least once every 0.4 seconds. Providing an I frame every twelve frames correlates to starting with an I frame every 0.4 seconds. To decode the bit stream, the I frame is first decoded followed by the first P frame. The two B frames in between the I frame and the P frame are then decoded. The primary purpose of the B frames is to reduce noise in the video by filling in or interpolating a three-dimensional video signal, between the I and the P frames, typically over a 33 or 25 millisecond picture period without contributing to the overall signal quality beyond that immediate point in time.
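The decoding order and the 0.4 second figure can be checked with a short sketch. It assumes a frame rate of 30 frames per second and the twelve-frame pattern given above; the function name is hypothetical. Each B frame is held back until the I or P frame that follows it in display order has been decoded.

```python
def display_to_decode_order(display_types):
    """Return (display_index, frame_type) pairs in the order they are decoded."""
    decode_order = []
    pending_b = []  # B frames waiting for their successive reference frame
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append((index, frame_type))
        else:
            # An I or P frame is decoded first, then the B frames before it.
            decode_order.append((index, frame_type))
            decode_order.extend(pending_b)
            pending_b = []
    decode_order.extend(pending_b)  # trailing B frames with no later reference
    return decode_order


gop = "IBBPBBPBBPBB"   # twelve frames from one I frame to the next
frame_rate = 30.0      # assumed frames per second
i_frame_interval_s = len(gop) / frame_rate

# Append the next I frame so the final two B frames have a reference.
order = display_to_decode_order(gop + "I")
decode_indices = [index for index, _ in order]
```

With these assumptions the interval between I frames is 12 / 30 = 0.4 seconds, and the first four frames are decoded in display positions 0, 3, 1, 2.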
The B frames and P frames contain the motion information. The I frame has no motion values and stores the DCT information of the original frame of the video sequence.
As alluded to above, the MPEG-1 and MPEG-2 standards were developed to accommodate high-bandwidth applications (e.g., high definition television (HDTV)), while MPEG-4 was developed for lower bandwidth applications, e.g., interactive video systems. An interesting aspect of the MPEG-4 standard, necessary to support such interactive features, is that a video frame is actually defined as a number of video objects, each assigned to its own video object plane (VOP). Accordingly, once an MPEG-4 video frame is decompressed, a number of independently manipulatable video objects may be identified through any of a number of prior art video analysis techniques.
When the compressed multimedia object is accessed for use, it is decompressed in accordance with the compression scheme used to compress the multimedia object, e.g., using the inverse discrete cosine transform (IDCT). Once decompressed, further analysis of the multimedia objects may then be performed.
The standardization of such coding and compression techniques has fostered growth in multimedia peripherals and applications. An example application of such technology provides for the identification and manipulation of video objects extracted from video content. For example, in a video sequence depicting a person running across the street, it may be desirable to identify and manipulate the image of the person (or any other video object within the video sequence). Typically, to identify and manipulate a video object, the video sequence is received/retrieved, decompressed (e.g., using the IDCT) and analyzed in a decompressed digital domain. More particularly, the application analyzes the decompressed digital representation of the video content to identify the person running across the street, making that video object available for further manipulation by an end-user. These prior art techniques, while effective, have a number of inherent limitations.
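As an illustration of the transform pair mentioned above, the following sketch applies an orthonormal two-dimensional DCT to an 8x8 block and recovers the block with the matching IDCT. It is written in plain Python for clarity rather than speed, and all function names are hypothetical. Quantization, entropy coding and motion compensation, which a real codec also performs, are omitted.

```python
import math

N = 8  # block size used by the MPEG DCT


def _scale(k):
    """Normalization factor of the orthonormal DCT-II."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def _basis(n, k):
    """Cosine basis function sampled at position n for frequency k."""
    return math.cos((2 * n + 1) * k * math.pi / (2 * N))


def dct_2d(block):
    """Forward transform: spatial samples to frequency coefficients."""
    return [[_scale(u) * _scale(v) * sum(block[x][y] * _basis(x, u) * _basis(y, v)
                                         for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]


def idct_2d(coeffs):
    """Inverse transform: frequency coefficients back to spatial samples."""
    return [[sum(_scale(u) * _scale(v) * coeffs[u][v] * _basis(x, u) * _basis(y, v)
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]


# A flat block concentrates all of its energy in the single DC coefficient.
flat = [[128] * N for _ in range(N)]
flat_coeffs = dct_2d(flat)

# A gradient block survives the round trip up to floating-point error.
gradient = [[16 * x + y for y in range(N)] for x in range(N)]
restored = idct_2d(dct_2d(gradient))
```

The flat block shows why the transform helps compression: 64 identical samples reduce to one non-zero coefficient.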
First, the decompressed video objects are represented by an extremely large amount of data (as compared to the same video object(s) in the compressed digital domain). Not only does this large amount of data require a large amount of storage space (e.g., Random Access Memory (RAM)), it increases the computational complexity of the identification and manipulation process. Second, the decompression process is lossy, i.e., filters are typically employed to eliminate coding artifacts and to smooth the resultant image. Accordingly, such prior art multimedia applications were typically limited to high-end computing systems with advanced processors, large storage capacity and fast bus speeds executing sophisticated applications.
Thus, a system and related methods for analyzing multimedia objects are required, unencumbered by the above limitations commonly associated with the prior art. Indeed, what is required is a system and related methods that enable more moderate computing systems to benefit from recent advances in multimedia technology. Just such a system and related methods are presented in the pages to follow.
This invention concerns a system and related methods for analyzing media content in the compressed digital domain. According to a first implementation of the invention, a method comprises receiving media data in a compressed, digital domain, analyzing motion vectors of the received media data while in the compressed digital domain, and identifying one or more objects in the received media content based, at least in part, on the motion vector analysis. It is to be appreciated that identification of the one or more objects, i.e., dominant objects and subordinate objects, facilitates a host of applications to track and manipulate individual objects while in the compressed, digital domain.
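By way of illustration only, the following sketch shows one simple way that motion vectors could be grouped without decompressing the picture data. It is not the method of the invention, which is not detailed in this passage. It assumes that the motion vectors have already been parsed from the bit stream into a mapping from macroblock position to vector, and it treats the largest group of identical vectors as the dominant group. The function name and the input layout are hypothetical.

```python
from collections import Counter


def group_by_motion(motion_vectors):
    """Group macroblock positions by motion vector, largest group first.

    motion_vectors maps (row, col) of a macroblock to its (dx, dy) vector.
    """
    counts = Counter(motion_vectors.values())
    ranked = [vector for vector, _ in counts.most_common()]
    groups = {vector: sorted(position
                             for position, v in motion_vectors.items()
                             if v == vector)
              for vector in ranked}
    return ranked, groups


# Hypothetical 2x3 grid of macroblocks: four are static, two move right.
vectors = {
    (0, 0): (0, 0), (0, 1): (0, 0), (0, 2): (4, 0),
    (1, 0): (0, 0), (1, 1): (0, 0), (1, 2): (4, 0),
}

ranked, groups = group_by_motion(vectors)
dominant_vector = ranked[0]      # vector shared by the most macroblocks
subordinate_vectors = ranked[1:]  # remaining groups, in decreasing size
```

A practical analysis would tolerate small differences between vectors and would check that the grouped macroblocks are spatially connected; exact matching is used here only to keep the example short.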