In the era of multimedia communications, the processing of digital video poses many technical challenges because of the large amount of data involved and the limited channel bandwidth available in practice. For example, in teleconferencing or videophone applications, transmitting the digital video (say, acquired through a digital camera) to the receiver in real time for visual communications requires a compression process. Compression greatly reduces the original amount of video data by discarding redundant information while keeping the essential information as intact as possible, so that the original video quality is maintained at the receiver side after reconstruction. Such video processing is called digital video coding.
A basic method for compressing digital color video data to fit the available bandwidth has been adopted by the Moving Picture Experts Group (MPEG), which has produced the MPEG-1, MPEG-2, and MPEG-4 compression standards. MPEG achieves high data compression by using the Discrete Cosine Transform (DCT) technique for the intra-coded pictures (called I-frames) and motion estimation and compensation techniques for the inter-coded pictures (called P-frames or B-frames). I-frames occur only every so often and are the least compressed frames; thus, they yield the highest video quality and are used as reference anchor frames. The frames that lie between the I-frames are P-frames and/or B-frames, generated based on nearby I-frames and/or existing P-frames. Motion estimation for generating motion vectors is conducted for the P-frames and B-frames only. A typical frame structure could be IBBPBBPBBPBBPBB IBBPB . . . , repeated until the last video frame. The so-called Group of Pictures (GOP) begins with an I-frame and ends on the frame that immediately precedes the next I-frame. In the above example, the size of the GOP is 15.
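By way of illustration only, the GOP boundaries in the example frame structure can be recovered by starting a new group at every I-frame. The function name `split_into_gops` and the string representation of the frame types are assumptions made for this sketch and are not part of any MPEG standard.

```python
def split_into_gops(frame_types):
    """Split a string of frame types ('I', 'P', 'B') into GOPs.

    Each GOP starts at an I-frame and runs up to, but not including,
    the next I-frame.
    """
    gops = []
    for frame_type in frame_types:
        # Open a new group at every I-frame (or at the very first frame).
        if frame_type == "I" or not gops:
            gops.append("")
        gops[-1] += frame_type
    return gops


# Two full GOPs from the example above, followed by a truncated one.
sequence = "IBBPBBPBBPBBPBB" * 2 + "IBBPB"
gops = split_into_gops(sequence)
print([len(gop) for gop in gops])  # [15, 15, 5]
```

The first two groups each contain 15 frames, which matches the GOP size stated above; the last group is shorter only because the example sequence is cut off.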
To generate motion vectors by motion estimation, each P-frame is partitioned into smaller blocks of pixel data, typically 16×16 in size, called macroblocks (MBs) in MPEG's jargon. Each MB is then shifted around its neighborhood on the previous I-frame or P-frame in order to find the most similar block within the imposed search range. Only the motion vector of that best-matching block is then recorded and used to represent the corresponding MB. Motion estimation for a B-frame is conducted similarly but in both directions, forward prediction and backward prediction.
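The block-matching search described above can be sketched as follows. This is a minimal illustration, not the encoder of any particular standard: it assumes frames are lists of rows of luminance values, uses the sum of absolute differences (SAD) as the similarity measure (the text does not name one), and the function names `sad` and `full_search` are hypothetical. The motion vector is expressed as the displacement from the block's position in the current frame to the matching position in the reference frame.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n=16, search_range=7):
    """Test every displacement within +/- search_range and return
    (dx, dy, cost) for the candidate with the smallest SAD."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates that fall outside the reference frame.
            if bx + dx < 0 or by + dy < 0:
                continue
            if bx + dx + n > width or by + dy + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# Synthetic test frames: the current frame shows the reference content
# displaced so that the true motion vector is (2, -1).
SIZE = 32
ref = [[(7 * x + 13 * y) % 256 for x in range(SIZE)] for y in range(SIZE)]
cur = [[(7 * (x + 2) + 13 * (y - 1)) % 256 for x in range(SIZE)]
       for y in range(SIZE)]

print(full_search(cur, ref, bx=8, by=8, n=8, search_range=4))  # (2, -1, 0)
```

Every candidate position inside the search window is evaluated, so the cost grows with the square of the search range; this is the computational burden that fast motion estimation methods aim to reduce.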
Note that fast motion estimation methods can be directly applied to all existing international video-coding standards, as well as to any proprietary compression system that adopts a similar motion-compensated video coding methodology, since they all share the same approach described above for reducing temporal redundancy. Besides MPEG, another set of video coding standards, ITU's H.261, H.263, and H.26L, for teleconferencing or videophone applications also requires such motion vector generation.
The exhaustive search described above typically accounts for a large portion (about three-quarters) of the total processing time consumed at a typical video encoder. Hence, a fast algorithm is indispensable to the realization of real-time visual communications services. To that end, we invented a scalable technique for performing fast motion estimation. The scalability is useful for meeting different requirements, such as implementation, delay, distortion, computational load, and robustness, while minimizing the incidence of over-search (which increases delay) or under-search (which may increase distortion). For example, in a multimedia PC environment with such a scalable implementation, the user can have a few choices of video quality mode for different visual communications applications, and even for different Internet traffic situations and/or types of service. In videophone, for instance, small delay in conversation is probably the most important requirement, at the cost of reduced video quality. In another application scenario, a different fast motion estimation algorithm can be selected for creating a high-quality video email (if so desired) to be sent later on. This is an off-line video application, for which delay is not an issue. Another example is so-called object-oriented video coding, where multiple video objects are identified; activating one of the block-matching motion estimation profiles can flexibly generate the motion vectors associated with each video object.
After the motion vectors (MVs) are generated, certain simple statistical measurements (say, mean and variance) of the MVs can be easily computed to yield a “content-complexity indicator” (in the MPEG-4 video coding standard, a mnemonic called ƒ_code). Such an indicator is useful for capturing a summarized snapshot of each video frame. For example, based on the category information of the ƒ_code, one can easily locate the shots that contain high-motion activity and determine their durations.
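As an illustrative sketch, the mean and variance of the motion-vector magnitudes in a frame can be computed and mapped to a coarse activity category. The use of vector magnitude, the category names, and the thresholds `low` and `high` are assumptions chosen for this example; they are not taken from the MPEG-4 standard or from the method described here.

```python
import math
from statistics import mean, pvariance


def motion_statistics(motion_vectors):
    """Return (mean, variance) of the magnitudes of a frame's motion vectors."""
    magnitudes = [math.hypot(dx, dy) for dx, dy in motion_vectors]
    return mean(magnitudes), pvariance(magnitudes)


def activity_category(mean_magnitude, low=1.0, high=4.0):
    """Map the mean magnitude to a coarse category.

    The thresholds are hypothetical and would need tuning in practice.
    """
    if mean_magnitude < low:
        return "low"
    if mean_magnitude < high:
        return "medium"
    return "high"


# One list of (dx, dy) motion vectors per frame.
frames = {
    "static_shot": [(0, 0), (1, 0), (0, 1), (0, 0)],
    "racing_shot": [(6, 8), (3, 4), (6, 8), (3, 4)],
}
for name, vectors in frames.items():
    avg, var = motion_statistics(vectors)
    print(name, activity_category(avg))
# static_shot low
# racing_shot high
```

Scanning such per-frame categories over time is one simple way to find runs of consecutive high-motion frames.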
The segmented regions that correspond to each video object can form an alpha-plane mask, which is basically a binary mask for each video object in each individual frame that separates the object from the background. Based on such alpha-plane information, the user can easily engage in interactive hypermedia-like functionality with the video frames. For example, the user can click on any video object of interest at any time, say, a fast-moving racing car, and an information box will pop up and provide some pre-stored information, such as the driver's name and age, past driving record and Grand Prizes awarded, and other relevant information. Note that each video object has its own associated information box, and its trajectory can serve as a reliable linkage among the alpha-plane masks of the same video object.
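The click-to-information interaction can be sketched as a lookup in the per-object binary masks of the current frame. The data layout (one mask per object, stored as rows of 0/1 values), the object identifiers, and the placeholder information-box contents are all assumptions made for illustration.

```python
def rectangle_mask(width, height, left, top, right, bottom):
    """Binary mask that is 1 inside [left, right) x [top, bottom), else 0."""
    return [[1 if left <= x < right and top <= y < bottom else 0
             for x in range(width)]
            for y in range(height)]


def object_at(masks, x, y):
    """Return the id of the first object whose mask covers pixel (x, y)."""
    for object_id, mask in masks.items():
        if mask[y][x]:
            return object_id
    return None  # The click landed on the background.


def on_click(masks, info_boxes, x, y):
    """Return the pre-stored information box for the clicked object, if any."""
    object_id = object_at(masks, x, y)
    return info_boxes.get(object_id) if object_id is not None else None


# Alpha-plane masks for one 8 x 6 frame (rectangles stand in for real shapes).
masks = {
    "racing_car": rectangle_mask(8, 6, 1, 2, 4, 4),
    "pit_crew": rectangle_mask(8, 6, 5, 0, 7, 3),
}
info_boxes = {
    "racing_car": {"title": "Racing car", "details": "pre-stored driver data"},
    "pit_crew": {"title": "Pit crew", "details": "pre-stored team data"},
}

print(on_click(masks, info_boxes, 2, 3))  # information box of the racing car
print(on_click(masks, info_boxes, 0, 0))  # None (background)
```

Because the masks change from frame to frame, a real system would select the mask set for the frame being displayed at the moment of the click; the object's trajectory links those masks to the same identifier.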
The motion vectors generated as described above can be further processed for intelligent content-based indexing and retrieval. For example, searching relevant multimedia materials (say, video clips) in a large database and retrieving those whose content is identical or very similar to that of the query would be very desirable in many applications, such as Internet search engines and digital libraries. Rather than relying on the conventional approach, that is, keywords only, the so-called content-based search is fairly promising and effective in achieving this objective, since the “content”, such as color, texture, shape, a video object's motion trajectory, and so on, is often hard, and sometimes impossible, to describe in words. Obviously, this is not a trivial task; in fact, it needs a suite of intelligent processes. Besides other prominent features such as color, texture, and shape, motion trajectory is another key feature of digital video. In this invention, the content specifically means the motion trajectory of a video object identified from the given digital video clip. The remainder of this invention presents a method that is capable of automatically identifying multiple moving video objects and then simultaneously tracking them based on their respective motion trajectories. In this scenario, the user can impose a query by drawing a curve on a computer, say, a parabolic curve to signify a diver's diving action, in order to search for those video clips that contain such video content.
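A query-by-sketch search of this kind can be illustrated with a very simple trajectory comparison. The sketch below is an assumption-laden example rather than the matching method of this invention: it resamples each trajectory to a fixed number of points, normalizes position and scale, and ranks stored clips by the mean point-to-point distance to the drawn curve. All names and the sample trajectories are hypothetical.

```python
import math


def resample(points, count=16):
    """Linearly interpolate along the point index to obtain `count` points.

    Assumes the trajectory has at least two points.
    """
    out = []
    for i in range(count):
        t = i * (len(points) - 1) / (count - 1)
        k = min(int(t), len(points) - 2)
        frac = t - k
        (x0, y0), (x1, y1) = points[k], points[k + 1]
        out.append((x0 + frac * (x1 - x0), y0 + frac * (y1 - y0)))
    return out


def normalize(points):
    """Shift to the origin and scale by the larger bounding-box side."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    scale = max(max(xs) - min(xs), max(ys) - min(ys)) or 1.0
    return [((x - min(xs)) / scale, (y - min(ys)) / scale) for x, y in points]


def trajectory_distance(a, b, count=16):
    """Mean distance between corresponding points of two trajectories."""
    pa = normalize(resample(a, count))
    pb = normalize(resample(b, count))
    return sum(math.hypot(xa - xb, ya - yb)
               for (xa, ya), (xb, yb) in zip(pa, pb)) / count


# Query: a parabolic curve drawn by the user.
query = [(i / 10, (i / 10 - 1) ** 2) for i in range(21)]

# Stored object trajectories, in pixel coordinates (illustrative data).
database = {
    "dive": [(100 + 5 * i, 50 + 0.5 * (i - 10) ** 2) for i in range(21)],
    "walk": [(10 + 3 * i, 40) for i in range(21)],
}

ranked = sorted(database, key=lambda name: trajectory_distance(query, database[name]))
print(ranked[0])  # dive
```

The normalization step is what lets a small hand-drawn curve match a trajectory recorded at a different position and scale in the frame. A practical system would likely need a measure that also tolerates differences in speed and sampling along the path.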
Our invention essentially provides a fundamental core technology that mimics, to a certain degree, a human being's capability to detect moving video objects and track each object's movement. A typical application that can benefit from this invention is as follows. In a security surveillance environment, intruding moving objects can be automatically detected, and the trajectory information can be used to steer the video cameras to follow the movement of the video objects while recording the incidents. Another application example can be found in digital video indexing and retrieval.