A. Field of the Invention
The present invention relates to storing video images in, and retrieving video images from, large digital video archives, the video images being encoded using the Moving Picture Experts Group (hereinafter referred to as MPEG) encoding standard or the Motion JPEG (Joint Photographic Experts Group) encoding standard, and specifically to the extraction, based on DC coefficients and motion vectors, of video sequence signatures from MPEG or Motion JPEG compressed videos, and the search for and retrieval of videos based on the extracted signatures.
B. Description of the Related Art
The MPEG video compression standard is described in Practical Digital Video With Programming Examples In C, by Phillip E. Mattison, John Wiley and Sons, 1994, chapter 11, pages 373 through 393, and in "MPEG: A Video Compression Standard for Multi-Media Applications", by Didier Le Gall, Communications of the ACM, April 1991, vol. 34, no. 4, pp. 47 through 58, and the JPEG video compression standard is described in "The JPEG Still Picture Compression Standard", by Gregory K. Wallace, Communications of the ACM, April 1991, vol. 34, no. 4, pp. 31 through 44, all of which are incorporated by reference herein.
Also applicable to video images digitally encoded using the Motion JPEG standard, the present invention is set forth herein with reference to video images digitally encoded using the MPEG encoding standard.
MPEG is an encoding standard used for digitally encoding motion pictures, typically with a computer, for use in the information processing industry. With the MPEG encoding standard, video images can be stored on CD-ROMs, on magnetic storage such as hard drives, diskettes, and tape, and in random access memory (RAM) and read-only memory (ROM). Further, the MPEG encoding standard allows video images to be transmitted through computer networks such as ISDNs, wide area networks, local area networks, the INTERNET™, the INTRANET™, and other communications channels as explained below.
Video clips (which are also referred to as video streams) are sequences of an arbitrary number of video frames or images. An example of a video clip is a sequence of images from a television news show or another source. MPEG video clips encoded as MPEG-video or using the MPEG system-layer encoding are eligible to have signatures extracted, in accordance with the present invention.
The MPEG encoding standard applies to video compression of images and the temporal aspects of motion video, taking into account extensive frame-to-frame redundancy present in video sequences.
In the MPEG encoding standard, the color representation is YCrCb, a color scheme in which luminance and chrominance are separated. Y is the luminance component of color, and Cr and Cb are the two chrominance components of color. The chrominance information is subsampled at one-half the luminance rate in both the horizontal and vertical directions, giving one value of Cr and one value of Cb for each 2×2 block of luminance pixels; that is, for each four pixels of luminance, one pixel of Cr and one pixel of Cb are present. Chrominance and luminance pixels are organized into 8×8 pixel blocks (or blocks). Pixel blocks are transformed into the frequency domain using the discrete cosine transform (DCT) operation, resulting in a DC component, which is proportional to the average pixel value of the block, and AC components corresponding to each pixel block.
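The subsampling and DCT steps described above can be illustrated with the following minimal sketch. It is not taken from any MPEG encoder; the function names `subsample_chroma` and `dct_8x8` are hypothetical, and averaging each 2×2 group is only one assumed way of subsampling (an encoder may use a different filter). The sketch uses the conventional 8×8 DCT normalization, under which the DC coefficient equals eight times the mean of the block.

```python
import math

N = 8  # the DCT is applied to 8x8 pixel blocks


def subsample_chroma(plane):
    """Halve a chroma plane in both directions by averaging each 2x2 group.

    Averaging is an assumption of this sketch, not a requirement of MPEG.
    The plane must have an even number of rows and columns.
    """
    rows, cols = len(plane), len(plane[0])
    return [
        [
            (plane[r][c] + plane[r][c + 1]
             + plane[r + 1][c] + plane[r + 1][c + 1]) / 4.0
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def dct_8x8(block):
    """Return the 2-D DCT of an 8x8 block given as a list of 8 lists of 8 numbers.

    coeffs[0][0] is the DC component; every other entry is an AC component.
    """
    def scale(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (
                        block[x][y]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                    )
            coeffs[u][v] = 0.25 * scale(u) * scale(v) * total
    return coeffs
```

For a uniform block, all of the energy falls in the DC component and the AC components vanish, which is why the DC component alone gives a coarse, low-resolution description of a block.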
In the MPEG encoding standard, each image in a sequence is one of four types: an I frame, a P frame, a B frame, or a D frame. Further, in the MPEG encoding standard, each image is divided into slices, with a slice comprising one or more macro blocks. Slices are typically contiguous macro blocks.
A macro block comprises four 8×8 blocks of luminance pixels and one 8×8 block for each of the two chrominance (chroma) components. Therefore, an encoded macro block comprises the DCT coefficients for four 8×8 blocks of luminance pixels and for one 8×8 block of each of the two chrominance components. Alternatively, for B or P frames only, the macro block may be encoded using forward or backward motion vectors. A forward motion vector of a frame is based on motion relative to a previous frame, while a backward motion vector of a frame is based on motion relative to a subsequent frame.
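By way of illustration only, the macro block layout described above might be modeled as follows. The class `MacroBlock`, its field names, and the helper `split_luma_16x16` are hypothetical names introduced for this sketch and do not appear in the MPEG encoding standard. The sketch relies on the fact that four 8×8 luminance blocks cover a 16×16 luminance region, which, with chrominance subsampled by two in each direction, corresponds to one 8×8 block of each chrominance component.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Block = List[List[int]]  # one 8x8 block of pixels or of DCT coefficients


@dataclass
class MacroBlock:
    """Hypothetical container for the contents of one macro block."""
    luma: List[Block]  # four 8x8 luminance (Y) blocks
    cb: Block          # one 8x8 block of the Cb chrominance component
    cr: Block          # one 8x8 block of the Cr chrominance component
    # Motion vectors are optional and apply to P or B frames only.
    forward_mv: Optional[Tuple[int, int]] = None   # relative to a previous frame
    backward_mv: Optional[Tuple[int, int]] = None  # relative to a subsequent frame


def split_luma_16x16(region):
    """Split a 16x16 luminance region into four 8x8 blocks.

    The blocks are returned in the order top-left, top-right,
    bottom-left, bottom-right.
    """
    return [
        [row[c:c + 8] for row in region[r:r + 8]]
        for r in (0, 8)
        for c in (0, 8)
    ]
```

A macro block without motion vectors, as in an I frame, simply leaves both motion vector fields unset.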
Within a video clip encoded using the MPEG encoding standard, the value of a DC coefficient is encoded relative to the previous DC coefficient, with DC values for luminance being encoded relative to other luminance values and DC values for chrominance being encoded relative to chrominance values.
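The relative encoding of DC coefficients described above can be sketched as a simple differential scheme, in which each value is stored as its difference from the preceding value of the same component. The function names `encode_dc` and `decode_dc` are hypothetical, and the starting predictor of 0 is an assumption made for this sketch; the MPEG encoding standard defines its own predictor reset rules, which are not reproduced here.

```python
def encode_dc(dc_values, predictor=0):
    """Return each DC value as its difference from the previous DC value.

    `predictor` is the value assumed to precede the first coefficient;
    the default of 0 is an assumption of this sketch.
    """
    differences = []
    for dc in dc_values:
        differences.append(dc - predictor)
        predictor = dc
    return differences


def decode_dc(differences, predictor=0):
    """Recover the DC values by accumulating the differences."""
    values = []
    for diff in differences:
        predictor += diff
        values.append(predictor)
    return values


# Luminance and chrominance are kept in separate chains, so that each
# component is predicted only from earlier values of the same component.
luma_dc = [800, 812, 808, 808]
cb_dc = [1024, 1020, 1020]
luma_diffs = encode_dc(luma_dc)  # [800, 12, -4, 0]
cb_diffs = encode_dc(cb_dc)      # [1024, -4, 0]
```

Because neighboring blocks tend to have similar average values, the differences are usually small. Recovering the actual DC values from a compressed stream therefore requires accumulating these differences in order, as `decode_dc` does.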
With the MPEG encoding standard, which comprises MPEG-video, MPEG-audio, and MPEG system-layer encoding (which incorporates MPEG-video, MPEG-audio, and information regarding how the two interact), motion video can be manipulated as data transmitted between computers, and manipulated within a computer.
Building large video archives which allow video clips to be stored, retrieved, manipulated, and transmitted efficiently requires the incorporation of various technologies. Examples of such technologies are video analysis, content recognition, video annotation, and browsing. From the user's point of view, the most important capability is efficient retrieval based on the content of video clips. The existing methods for content-based retrieval rely principally on the extraction of key frames or on text annotations.
Video browsing systems that rely on text retrieve video sequences based on keyword annotations. The textual annotations, which are normally stored separately, can be indexed using full text retrieval methods or natural language processing methods.
Browsing systems that use key frames as a representative model for video sequences rely on the idea of detecting shot boundaries and choosing one or more frames as key frames. A shot is a contiguous sequence of video frames that conveys part of a story. Most modern movies contain over a thousand cuts (a cut is a point in a video sequence at which there is a scene change, that is, a change between shots), requiring an intelligent video retrieval program to process on the order of a thousand frames per movie to give a coherent representation. For the user to see what is in the video, the user must preview the key frames in the above-mentioned browsing systems.
Further, the above-mentioned browsing systems use individual key frames and motion to search for video clips, and do not account for the sequence of key frames that represents a video clip when the whole video clip is submitted as a query.
An alternative retrieval method for the browsing systems is to display particular frames of the video sequence to the user, allowing the user to review and select the particular video sequence. However, this alternative method is time consuming for the user.