Video data typically includes luminance and chrominance data for each pixel in a frame. Raw digital video data contains too much information for transmittal over normal communication media and requires considerable storage capacity. Consequently, to use either the limited bandwidth of the communication media or the available storage capacity efficiently, coding techniques are commonly used to compress the information contained in raw digital video data.
For example, in FIG. 1, video camera 101 generates an analog video signal that drives input processor 102 in encoding system 110. Input processor 102 digitizes and typically filters the analog video signal to produce a raw digital video signal. The raw digital video signal is encoded, i.e., compressed, by encoder 103.
The compressed digital video signal is transmitted over a communications channel, for example, a satellite link, to a decoding system 120 that includes a decoder 121, a post-processor 122, and a display driver 123. Decoder 121 decompresses the encoded video data and supplies the resulting signal to post-processor 122, which in turn smooths and enhances the video signal. The video signal from post-processor 122 supplies display driver 123 that drives display unit 130.
The encoding, i.e., compression, of video signals for storage or transmission and the subsequent decoding are well known. Moreover, the effectiveness of the compression is increased if a priori information concerning the content of the raw digital video data is available and exploited.
An important factor in the efficiency of the encoding and decoding processes is the prediction efficiency. Most commonly used encoding processes employ a motion compensated prediction loop. A motion compensated prediction loop is included, for example, in the H.261, MPEG1 and MPEG2 video compression standards. See, for example, ITU-T Recommendation H.261, "Codec for Audiovisual Services at px64 Kbps," Geneva 1993; ISO 11172-1 (MPEG1), "Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to About 1.5 Mbps, Part 1," 1993; and CD ISO/IEC 13818-2 (MPEG2), "Generic Coding of Moving Pictures and Associated Audio," 1993, each of which is incorporated herein by reference in its entirety.
Typically, in motion compensated prediction, a small portion of the current frame, usually a 16 pixel by 16 pixel block, is compared with a set of similarly sized blocks taken from a previously encoded frame, called the reference frame, which is stored in encoder 103. In the encoding process, a difference metric, such as mean-squared or mean-absolute difference between the current block and a reference block, is used for a comparison of the blocks. The block in the reference frame that best matches the current block, i.e., has the smallest difference metric, is chosen as the prediction block for the current block.
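The block comparison described above can be illustrated with a minimal sketch. The exhaustive search over a small window, the function names, and the default `search` range are assumptions made for illustration only; they are not taken from the cited standards, which leave the search strategy to the encoder designer.

```python
def mean_abs_diff(block_a, block_b):
    """Mean absolute difference between two equally sized 2-D blocks."""
    total = 0
    count = 0
    for row_a, row_b in zip(block_a, block_b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def get_block(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def find_prediction_block(current, reference, top, left, size=16, search=4):
    """Exhaustively search the reference frame around (top, left).

    Returns (vector, metric), where vector is the (dy, dx) offset of the
    reference block with the smallest mean absolute difference.
    The search window size is a hypothetical choice for this sketch.
    """
    height, width = len(reference), len(reference[0])
    target = get_block(current, top, left, size)
    best_vector, best_metric = None, float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if r < 0 or c < 0 or r + size > height or c + size > width:
                continue
            metric = mean_abs_diff(target, get_block(reference, r, c, size))
            if metric < best_metric:
                best_vector, best_metric = (dy, dx), metric
    return best_vector, best_metric
```

A practical encoder would typically replace the exhaustive loop with a faster search, but the selection criterion, the smallest difference metric, is the same.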
After the prediction block is selected, the difference between the current block and the prediction block, or in some cases, between the current block and a weighted prediction block, is computed to form a difference block. The difference block is then encoded and transmitted. Simultaneously, a vector pointing to the location of the prediction block in the reference frame is also transmitted as side channel information.
Decoder 121 has a copy of the reference frame in its memory. On receiving the prediction block vector, decoder 121 fetches the prediction block from the memory, and adds the prediction block to the decoded difference block to generate the new decoded block.
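The exchange between encoder 103 and decoder 121 can be sketched as follows. This is a simplified illustration that assumes the difference block is conveyed losslessly and that no weighting is applied to the prediction block; in the cited standards the difference block is further transformed and quantized before transmission. All function names are hypothetical.

```python
def fetch_prediction(reference, top, left, vector, size):
    """Fetch the prediction block that the vector points to in the reference frame."""
    dy, dx = vector
    rows = reference[top + dy:top + dy + size]
    return [row[left + dx:left + dx + size] for row in rows]


def encode_block(current_block, reference, top, left, vector):
    """Encoder side: form the difference block for a chosen prediction vector."""
    size = len(current_block)
    prediction = fetch_prediction(reference, top, left, vector, size)
    difference = [
        [c - p for c, p in zip(cur_row, pred_row)]
        for cur_row, pred_row in zip(current_block, prediction)
    ]
    # The difference block and the vector (side information) are both sent.
    return difference, vector


def decode_block(difference, reference, top, left, vector):
    """Decoder side: add the prediction block to the decoded difference block."""
    size = len(difference)
    prediction = fetch_prediction(reference, top, left, vector, size)
    return [
        [p + d for p, d in zip(pred_row, diff_row)]
        for pred_row, diff_row in zip(prediction, difference)
    ]
```

Because both sides hold the same reference frame, only the difference block and the vector need to be conveyed for the decoder to regenerate the block.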
A motion compensated prediction loop is effective in increasing the encoding efficiency because most of the time there is a good correlation between successive frames of a sequence. Clearly, the better the prediction is, the better the encoding efficiency is, because a better prediction means less information in the difference block and so less information to encode and transmit.
The content of the raw digital data depends on several factors. Video signals of commercial interest have a variety of formats. Typically, the format of the video signal is determined by the source of the video data. For example, the video frames supplied to input processor 102 may represent the output of a vidicon adhering to the NTSC prescribed rate of 30 frames per second (fps), or the video frames may represent certain cartoon sequences that are produced at 15 fps and then converted to 30 fps by repeating each frame once. (Herein, the NTSC prescribed rate is taken as 30 fps, which is 60 fields per second. As is known by those skilled in the art, the actual NTSC prescribed rate is 29.97 fps, which is 59.94 fields per second.) Herein, a frame includes two fields, an odd parity field and an even parity field. The first field contains lines 1 through 262 of the frame and the second field contains lines 263 through 525 of the frame. Typical display monitors display the first field followed by the second field for each frame in the video data stream. Also, each video source 101 is said to produce a type of video data.
If the characteristics of the video data source could be reliably detected in real time, the characteristics could be exploited to increase the efficiency of the data compression. For example, there is no need to send the repeated frames generated in a 15 fps to 30 fps cartoon conversion. A somewhat more complicated, but analogous, case arises when a 24 fps progressively scanned film sequence, referred to herein as a movie sequence, is converted to match the conventional NTSC frame rates.
A movie sequence, which was originally produced on photographic film and typically shot at 24 fps, is converted to the NTSC rate of 60 fields/second by a telecine machine using a process commonly known as 3:2 pulldown that is described below. In this document, either "film" or "movie" refers to a video data type shot at 24 fps while "video" refers to a video data type originally produced at 60 fields/second.
A video sequence is typically interlaced, which means that the odd and even parity fields of the same frame are temporally disjoint. A film sequence, in contrast, is progressive, which means that all the information in a frame is captured at the same time instant. Also, there are some cases when the source may be shot at the NTSC video rate of 30 fps and yet be progressive.
A 3:2 pulldown increases the frame rate of a movie sequence by periodically repeating certain fields in the film sequence. As an example, consider the 24 fps film sequence given in FIG. 2A. The letters "a b c . . . " denote successive frames in the sequence while the numbers "1" and "2" denote the odd and even parity fields, respectively, in each frame. In a general situation, frames a b c . . . can be assumed to be different. The telecine machine increases the frame rate to 30 fps by replacing the set of frames in FIG. 2A with the set of frames in FIG. 2B.
The sequence of frames in FIG. 2B has a five-frame periodicity. Each set of four frames of the original 24 fps sequence (FIG. 2A) is converted to five frames by repeating once the first and sixth fields of the four frame set. Moreover, the repetition is done in a manner that introduces two mixed-frames in the five frame set. In FIG. 2B, the mixed-frames are underlined. Mixed-field frames are those frames that contain fields from different temporal instants and are introduced in this case by the aforementioned frame rate conversion process. Herein, the term mixed-field frame and the term mixed-frame are used interchangeably.
From the point of view of encoder 103, such a 3:2 pulldown operation is doubly inefficient. First, part of the information is coded redundantly because of the repetition of some fields. Second, forty percent of the frames are now "mixed" (a1b2, b1c2, e1f2 and f1g2 here), which means they are composed of temporally disjoint fields. This introduces interlace effects, which in turn result in a drop in the prediction efficiency of the frame-based encoding. For example, in a motion compensated prediction loop, if a mixed-frame is suddenly introduced into the sequence, the mixed-frame significantly decreases the probability of finding a good prediction block. As a result, the best possible difference block has a significant amount of information that must be encoded and transmitted.
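The conversion pattern and the resulting share of mixed-frames can be illustrated with a short sketch. The frame ordering used here is inferred from the mixed-frames named in the text (a1b2, b1c2, e1f2 and f1g2) and from the statement that the first and sixth fields of each four-frame set are repeated; the function names and the label representation are assumptions made for illustration.

```python
def pulldown_3_2(film_frames):
    """Map each group of four film frames to five video frames.

    Each film frame is a label such as "a"; each output frame is an
    (odd_field, even_field) pair of labels such as ("a1", "b2").
    Trailing frames that do not fill a group of four are ignored.
    """
    video = []
    usable = len(film_frames) - len(film_frames) % 4
    for i in range(0, usable, 4):
        a, b, c, d = film_frames[i:i + 4]
        video += [
            (a + "1", a + "2"),
            (a + "1", b + "2"),  # mixed: field a1 is repeated
            (b + "1", c + "2"),  # mixed
            (c + "1", c + "2"),  # field c2 is repeated
            (d + "1", d + "2"),
        ]
    return video


def is_mixed(frame):
    """A frame is mixed when its two fields come from different film frames."""
    odd, even = frame
    return odd[:-1] != even[:-1]


video = pulldown_3_2(list("abcdefgh"))
mixed = ["".join(frame) for frame in video if is_mixed(frame)]
mixed_fraction = len(mixed) / len(video)
```

For the eight film frames a through h, the sketch yields ten video frames, four of which are mixed, consistent with the forty percent figure stated above.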
Hence, appearance of mixed-field frames in the input video data sequence affects the compression performance of encoder 103 significantly. Thus, it would be beneficial to remove or, if possible, repair such frames before compression. Such processing could be performed in either real or non-real time. Unfortunately, the inventors are unaware of any prior art method or apparatus for removing or repairing such mixed-frames in real time.
A significant additional saving in transmitted data rate could be achieved by eliminating transmission of some, most, or all repeated fields. Similarly, elimination of mixed-field frames would improve the prediction mechanism employed by commonly used video compression schemes.
There are other instances when mixed-field frames might be present in a video sequence. Sometimes mixed-field frames are intentionally created in the studio, but often mixed-field frames are accidentally introduced during editing operations. For example, in FIG. 2C a normal 30 fps interlaced video sequence is represented. An edit was intended to switch from a first scene to a new scene beginning with frame u1u2, but in the editing process, a mixed-field frame d1u2 was created (FIG. 2D). Mixed-field frame d1u2 is said to contain a scene cut. Similarly, in FIG. 2E, two telecine sequences were edited and merged, but the editing process created mixed-field frame c1v2 that contains a scene cut.
Thus, the mixed-field frames in FIGS. 2D and 2E were created by mid-frame edits. In general, mid-frame edits refer to an abrupt change in scene in the middle of a frame, i.e., the odd and even parity fields of a frame belong to different scenes. Such frames are also difficult to compress efficiently.
To meet a wide variety of user requirements and bandwidth constraints, video compressors need to process video frames of different spatial resolutions, even if the original source material was all of the same resolution. Spatial resolution is altered by digitally filtering each frame of the sequence before the frame is compressed. The filtering may be done in the horizontal direction, the vertical direction, or both. When filtering is done in the vertical direction, preferably the nature of the frame, i.e., whether the frame is interlaced or progressive, would be taken into account, because different modes of vertical filtering are required for interlaced and progressive frames to produce good visual quality in decoding system 120. Typically, interlaced frames have a better visual quality when the frames are vertically filtered on a field basis, while the opposite is true for progressive frames.
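The distinction between the two modes of vertical filtering can be sketched as follows. The [1, 2, 1]/4 low-pass kernel, the edge handling, and the function names are assumptions chosen only to make the example concrete; the text does not specify a particular filter.

```python
def vertical_smooth(lines):
    """Apply a [1, 2, 1] / 4 vertical low-pass filter, replicating edge lines."""
    n = len(lines)
    out = []
    for i in range(n):
        above = lines[max(i - 1, 0)]
        below = lines[min(i + 1, n - 1)]
        out.append([(u + 2 * m + d) / 4 for u, m, d in zip(above, lines[i], below)])
    return out


def filter_frame_based(frame):
    """Filter across adjacent frame lines (suited to progressive frames)."""
    return vertical_smooth(frame)


def filter_field_based(frame):
    """Filter each parity field separately, then re-interleave the lines."""
    first_field = vertical_smooth(frame[0::2])
    second_field = vertical_smooth(frame[1::2])
    out = [None] * len(frame)
    out[0::2] = first_field
    out[1::2] = second_field
    return out
```

With an interlaced frame whose two fields differ strongly, frame-based filtering blends lines captured at different time instants, whereas field-based filtering keeps each field's content separate.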