1. Field of the Invention
The present invention relates to image processing, and, in particular, to video compression processing.
2. Description of the Related Art
The primary goal in video compression processing is to reduce the number of bits used to represent sequences of video images while still maintaining an acceptable level of image quality during playback of the resulting compressed video bitstream. Another goal in many video compression applications is to maintain a relatively uniform bit rate, for example, to satisfy transmission bandwidth and/or playback processing constraints. Video compression processing often involves the tradeoff between bit rate and playback quality. This tradeoff typically involves reducing the average number of bits used to encode images in the original video sequence by selectively decreasing the playback quality of each image that is encoded into the compressed video bitstream.
Many video compression systems, such as those based on an MPEG (Moving Picture Experts Group) standard, gain much of their compression capability by making predictions from other, previously coded pictures. Although the term "frame" is used throughout this specification, those skilled in the art will understand that the teachings of this specification apply generally to video pictures, a term that covers both video frames and video fields.
MPEG coders have three main types of frames: I, P, and B. An I frame is coded independently, without reference to any other frames. A P frame is coded as the motion-compensated difference between itself and a reference frame derived from the previously coded P or I frame. I and P frames are referred to as anchor frames, because they can be used to generate reference frames for coding other frames. Each macroblock in a B frame is coded as the difference between itself and either (1) the previous anchor frame (i.e., forward prediction), (2) the next anchor frame (i.e., backward prediction), or (3) the average of the previous and next anchor frames (i.e., interpolated or bidirectional prediction). B frames are non-anchor frames that are never used to predict other frames. Thus, errors in B frames do not propagate to other frames and are only one picture in duration. Note that the human visual system objects less to errors of very short duration.
Although the MPEG standards make no restrictions on a particular sequence of frame types, many coders simply use a repeating pattern of I, P, and B frames. Since B frames can be predicted from not only a previous frame, but a future frame as well, B frames must be sent to the decoder after the anchor frames that surround them. To make this "out-of-order" decoding efficient, the frames are encoded into the corresponding compressed video bitstream out of temporal order.
FIG. 1 shows a block diagram of a conventional video compression system 100 for reordering and encoding a stream of video frames into a compressed video bitstream. System 100 implements a video coding scheme that is based on a repeating frame pattern having two B frames between each pair of consecutive anchor frames (e.g., IBBPBBPBBPBBPBBPBB for a 15-frame GOP (group of pictures)). Table I in FIG. 2 shows the relationship between the temporal order of frames (as they appear in the input video stream) and the order in which those frames are coded into a compressed video bitstream by system 100. Table I also shows the tap position of switch 104 used to reorder the video frames in order to generate the bitstream.
Frames are presented at the video input of system 100 in temporal order starting with Frame 0, then Frame 1, etc. As each new frame is presented at the video input, the frame stored in frame-delay buffer 102c is made available at tap T0 and the new frame is made available at tap T3. Depending on the position selected for two-position switch 104, encoder 106 codes either the frame at tap T0 or the frame at tap T3. As encoder 106 codes the selected frame, the frame stored in frame-delay buffer 102b is moved into frame-delay buffer 102c, the frame stored in frame-delay buffer 102a is moved into frame-delay buffer 102b, and the new frame is stored into frame-delay buffer 102a. 
At the beginning of a video stream, when Frame 0 is presented at the video input and therefore at tap T3, switch 104 is positioned at tap T3 to enable encoder 106 to encode Frame 0 as an I frame (i.e., I0 in Table I). Processing of encoder 106 is then temporarily suspended until all the frame-delay buffers 102 are filled, such that Frame 0 is stored in buffer 102c and presented at tap T0, Frame 1 is stored in buffer 102b, Frame 2 is stored in buffer 102a, and Frame 3 is presented at the video input and at tap T3. At this time, switch 104 is again positioned at tap T3 so that Frame 3 can be coded as a P frame (i.e., P3 in Table I).
In the next processing cycle, Frame 1 is stored in buffer 102c and presented at tap T0, Frame 2 is stored in buffer 102b, Frame 3 is stored in buffer 102a, and Frame 4 is presented at the video input and at tap T3. At this time, switch 104 is positioned at tap T0 so that Frame 1 can be coded as a B frame (i.e., B1 in Table I).
In the next processing cycle, Frame 2 is stored in buffer 102c and presented at tap T0, Frame 3 is stored in buffer 102b, Frame 4 is stored in buffer 102a, and Frame 5 is presented at the video input and at tap T3. At this time, switch 104 is again positioned at tap T0 so that Frame 2 can be coded as a B frame (i.e., B2 in Table I).
In the next processing cycle, Frame 3 is stored in buffer 102c and presented at tap T0, Frame 4 is stored in buffer 102b, Frame 5 is stored in buffer 102a, and Frame 6 is presented at the video input and at tap T3. At this time, switch 104 is repositioned at tap T3 so that Frame 6 can be coded as a P frame (i.e., P6 in Table I).
This processing is continued for each frame in each 15-frame GOP in the video stream with switch 104 positioned at tap T0 to code a B frame and at tap T3 to code an anchor (I or P) frame according to the GOP pattern (IBBPBBPBBPBBPBB), as indicated in Table I.
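The reordering described above can be modeled in a few lines of code. The following Python sketch is illustrative only (it is not part of the specification): it maps temporal frame numbers to the coding order of Table I, assuming one anchor every third frame and a 15-frame GOP, with each anchor coded before the two B frames that temporally precede it.

```python
def coding_order(num_frames, gop_size=15, period=3):
    """Return (frame_number, type) pairs in encode order: each anchor
    is coded before the B frames that temporally precede it, mirroring
    the out-of-order coding performed by switch 104 in FIG. 1."""
    order = []
    prev_anchor = None
    for n in range(0, num_frames, period):       # anchor frames: 0, 3, 6, ...
        ftype = "I" if n % gop_size == 0 else "P"
        order.append((n, ftype))                 # switch at tap T3: code anchor
        if prev_anchor is not None:
            for b in range(prev_anchor + 1, n):  # switch at tap T0: B frames
                order.append((b, "B"))
        prev_anchor = n
    return order

print(coding_order(9))
# [(0, 'I'), (3, 'P'), (1, 'B'), (2, 'B'), (6, 'P'), (4, 'B'), (5, 'B')]
```

The output reproduces the coding order I0, P3, B1, B2, P6, B4, B5 shown in Table I; frames 7 and 8 remain pending until anchor frame 9 arrives.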
Some video streams contain flash frames. For purposes of this specification, a flash sequence is defined as a set of one or more consecutive frames that are relatively poorly correlated to both the frame immediately preceding the sequence and the frame immediately following it, where those two surrounding frames are themselves relatively well-correlated to each other. A common example of a flash sequence is the phenomenon produced by still picture photographers at events, such as basketball games. A photographer's flash usually produces, in a video stream, a single frame that is mostly white, or at least of much higher intensity than the frames both before and after it. Such a flash frame (i.e., a one-frame flash sequence) will be poorly correlated to the temporally surrounding frames.
Some encoders are able to detect "scene cuts" by looking for a pair of consecutive frames that are highly uncorrelated to one another, where the degree of correlation may be characterized using a distortion measure, such as the mean absolute difference (MAD) of the motion-compensated interframe pixel differences. In response, such encoders may insert an I frame at the next scheduled anchor frame time (i.e., potentially replacing a regularly scheduled P frame with an I frame). Such encoders will mistakenly identify a flash sequence as a scene cut, based on the large distortion between the first frame in the flash sequence and its immediately preceding frame. Such a scene cut will be detected for individual, isolated flash frames as well as multi-frame flash sequences.
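A MAD-based distortion measure of the kind referenced above can be sketched as follows. This is an illustrative simplification: it compares co-located pixels directly rather than motion-compensated ones, and the threshold value is an assumption, not taken from the specification.

```python
def mad(frame_a, frame_b):
    """Mean absolute difference between two equal-size luma frames,
    each represented as a flat list of pixel values."""
    assert len(frame_a) == len(frame_b)
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def looks_like_scene_cut(prev_frame, curr_frame, threshold=40.0):
    """Naive scene-cut test: high distortion between consecutive frames.
    Note that a flash frame also triggers this test, which is exactly
    the misidentification described in the text."""
    return mad(prev_frame, curr_frame) > threshold
```

Because a mostly-white flash frame yields a large MAD against its dark neighbor, a detector built only on this pairwise test cannot distinguish a flash from a true scene cut.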
Assuming that the events that cause single flash frames (e.g., photographers' flashes) occur randomly with respect to the timing of the repeating GOP pattern, on average, a flash frame will fall on an anchor (I or P) frame 1 out of 3 times for the 15-frame GOP pattern of Table I. When that occurs, the encoder will identify the flash frame as a scene cut and code the flash frame as an I frame. Even if the encoder does not detect and adjust its processing for scene cuts, ⅓ of all flash frames on average will still be coded as anchor frames.
However, coding a flash frame as an I frame is highly undesirable: the flash frame then becomes the anchor frame for predicting the remainder of the frames in the GOP, to which it is poorly correlated, and the entire GOP (typically ½ second) will be badly coded (i.e., a high quantization level is required to meet limited bit-rate requirements).
For example, in the sequence shown in Table I, assume that Frame 6 is an isolated flash frame. According to the GOP pattern, Frame 6 is to be predicted from Frame 3 for encoding as a P frame (i.e., P6). Since Frame 6 is a flash frame, it is probably poorly correlated to Frame 3. As a result, P6 will either require too many bits to render well, or it will be badly coded (i.e., suffer large quantization errors). Furthermore, Frame 6 is the prediction frame for encoding Frame 9 as a P frame. Here, too, since flash-frame Frame 6 will probably be poorly correlated to Frame 9, Frame 9 will either exceed its budgeted bit allocation or it too will be badly coded. If Frame 9 is badly coded, then Frames 7 and 8, which are to be encoded as B frames B7 and B8, respectively, face a poor choice between being predicted from an uncorrelated flash frame (P6) or a badly coded frame (P9). Either way, B7 and B8 will also probably be badly coded.
Next, the errors from P9 will propagate to Frame 12, since Frame 9 is Frame 12's predictor. If enough bits are spent, some of these errors may be reduced. Again, B frames B10 and B11 will suffer, either in picture quality or efficiency. The net effect is that a single, badly correlated flash frame can cause many frames to be badly coded, thereby adversely affecting the quality of the video playback for a significant number of frames.
The present invention is directed to a scheme for detecting and coding sequences of one or more flash frames in video streams. According to the present invention, the occurrence of a sequence of one or more consecutive flash frames is detected in a video stream by looking for a short sequence of frames in which the one or more frames in the sequence are fairly poorly correlated to the frames immediately preceding and following the sequence, while those frames immediately preceding and following the sequence are fairly well-correlated to one another. The coder then takes an appropriate action to code the flash sequence in an efficient manner. For example, in one possible implementation in which each sequence of flash frames contains only one frame, the isolated flash frames are coded as B frames, no matter where they would otherwise fall in the repeating GOP pattern of I, P, and B frames. In that case, the errors that occur in encoding the flash frame will be limited to the flash frame alone, since a B frame is never used as a predictor for other frames. Other coding options are also possible in alternative implementations.
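The "code isolated flash frames as B frames" option described above can be illustrated with a small schedule-adjustment sketch. The details here are assumptions for illustration only: in particular, deferring the anchor role to the frame that follows the flash is one plausible way to preserve the anchor spacing, not a requirement stated in the specification.

```python
def adjust_types(scheduled, flash_frames):
    """scheduled: list of (frame_number, 'I'|'P'|'B') in temporal order.
    Return a new schedule in which each detected isolated flash frame is
    forced to be a B frame (so it is never used as a predictor), with its
    anchor duty deferred to the following frame (an assumed policy)."""
    out = list(scheduled)
    for i, (n, t) in enumerate(out):
        if n in flash_frames and t in ("I", "P"):
            out[i] = (n, "B")                    # flash never anchors
            if i + 1 < len(out):
                out[i + 1] = (out[i + 1][0], t)  # next frame becomes anchor
    return out
```

For the Table I example, if Frame 6 is a detected flash, it would be coded as B6 and its anchor role handed to Frame 7, so any coding errors in the flash frame last exactly one picture.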
According to one embodiment, the present invention is based on a method for processing a video stream. A flash sequence is detected in the video stream, wherein the flash sequence is a set of one or more consecutive pictures in which: (1) a picture preceding the flash sequence is poorly correlated to the flash sequence; (2) a picture following the flash sequence is poorly correlated to the flash sequence; and (3) the picture preceding the flash sequence is well-correlated to the picture following the flash sequence. Video processing is adjusted based on the detection of the flash sequence to generate part of a compressed video bitstream corresponding to the video stream.
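The three conditions of this embodiment can be expressed as a single predicate. The sketch below is a hedged illustration: it reuses a caller-supplied pairwise distortion measure (e.g., the MAD discussed earlier), and the threshold names and values are assumptions, not from the specification.

```python
def is_flash_sequence(before, candidate_frames, after, distortion,
                      poor_threshold=40.0, good_threshold=10.0):
    """Apply the three conditions of the embodiment:
    (1) the preceding picture is poorly correlated to the sequence,
    (2) the following picture is poorly correlated to the sequence,
    (3) the preceding and following pictures are well-correlated.
    High distortion is taken to mean poor correlation."""
    return (all(distortion(before, f) > poor_threshold
                for f in candidate_frames)
            and all(distortion(f, after) > poor_threshold
                    for f in candidate_frames)
            and distortion(before, after) <= good_threshold)
```

A scene cut fails condition (3), since the frames bracketing a cut are themselves uncorrelated, which is what distinguishes the two events.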
According to another embodiment, the present invention is a system for processing a video stream, comprising: (a) a plurality of delay buffers connected in series; (b) a multi-tap switch configured to be positioned to receive picture data from an output of any one of the delay buffers; (c) a video encoder coupled to the switch and configured to receive and code the picture data into a compressed video bitstream corresponding to the video stream; and (d) a flash detector configured to detect a flash sequence in the video stream. The flash sequence is a set of one or more consecutive pictures in which: (1) a picture preceding the flash sequence is poorly correlated to the flash sequence; (2) a picture following the flash sequence is poorly correlated to the flash sequence; and (3) the picture preceding the flash sequence is well-correlated to the picture following the flash sequence. The video encoder adjusts video processing based on the detection of the flash sequence by the flash detector to generate part of the compressed video bitstream.