Video encoding is a commonly used technique for compressing video, i.e., reducing the amount of information needed to represent the video, for the sake of conserving storage or transmission capacity. MPEG-2 is perhaps the most commonly used video encoding standard.
According to the MPEG-2 standard, each picture of a video sequence is divided into an m×n array of macroblocks. Each macroblock comprises a 2×2 array of blocks of luminance pixels, plus one or more blocks of chrominance pixels overlaid thereon, wherein a block is an 8×8 array of pixels. Certain macroblocks are then motion compensated. A macroblock is motion compensated by identifying a prediction macroblock in another picture, called a reference picture, which closely resembles or matches the macroblock to be motion compensated. The prediction macroblock is then subtracted from the to-be-motion-compensated macroblock. The prediction macroblock need not occupy precisely the same spatial coordinates as the to-be-motion-compensated macroblock, and often there is a choice of reference pictures from which the prediction macroblock may be selected. A motion vector identifies the prediction macroblock by its spatial offset from the to-be-motion-compensated macroblock and by the reference picture in which it resides.
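The motion compensation described above can be illustrated with a short sketch. The following is a minimal illustration and not an implementation of the MPEG-2 standard: the function name, the plain-list picture representation, and the 16×16 luminance macroblock indexing are assumptions chosen for clarity.

```python
# Illustrative sketch (not from the MPEG-2 specification): forming a
# prediction error macroblock from a reference picture and a motion vector.

def prediction_error_macroblock(current, reference, mb_row, mb_col, mv):
    """Subtract the motion-compensated prediction from the current
    macroblock. `current` and `reference` are 2-D lists of luminance
    pixels; `mv` is a (dy, dx) offset into the reference picture."""
    MB = 16  # a macroblock covers a 16x16 array of luminance pixels
    top, left = mb_row * MB, mb_col * MB
    dy, dx = mv
    error = []
    for y in range(MB):
        row = []
        for x in range(MB):
            pred = reference[top + y + dy][left + x + dx]
            row.append(current[top + y][left + x] - pred)
        error.append(row)
    return error
```

For an intercoded macroblock, an encoder would search the reference picture for the offset (dy, dx) that minimizes some distortion measure, and transmit that motion vector along with the spatially encoded prediction error.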
As noted above, the prediction macroblock is subtracted from the to-be-motion-compensated macroblock to form a prediction error macroblock. The individual blocks of the prediction error macroblock are then spatially encoded. Some macroblocks are not motion compensated, either because a suitable prediction could not be found therefor or for refreshing purposes (defined below). Such macroblocks are said to be intracoded, whereas macroblocks that are first motion compensated are said to be intercoded. The blocks of the intracoded and the intercoded macroblocks are spatially compressed using the processes of discrete cosine transformation, quantization, (zig-zag or alternate) scanning, run-level encoding and variable length encoding. The macroblocks of selected pictures are also decoded and maintained in storage so that they can be used to reconstruct reference pictures for encoding other pictures. Decoded, reconstructed versions of the reference pictures are used for forming the predictions, in an effort to cause the encoder to use the same reconstructed reference pictures as are available to the decoder. However, for reasons described in greater detail below, the reference pictures reconstructed at the encoder will not always identically match the reference pictures reconstructed at the decoder.
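The scanning and run-level encoding steps named above can be sketched as follows. The traversal below generates the conventional zig-zag scan order for an 8×8 block; the function names and the (run, level) tuple representation are illustrative assumptions (in MPEG-2 the run-level pairs are subsequently variable length encoded against fixed code tables, which is omitted here).

```python
# Illustrative sketch of the zig-zag scan and run-level encoding steps.

def zigzag_order(n=8):
    """Generate (row, col) coordinates of an n x n block in zig-zag order."""
    coords = []
    for s in range(2 * n - 1):
        diag = [(y, s - y) for y in range(n) if 0 <= s - y < n]
        # even diagonals are traversed bottom-left to top-right,
        # odd diagonals top-right to bottom-left
        coords.extend(diag if s % 2 else reversed(diag))
    return coords

def run_level_encode(block):
    """Scan a quantized 8x8 block in zig-zag order and emit
    (run, level) pairs: a count of preceding zero coefficients
    followed by each nonzero coefficient."""
    pairs, run = [], 0
    for y, x in zigzag_order(len(block)):
        coeff = block[y][x]
        if coeff == 0:
            run += 1
        else:
            pairs.append((run, coeff))
            run = 0
    return pairs
```

Because quantization drives most high-frequency coefficients to zero, the zig-zag scan groups those zeros into long runs, which the run-level representation then compresses efficiently.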
Pictures may be designated as one of three types, namely, intracoded (I) pictures, (forward only) predictively encoded (P) pictures and bidirectionally predictively encoded (B) pictures. I pictures contain only intracoded macroblocks. I pictures are used for random access, i.e., as an entry or cue point for presentation of video, as decoding may begin thereon. In addition, I pictures tend to reduce the propagation of errors and refresh the reference pictures in the decoder (as described in greater detail below). P pictures may contain both intracoded macroblocks and intercoded macroblocks. However, the prediction macroblocks used to motion compensate the intercoded macroblocks of a P picture may only originate in a reference picture which is presented before the P picture. B pictures may contain both intracoded and intercoded macroblocks. Prediction macroblocks for B pictures may originate in a picture that is presented before the B picture, a picture that is presented after the B picture or an interpolation of the two. In order to reduce the memory requirements of a decoder which decodes B pictures, the reference pictures which are presented after the B picture are actually inserted into the encoded video signal before the B picture. The decoder can then easily decode the reference picture, that is to be presented after a given B picture, and have it available in advance of the arrival of the given B picture. As such, the reference picture is available for decoding the given B picture. I and P pictures may be used as reference pictures, but B pictures cannot be used as reference pictures.
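The reordering described above, in which each reference picture is inserted into the encoded video signal ahead of the B pictures predicted from it, can be sketched as follows. This is an illustrative sketch of the display-order to transmission-order conversion only; the function name and the (index, type) pair representation are assumptions, not part of the standard.

```python
# Illustrative sketch: converting display order to transmission order
# so each reference (I or P) picture precedes the B pictures that use it.

def transmission_order(display):
    """`display` is a display-order list of picture types ('I', 'P', 'B').
    Returns (display_index, type) pairs in transmission order."""
    out, pending_b = [], []
    for idx, ptype in enumerate(display):
        if ptype == 'B':
            pending_b.append((idx, ptype))  # hold B pictures back
        else:
            out.append((idx, ptype))        # reference picture goes first
            out.extend(pending_b)           # then the B pictures it closes
            pending_b = []
    return out + pending_b  # trailing B pictures, if any
```

For example, the display-order sequence I B B P B B P is transmitted as I P B B P B B, so that each P picture arrives before the B pictures that are predicted from it.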
The compressed video signal data formed using the above compression processes is formatted with header information and parameters inserted into the formatted video signal. In formatting the video signal, the video signal is hierarchically divided into the following layers: picture sequence, group of pictures, picture, slice, macroblock and block. The group of pictures and slice layers are optional. The group of pictures layer is useful for providing random access, as each group of pictures must start with an I picture. The slice layer is useful for providing error recovery. Each slice includes a contiguous sequence of adjacent macroblocks. Slices do not span more than one macroblock row but may include a variable length contiguous sequence of macroblocks in a given macroblock row. If an error is detected in the data of a slice, all subsequent data of that slice is either discarded or the errors in that slice are concealed using an error concealment process.
MPEG-2 supports both progressive and interlaced video. In the case of interlaced video, each macroblock of a frame can be selectively encoded as a frame macroblock or as two separate field macroblocks.
In the case of video conferencing, it is desirable to transmit low delay, low bit rate video signals. To reduce the delay, B pictures are preferably not used. This removes the latency associated with reordering pictures. (Recall that reference pictures that follow B pictures are inserted into the encoded video signal before the B pictures predicted from such reference pictures.) In addition, fields are coded as separate pictures. To reduce the bit rate, an I field picture is only used at the very beginning of the encoded video signal. Thereafter, each field picture is encoded as a P field picture. Since scene changes are unlikely to occur in video conference sessions, and the picture to picture motion is low, adequate video fidelity can be achieved, even at low bit rates and even though B pictures are not used.
As noted above, I pictures serve three purposes. One is providing random access, which is of low concern in a video conference. A second is to recover from errors, e.g., when the video signal is totally lost or partially corrupted. A third is to “refresh” the pictures—most notably, the portions of the (reference) pictures used for predicting other pictures.
Refreshing of pictures is of great concern in a video conference. Specifically, the discrete cosine transformers of encoders do not always match the inverse discrete cosine transformers of decoders, especially when made by different manufacturers. Thus, although both the encoder and the decoder use decoded reconstructed reference pictures to form predictions, a decoder might not produce precisely the same decoded reconstructed reference picture as the encoder. As a result, any prediction derived by the decoder from a reconstructed reference picture may diverge from (i.e., will have slightly different data than) the decoded, reconstructed reference picture used at the encoder to motion compensate subsequent pictures. The same is true of the decoded prediction error—the prediction error produced by the encoder may vary slightly from the prediction error decoded at the decoder. As such, a predicted picture decoded and reconstructed at a decoder will be slightly different than at the encoder. If this predicted picture is, in turn, used as a reference picture, then the divergence between the encoder and decoder reconstructed pictures will propagate and compound. Recall that a low bit rate video signal used in a video conference is formed as a sequence of a single I field picture followed by only P field pictures. Each P field picture will be predicted, at least in part, from a preceding P field picture. Thus, even if an error does not occur, intracoding is needed to prevent the propagating and compounding divergence of reconstructed pictures produced at the decoder relative to the reconstructed pictures produced at the encoder.
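The compounding divergence described above can be illustrated with a toy numerical model. This model is not taken from any standard: the scalar "drift" value, the per-picture mismatch size and the refresh period are deliberate simplifications, intended only to show that prediction carries the previous mismatch forward while intracoding resets it.

```python
# Toy model (not from the MPEG-2 specification) of reference-picture
# drift: each predictively coded picture reuses the previous
# reconstruction, so a small per-picture encoder/decoder mismatch
# (e.g. from differing inverse DCT implementations) accumulates
# until an intracoded refresh resets it.

def simulate_drift(num_pictures, mismatch=0.1, refresh_period=None):
    """Return the per-picture divergence between the encoder's and
    decoder's reconstructed reference pictures; intracoding (every
    `refresh_period` pictures) resets the divergence to zero."""
    drift, history = 0.0, []
    for k in range(num_pictures):
        if refresh_period and k % refresh_period == 0:
            drift = 0.0            # intracoded: no prediction, no drift
        else:
            drift += mismatch      # prediction carries prior drift forward
        history.append(drift)
    return history
```

Without any intracoding the modeled divergence grows without bound, whereas a periodic intracoded refresh keeps it bounded, which is the motivation for the refresh techniques discussed next.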
As noted above, it is preferable not to use I pictures (except at the very start of a video conference) in order to maintain a low bit rate. Conventionally, a technique known as intra slice refresh is instead used in an attempt to alleviate the reference picture divergence problem. Specifically, slices are defined for each macroblock row. While MPEG-2 permits slices which span less than the full width of a row, according to the intra slice refresh technique, each defined slice spans the entire width of the picture. A display screen can therefore display pictures which are each made up of a vertical sequence of contiguous slices (assuming that the display screen has the same dimensions as the picture). Over a fixed sequence of L>1 pictures, a different subset of slices is selected for refreshing in each picture. Each subset selected within a sequence of L pictures has slices with pixels at different row positions than each other subset selected within the same sequence of L pictures. Furthermore, the pixel rows of the slices in the union of all subsets over the sequence of L pictures include every pixel row of the display screen. For example, one manner of selecting the subsets of slices is to select approximately P slices (where P here denotes a count of slices, not a picture type) for refreshing in each picture, where each selected slice in a picture is mutually vertically spaced from the closest other selected slices of that same picture. Thus, if the number of slice rows per field is 15 (vertically sequentially numbered 1 to 15), and P=4, then during the l=1st picture, slices in rows 1, 5, 9 and 13 are refreshed. During the l=2nd picture, slices in rows 2, 6, 10 and 14 are refreshed. During the l=3rd picture, slices in rows 3, 7, 11 and 15 are refreshed. During the l=4th picture, slices in rows 4, 8 and 12 are refreshed. In short, over a sequence of L pictures, each row of pixels is refreshed exactly once.
Stated another way, if the display screen can display a moving picture image formed by a sequence of displayed pictures, then over the sequence of L pictures, each row of pixels in the moving picture image is refreshed exactly once.
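The refresh schedule in the example above (15 slice rows, a refresh period of four pictures) can be sketched as follows; the function name and the 1-based row numbering are illustrative assumptions.

```python
# Illustrative sketch of the conventional intra slice refresh schedule:
# picture l of a cycle refreshes slice rows l, l+period, l+2*period, ...

def refresh_schedule(num_rows, period):
    """Return, for each of the `period` pictures in one refresh cycle,
    the list of slice rows (1-based) refreshed in that picture."""
    return [list(range(l, num_rows + 1, period))
            for l in range(1, period + 1)]
```

Over one cycle, every slice row appears in exactly one subset, so every pixel row of the display screen is refreshed exactly once per L pictures.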
The intra slice refresh technique has drawbacks, however. First, as noted above, slices are designated for error isolation. Specifically, if an error is detected in a slice, each subsequent macroblock in that slice is discarded (or error concealment is applied to such macroblocks) until the next slice or layer is reached. Increasing the number of slices thus limits the propagation of errors. However, this error isolation role is not consistent with the primary purpose of intra slice refresh, namely, limiting the divergence of the reference pictures between the encoder and the decoder.
Intra slice refresh also does not accord well with interlaced pictures. Specifically, field pictures within the same frame are often highly correlated, especially in a low motion sequence of pictures as is typical in a video conference. Intra slice refreshing techniques do not take this correlation into account and tend to refresh each field component of a slice during different frames.
Moreover, refreshing sequential full-length slices, which span the entire width of the frame, over a sequence of pictures tends to produce a visible artifact. Often, this artifact appears as a visually discernible band (corresponding to the refreshed slices) that scrolls vertically over the moving picture image.
Lastly, it should be noted that each slice begins with a slice header of a certain number of bits. Thus, a certain amount of overhead bits of the compressed video bitstream must be allocated specifically for such slice headers in order to perform the intra slice refresh technique. In other words, it would not be possible to omit the slice headers altogether (in all, or only certain, encoded pictures, for example, to conserve bandwidth) and still refresh using the intra slice refresh technique.
Accordingly, it is an object of the present invention to overcome the disadvantages of the prior art.