Although the H.264 standard is specifically referenced below, the ideas and principles described are independent of a particular video coding standard and are equally valid for MPEG-2, ISO MPEG-4 Part 10 (AVC), or the emerging SMPTE VC-1 standard.
There is constant demand in the video compression industry for greater video encoding efficiency, particularly in real-time applications such as television broadcasting and video conferencing. The recent ITU-T H.264 standard is designed to meet this demand for increased efficiency at the cost of a corresponding increase in algorithm complexity. For instance, an H.264 video stream requires approximately half the bit rate of an MPEG-2 video stream to achieve the same visual quality, while an H.264 encoder is an order of magnitude more complex to implement than an MPEG-2 encoder.
In a typical application, an uncompressed video stream, made up of a sequence of pictures, is received by a video encoder and the video encoder creates an encoded version of each picture in the video sequence, thereby creating an encoded version of the uncompressed video stream. The encoded video stream is then transmitted to a video decoder over a constant bit rate (CBR) channel and the video decoder decodes the encoded video stream, thereby generating an uncompressed video stream that is ideally visually indistinguishable from the original uncompressed video stream.
The more bits that the encoder uses to create the compressed version of a picture of the video sequence, the longer it will take to transmit the compressed version of the picture over the CBR channel. Within the decoder, when encoded pictures are received, they are loaded into a decoder buffer to await decoding. The bits of an encoded picture are loaded into the decoder buffer sequentially as they arrive, but at the picture's decode time, all of the bits are removed from the buffer simultaneously.
In the simplified model above, over a particular period, the decoder buffer will receive a constant number of bits corresponding to a variable number of pictures. Over the same period, the decoder will remove a constant number of pictures from the decoder buffer, corresponding to a variable number of bits. If the encoder is transmitting many relatively large pictures during the period, then the number of pictures received during the period will be relatively small. This can cause the decoder buffer to empty, or underflow, as the decoder may remove pictures at a faster rate than the buffer is receiving new pictures. Conversely, if the decoder buffer is receiving many relatively small pictures during the period, then the number of pictures received during the period will be relatively large. This can cause the decoder buffer to overflow, as the buffer may receive new pictures at a faster rate than the decoder is removing them. Both underflow and overflow may disrupt the uncompressed video stream generated by the decoder, and therefore neither should be allowed to occur.
It is therefore important that a video encoder consider the fullness of a downstream decoder's buffer while generating the encoded video stream. However, the decoder cannot communicate with the encoder, and therefore the actual fullness of the decoder buffer is not available to the encoder. To this end, compressed video standards, such as H.264, define a hypothetical reference decoder (HRD), and the video encoder maintains a mathematical model of the HRD's coded picture buffer, generally called a virtual buffer. If the virtual buffer never overflows or underflows, then the decoder buffer will, conversely, never underflow or overflow. The encoder can then regulate the encoded video stream to avoid underflow or overflow of the downstream decoder buffer by sizing the encoded versions of the pictures of the video stream to maintain the fullness of the virtual buffer at a safe level.
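The buffer behavior described above can be illustrated with a small simulation. This is a minimal sketch, not part of any standard's HRD definition: the function name, its parameters, and the assumption that bits arrive continuously at exactly the channel rate are all illustrative choices.

```python
def simulate_decoder_buffer(picture_bits, bit_rate, picture_rate,
                            buffer_size, initial_delay):
    """Simulate a decoder buffer fed over a constant bit rate channel.

    Assumptions (illustrative only): bits arrive continuously at bit_rate,
    and picture n is removed all at once at initial_delay + n / picture_rate.
    Returns a list of (picture_index, "underflow" | "overflow") events.
    """
    events = []
    total_bits = sum(picture_bits)
    removed = 0  # bits already taken out of the buffer by decoding
    for n, size in enumerate(picture_bits):
        decode_time = initial_delay + n / picture_rate
        # The channel cannot deliver more bits than the encoder produced.
        arrived = min(bit_rate * decode_time, total_bits)
        # Fullness peaks just before a removal, so check overflow here.
        if arrived - removed > buffer_size:
            events.append((n, "overflow"))
        # The whole picture must be present at its decode time.
        if arrived < removed + size:
            events.append((n, "underflow"))
        removed += size
    return events


# Evenly sized pictures that match the channel rate cause no events.
steady = simulate_decoder_buffer([1000] * 10, 30000, 30, 4000, 0.1)
# One very large picture arrives too slowly and the buffer underflows.
large = simulate_decoder_buffer([1000, 10000, 1000], 30000, 30, 4000, 0.1)
# Many small pictures pile up in a buffer that is too small.
small = simulate_decoder_buffer([100] * 30, 30000, 30, 2000, 0.1)
```

In this sketch a large picture produces an underflow event because its bits have not all arrived by its decode time, while a run of small pictures fills the buffer beyond its capacity.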
To meet the computing requirements of newer, more computationally complex video coding standards, multi-processor designs that use parallel processing can be implemented. For example, a single processor could be assigned to encode I and P pictures, with the remaining processors assigned to encode B pictures. More processors are used for B pictures because more computing cycles are required to encode a B picture than to encode an I or P picture, and there are generally significantly more B pictures in a given video stream than I and P pictures. For simplicity, it is assumed for this example that B pictures are not used as reference pictures, although the H.264 standard allows otherwise. In the following group of pictures (GOP) structure description, the pictures are listed in display order and the index following each picture type indicates decode (encode) order:
I0 B2 B3 P1 B5 B6 P4 B8 B9 P7 B11 B12 P10 B14 B15 I13 . . .
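The relationship between the listed order and the encode order can be made concrete with a short sketch. The helper name `to_encode_order` is hypothetical; it simply sorts the picture labels by their numeric index.

```python
import re


def to_encode_order(display_order):
    """Sort labels such as 'B2' by the decode (encode) index they carry."""
    def index_of(label):
        return int(re.fullmatch(r"[IPB](\d+)", label).group(1))
    return sorted(display_order, key=index_of)


gop = "I0 B2 B3 P1 B5 B6 P4 B8 B9 P7 B11 B12 P10 B14 B15 I13".split()
encode_order = to_encode_order(gop)
print(" ".join(encode_order))
```

Each reference picture (I or P) moves ahead of the B pictures that depend on it, which is why P1 is encoded before B2 and B3.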
Table 1 shows, in simplified form, how the pictures of the above GOP could be distributed in a multi-processor encoder. In Table 1, each time slot u represents the real-time display duration for three pictures, so, if the input was NTSC video, each time slot would be 3003/30000 seconds in duration.
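TABLE 1
Processor           u = 1  u = 2  u = 3  u = 4  u = 5  u = 6  u = 7  u = 8
1                   I0     P1     P4     P7     P10    I13    P16    P19
2                   -      -      B2     B2     B8     B8     B14    B14
3                   -      -      B3     B3     B9     B9     B15    B15
4                   -      -      -      B5     B5     B11    B11    B17
5                   -      -      -      B6     B6     B12    B12    B18
encoded pictures    -      -      -      I0     B2     B5     B8     B11
available           -      -      -      P1     B3     B6     B9     B12
                    -      -      -      -      P4     P7     P10    I13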
As shown in Table 1, processor 1 is used to encode all reference pictures, including I and P pictures. Processors 2, 3, 4 and 5 are used to encode B pictures. Thus, for this illustrative example, there is a pipeline delay of at least two time slots after the processing pipeline is initially filled. Note that each B picture is assigned two time slots while each I or P picture is assigned one time slot. As soon as the encoding of all necessary reference pictures is completed, the encoding of the referring B pictures begins in parallel. For example, the encoding of pictures B5 and B6 begins in parallel in processors 4 and 5 respectively as soon as the encoding of picture P4 is finished. Processor 1 keeps encoding I or P pictures regardless of activities in the other processors. The last row in Table 1 shows the encoded pictures as they become available in encode order. Table 1 is a simplified example. In practice, more activities are considered, such as receiving uncompressed pictures, encoding pictures, sending reconstructed pictures, receiving reconstructed pictures, necessary delay, etc. However, since these details are not essential to the present example, they have been omitted.
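The assignment rules described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the function names are hypothetical, reference pictures occupy one slot each on processor 1, each B picture occupies two slots starting in the slot after its last reference picture finishes, and B pictures rotate over processors 2 to 5.

```python
ENCODE_ORDER = ["I0", "P1", "B2", "B3", "P4", "B5", "B6", "P7",
                "B8", "B9", "P10", "B11", "B12", "I13", "B14", "B15"]


def build_schedule(encode_order, b_slots=2):
    """Return {processor: {time_slot: label}} for the simplified pipeline."""
    schedule = {proc: {} for proc in range(1, 6)}
    ref_slot = 0   # slot in which the latest reference picture is encoded
    b_count = 0    # number of B pictures scheduled so far
    for label in encode_order:
        if label[0] in "IP":
            ref_slot += 1
            schedule[1][ref_slot] = label
        else:
            proc = 2 + b_count % 4  # rotate over processors 2, 3, 4, 5
            for u in range(ref_slot + 1, ref_slot + 1 + b_slots):
                schedule[proc][u] = label
            b_count += 1
    return schedule


def starting_in_slot(schedule, u):
    """Labels of the pictures whose encoding begins in time slot u."""
    return [row[u] for row in schedule.values()
            if u in row and row.get(u - 1) != row[u]]


schedule = build_schedule(ENCODE_ORDER)
print(starting_in_slot(schedule, 4))
```

For time slot 4 the sketch reports P7, B5 and B6, the three pictures that begin encoding together once P4 is finished.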
An important task of a video encoder is to find an appropriate balance between the desired image quality of the encoded video stream, the bit rate limitations of the channel over which the video stream is being transmitted, and maintaining a safe level of fullness in the decoder buffer. A rate control algorithm in the encoder uses the fullness of the virtual buffer and the relative complexity of individual pictures to calculate an appropriate allocation of bits to each picture of the video stream. In sequential encoding, the rate control algorithm checks the virtual buffer fullness upon the completion of the encoding of each picture before the encoding of the subsequent picture begins. In the case of the pipelined parallel encoder, the fullness of the virtual buffer is not immediately available to the rate control algorithm due to the pipeline delay induced by the simultaneous and non-sequential encoding of multiple pictures. Because all the processors in the pipelined parallel encoder operate independently of one another, the concept of “subsequent picture” must be replaced by “subsequent pictures after some fixed delay” and during that delay several additional pictures will be encoded, thus altering the fullness of the virtual buffer. For example, according to sequential encoding order, after completing the encoding of picture B6, the rate control algorithm checks the virtual buffer fullness before encoding picture P7. This is impossible for the parallel encoder described above, as the encoding of pictures B6 and P7 begins simultaneously, and the encoding of picture P7 is completed before the encoding of picture B6 is finished. Therefore the rate control algorithm will need to predict the fullness of the virtual buffer at some point in the future rather than simply checking the current value. A related issue with a parallel encoder that complicates the requirements for the rate control algorithm is the potential need for stuff bits. 
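One simple way to picture the prediction problem is sketched below. It is an illustration only, not the method of any particular encoder: it assumes the virtual buffer gains the bits of each completed picture and drains at the channel bit rate, and that the bit targets of pictures still being encoded stand in for their unknown final sizes. The function name and the sample numbers are hypothetical.

```python
from fractions import Fraction

# Duration of one time slot for NTSC input: three pictures of 1001/30000 s.
SLOT_SECONDS = Fraction(3003, 30000)


def predict_fullness(fullness_now, in_flight, bit_rate, delay_slots):
    """Estimate virtual buffer fullness delay_slots time slots from now.

    in_flight is a list of (target_bits, slots_until_complete) pairs for
    pictures that are currently being encoded (assumed inputs).
    """
    # Only pictures that finish within the delay add bits to the buffer.
    added = sum(bits for bits, remaining in in_flight
                if remaining <= delay_slots)
    # The channel drains the buffer at a constant rate throughout.
    drained = bit_rate * SLOT_SECONDS * delay_slots
    return fullness_now + added - drained


predicted = predict_fullness(
    fullness_now=1_000_000,
    in_flight=[(300_000, 1), (100_000, 2), (100_000, 2), (250_000, 3)],
    bit_rate=4_000_000,
    delay_slots=2,
)
print(predicted)  # 699200
```

With these sample values, 500,000 bits are expected to be added and 800,800 bits drained over two slots, so the estimate is 699,200 bits. The picture that needs three more slots is excluded because it does not complete within the delay.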
An encoder needs to insert stuff bits into the encoded video stream when the virtual buffer is empty or is in danger of becoming so. A sequential encoder's rate control algorithm knows exactly how many bits to stuff as soon as a picture has finished encoding. However, a parallel encoder's rate control algorithm must calculate the needed number of stuff bits in a different way, because the aforementioned pipeline delay makes an accurate measure of the virtual buffer's true fullness unavailable. Another requirement of a parallel rate control algorithm is the ability to determine bit rate targets for several pictures simultaneously. As shown in Table 1, at the beginning of time slot 4, for example, a parallel rate control algorithm needs to determine bit rate targets for pictures P7, B5, and B6.
It is widely recognized in the video compression industry that dual-pass encoding provides higher coding efficiency than single-pass encoding. However, the increase in efficiency would not outweigh the comparatively large cost of using two pipelined parallel encoders together in a dual-pass architecture, and so dual-pass encoding is not always a practical solution.
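For the sequential case, the stuffing calculation can be sketched as follows. This is a minimal illustration under the assumption that the virtual buffer gains the bits of the picture just encoded and drains by a fixed number of bits per picture period; the function name and the `floor` parameter are hypothetical.

```python
def stuff_bits_needed(fullness, picture_bits, drain_bits, floor=0):
    """Bits to pad so the virtual buffer does not fall below floor.

    fullness is the buffer level before the picture is added, picture_bits
    is the size of the picture just encoded, and drain_bits is the number
    of bits the channel removes during one picture period.
    """
    after = fullness + picture_bits - drain_bits
    return max(0, floor - after)


# A small picture leaves the buffer 63,000 bits short of empty.
shortfall = stuff_bits_needed(50_000, 20_000, 133_000)
# With a fuller buffer no stuffing is required.
no_stuffing = stuff_bits_needed(500_000, 20_000, 133_000)
```

The calculation depends on knowing `fullness` exactly at the moment the picture completes, which is the quantity a pipelined parallel encoder does not have.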
Referring to FIG. 1, a sequential dual-pass encoder 1 receives an unencoded video stream as input and transmits an encoded single program transport stream (SPTS) as output, which may be transmitted at either a variable bit rate (VBR) or a constant bit rate (CBR), depending on the application. The dual-pass encoder 1 includes a 1st pass encoder 2, a 2nd pass encoder 3 and a storage and delay unit 4. The 1st pass encoder 2 is relatively simple compared to the 2nd pass encoder 3. For instance, the input to the 1st pass encoder 2 may be downsampled relative to the input to the 2nd pass encoder 3. The 1st pass encoder 2 and the storage and delay unit 4 receive an uncompressed video stream as input. The storage and delay unit 4 buffers the video stream while the simple 1st pass encoder 2 calculates complexity information for each picture in the video stream. The pictures of the video stream and the corresponding complexity statistics are then transmitted to the 2nd pass encoder 3. The 2nd pass encoder 3 utilizes the complexity information generated by the 1st pass encoder 2 to create an encoded version of the input video stream. By using a simple 1st pass encoder 2 instead of a more sophisticated encoder at the input, the implementation cost is reduced to nearly that of a single-pass sophisticated encoder. However, because of the implementation differences between the 1st and 2nd pass encoders, the complexity information generated by the 1st pass encoder 2 is not exactly the information desired by the relatively sophisticated 2nd pass encoder 3.
Despite this deficiency, a correlation exists between picture complexity estimation in 1st pass encoding and picture complexity estimation in 2nd pass encoding. In most cases, a picture or a group of pictures (GOP) that is relatively complicated or simple for the 1st pass encoder 2 is also relatively complicated or simple for the 2nd pass encoder 3. The complexity statistics still indicate important relationships among pictures and macro-blocks (MBs), with the error being tolerable. Therefore, compared to single-pass sophisticated coding, the dual-pass encoder offers superior video coding efficiency at only a slightly higher implementation cost.
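Because only the relative complexity of pictures needs to carry over from the 1st pass to the 2nd pass, one common way to use such statistics is to split a bit budget in proportion to them. The sketch below illustrates that idea only; proportional allocation is an assumption made here for illustration, and the function name, complexity values and budget are hypothetical.

```python
def allocate_bits(first_pass_complexity, budget_bits):
    """Split budget_bits among pictures in proportion to their complexity.

    first_pass_complexity maps a picture label to a positive complexity
    score; only the ratios between the scores affect the result.
    """
    total = sum(first_pass_complexity.values())
    return {label: budget_bits * score / total
            for label, score in first_pass_complexity.items()}


# Pictures that start encoding together receive their targets together.
targets = allocate_bits({"P7": 6, "B5": 2, "B6": 2}, 400_000)
print(targets)
```

Scaling every complexity score by the same factor leaves the targets unchanged, which reflects why a consistent bias in the simpler 1st pass encoder can be tolerated.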