The present invention relates to a hypothetical reference decoder.
A digital video system includes a transmitter and a receiver which assemble video comprising audio, images, and ancillary components for coordinated presentation to a user. The transmitter system includes subsystems to receive and compress the digital source data (the elementary or application data streams representing a program's audio, video, and ancillary data components); multiplex the data from the several elementary data streams into a single transport bit stream; and transmit the data to the receiver. At the receiver the transport bit stream is demultiplexed into its constituent elementary data streams. The elementary data streams are decoded and the audio and video data streams are delivered as synchronized program elements to the receiver's presentation subsystem for display as parts of a coordinated program.
In many video coding standards, a compliant bit stream must be decodable by a hypothetical decoder that is conceptually connected to the output of an encoder and consists of a decoder buffer, a decoder, and a display unit. This virtual decoder is known as the hypothetical reference decoder (HRD) in H.263 and the video buffering verifier (VBV) in MPEG-2. The encoder creates a bit stream so that the hypothetical decoder buffer does not overflow or underflow.
If the bit stream is not constrained in this way, the quantity of data the receiver may be required to buffer might exceed its capacity (a condition of memory overflow) or throughput capabilities. Alternatively, the receiver may fail to receive all of the data in a data access unit in time for decoding and synchronized presentation with a specified instant in the audio or video data streams, resulting in a loss of data and inconsistent performance (a condition of memory underflow).
In existing hypothetical reference decoders, the video bit stream is received at a given constant bit rate (usually the average rate in bits/sec of the stream) and is stored into the decoder buffer until the buffer fullness reaches a desired level. Such a desired level is denoted as the initial decoder buffer fullness and is directly proportional to the transmission or start-up (buffer) delay. At that point, the decoder instantaneously removes the bits for the first video frame of the sequence, decodes the bits, and displays the frame. The bits for the following frames are also removed, decoded, and displayed instantaneously at subsequent time intervals.
Traditional hypothetical decoders operate at a fixed bit rate, buffer size, and initial delay. However, in many of today's video applications (e.g., video streaming through the Internet or ATM networks) the available bandwidth varies according to the network path (e.g., how the user connects to the network: by modem, ISDN, DSL, cable, etc.) and also fluctuates in time according to network conditions (e.g., congestion, the number of users connected, etc.). In addition, the video bit streams are delivered to a variety of devices with different buffer capabilities (e.g., hand-sets, PDAs, PCs, Set-top-boxes, DVD-like players, etc.) and are created for scenarios with different delay requirements (e.g., low-delay streaming, progressive download, etc.). As a result, these applications require a more flexible hypothetical reference decoder that can decode a bit stream at different peak bit rates, and with different buffer sizes and start-up delays.
Jordi Ribas-Corbera and Philip A. Chou in a paper entitled, “A Generalized Hypothetical Reference Decoder For H.26L”, on Sep. 4, 2001, proposed a modified hypothetical reference decoder. The decoder operates according to N sets of rate and buffer parameters for a given bit stream. Each set characterizes what is known as a leaky bucket model and contains three values (R, B, F), where R is the transmission bit rate, B is the buffer size, and F is the initial decoder buffer fullness (F/R is the start-up or initial buffer delay). An encoder can create a video bit stream that is contained by some desired N leaky buckets, or can simply compute the N sets of parameters after the bit stream has been generated. The hypothetical reference decoder may interpolate among the leaky bucket parameters and can operate at any desired peak bit rate, buffer size, or delay. For example, given a peak transmission rate R′, the reference decoder may select the smallest buffer size and delay (according to the available leaky bucket data) that will be able to decode the bit stream without suffering from buffer underflow or overflow. Conversely, for a given buffer size B′, the hypothetical decoder may select and operate at the minimum required peak transmission rate.
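The interpolation among leaky bucket parameters described above can be sketched as follows. This is a minimal illustration rather than the method of the cited paper: the function name interpolate_bucket is hypothetical, the N buckets are assumed to be sorted by increasing R, and B and F are assumed to be interpolated linearly between the two neighboring buckets. Rates below the smallest signaled R are rejected because no extrapolation rule is assumed here.

```python
from bisect import bisect_left


def interpolate_bucket(buckets, rate):
    """Pick (R', B', F') for a requested peak rate from N signaled buckets.

    buckets: list of (R, B, F) tuples sorted by increasing R (assumed layout).
    rate:    requested peak transmission rate R' in bits per second.
    """
    rates = [r for r, _, _ in buckets]
    if rate < rates[0]:
        # No extrapolation below the lowest signaled rate in this sketch.
        raise ValueError("rate is below the smallest signaled leaky bucket rate")
    if rate >= rates[-1]:
        # A higher rate with the same buffer size and fullness is assumed safe.
        _, b, f = buckets[-1]
        return rate, b, f
    k = bisect_left(rates, rate)  # first index with rates[k] >= rate
    if rates[k] == rate:
        return buckets[k]
    (r0, b0, f0), (r1, b1, f1) = buckets[k - 1], buckets[k]
    w = (rate - r0) / (r1 - r0)  # fractional position between the two buckets
    return rate, b0 + w * (b1 - b0), f0 + w * (f1 - f0)


if __name__ == "__main__":
    signaled = [(1_000_000, 4_000_000, 2_000_000), (2_000_000, 1_000_000, 500_000)]
    r, b, f = interpolate_bucket(signaled, 1_500_000)
    print(r, b, f, "start-up delay (s):", f / r)
```

The start-up delay for the selected operating point follows as F′/R′.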
There are benefits of using such a generalized hypothetical reference decoder. For example, a content provider can create a bit stream once, and a server can deliver it to multiple devices of different capabilities, using a variety of channels of different peak transmission rates. Or a server and a terminal can negotiate the best leaky bucket for the given networking conditions—e.g., the ones that will produce the lowest start-up (buffer) delay, or the one that will require the lowest peak transmission rate for the given buffer size of the device.
As described in Document VCEG-58 Sections 2.1-2.4, a leaky bucket is a model for the state (or fullness) of an encoder or decoder buffer as a function of time. The fullness of the encoder and the decoder buffer are complements of each other. A leaky bucket model is characterized by three parameters (R, B, F), where:
R is the peak bit rate (in bits per second) at which bits enter the decoder buffer. In constant bit rate scenarios, R is often the channel bit rate and the average bit rate of the video clip.
B is the size of the bucket or decoder buffer (in bits) which smooths the video bit rate fluctuations. This buffer size cannot be larger than the physical buffer of the decoding device.
F is the initial decoder buffer fullness (also in bits) before the decoder starts removing bits from the buffer. F and R determine the initial or start-up delay D, where D=F/R seconds.
In a leaky bucket model, the bits enter the buffer at rate R until the level of fullness is F (i.e., for D seconds), and then b0 bits for the first frame are instantaneously removed. The bits keep entering the buffer at rate R and the decoder removes b1, b2, . . . , bn−1 bits for the following frames at some given time instants, typically (but not necessarily) every 1/M seconds, where M is the frame rate of the video. FIG. 1 illustrates the decoder buffer fullness along time of a bit stream that is constrained in a leaky bucket of parameters (R, B, F).
Let Bi be the decoder buffer fullness immediately before removing bi bits at time ti. A generic leaky bucket model operates according to the following equations:
$$B_0 = F$$
$$B_{i+1} = \min\left(B,\; B_i - b_i + R\,(t_{i+1} - t_i)\right), \quad i = 0, 1, 2, \ldots \tag{1}$$
Typically, $t_{i+1} - t_i = 1/M$ seconds, where M is the frame rate (normally in frames/sec) for the bit stream.
A leaky bucket model with parameters (R, B, F) contains a bit stream if there is no underflow of the decoder buffer. Because the encoder and decoder buffer fullness are complements of each other this is equivalent to no overflow of the encoder buffer. However, the encoder buffer (the leaky bucket) is allowed to become empty, or equivalently the decoder buffer may become full, at which point no further bits are transmitted from the encoder buffer to the decoder buffer. Thus, the decoder buffer stops receiving bits when it is full, which is why the min operator in equation (1) is included. A full decoder buffer simply means that the encoder buffer is empty.
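The containment test implied by equation (1) can be sketched as a short simulation. This is an illustrative sketch only: the function name leaky_bucket_contains is hypothetical, frames are assumed to be removed every 1/M seconds, and F is assumed not to exceed B.

```python
def leaky_bucket_contains(frame_bits, rate, buffer_size, initial_fullness, frame_rate):
    """Return True if the leaky bucket (R, B, F) contains the bit stream.

    frame_bits:       bits b_0, b_1, ... removed for each frame.
    rate:             peak bit rate R in bits per second.
    buffer_size:      decoder buffer size B in bits.
    initial_fullness: initial decoder buffer fullness F in bits.
    frame_rate:       frame rate M, so frames are removed every 1/M seconds.
    """
    if initial_fullness > buffer_size:
        raise ValueError("initial fullness F must not exceed buffer size B")
    fullness = initial_fullness  # B_0 = F
    for bits in frame_bits:
        if bits > fullness:
            return False  # decoder buffer underflow: frame not fully received
        # Equation (1): the min models a full decoder buffer that stops receiving.
        fullness = min(buffer_size, fullness - bits + rate / frame_rate)
    return True


if __name__ == "__main__":
    stream = [3000, 1000, 1000, 1000]
    print(leaky_bucket_contains(stream, 30000, 4000, 3000, 30))
    print(leaky_bucket_contains(stream, 30000, 4000, 2000, 30))
```

In this example 1000 bits arrive per frame interval, so the stream is contained when F is 3000 bits but not when F is 2000 bits, since the first frame needs 3000 bits.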
The following observations may be made:
A given video stream can be contained in many leaky buckets. For example, if a video stream is contained in a leaky bucket with parameters (R, B, F), it will also be contained in a leaky bucket with a larger buffer (R, B′, F), B′>B, or in a leaky bucket with a higher peak transmission rate (R′, B, F), R′>R.
For any bit rate R′, the system can always find a buffer size that will contain the (time-limited) video bit stream. In the worst case (R′ approaches 0), the buffer size will need to be as large as the bit stream itself. Put another way, a video bit stream can be transmitted at any rate (regardless of the average bit rate of the clip) as long as the buffer size is large enough.
Assume that the system fixes F=aB for all leaky buckets, where a is the desired initial buffer fullness expressed as a fraction of the buffer size. For each value of the peak bit rate R, the system can find the minimum buffer size Bmin that will contain the bit stream using equation (1). The resulting curve of R-B values is shown in FIG. 2.
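One way to find Bmin for a given R is sketched below. The source does not specify a search procedure; the binary search used here is an assumption, justified by the fact that with F=aB a larger B never makes the stream harder to contain. The names contains and min_buffer_size are hypothetical, and frames are assumed to be removed every 1/M seconds.

```python
import math


def contains(frame_bits, rate, buffer_size, initial_fullness, frame_rate):
    """Simulate equation (1); True if the decoder buffer never underflows."""
    fullness = initial_fullness
    for bits in frame_bits:
        if bits > fullness:
            return False
        fullness = min(buffer_size, fullness - bits + rate / frame_rate)
    return True


def min_buffer_size(frame_bits, rate, frame_rate, a):
    """Smallest integer B (in bits) such that (R, B, F = a*B) contains the stream."""
    if not 0 < a <= 1:
        raise ValueError("a must be in (0, 1]")
    # With F >= total bits the stream is always contained, so this bound is safe.
    lo, hi = 0, math.ceil(sum(frame_bits) / a) + 1
    while lo < hi:
        mid = (lo + hi) // 2
        if contains(frame_bits, rate, mid, a * mid, frame_rate):
            hi = mid  # mid works; try a smaller buffer
        else:
            lo = mid + 1  # mid underflows; a larger buffer is needed
    return lo


if __name__ == "__main__":
    stream = [3000, 1000, 1000, 1000]
    for a in (1.0, 0.5):
        print(a, min_buffer_size(stream, 30000, 30, a))
```

Repeating the search over a range of R values yields sample points of an R-B curve of the kind described above.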
By observation, the curve of (Rmin, Bmin) pairs for any bit stream (such as the one in FIG. 2) is piecewise linear and convex. Hence, if N points of the curve are provided, the decoder can linearly interpolate the values to arrive at some points (Rinterp, Binterp) that are slightly but safely larger than (Rmin, Bmin). In this way, one is able to reduce the buffer size, and consequently also the delay, by an order of magnitude, relative to a single leaky bucket containing the bit stream at its average rate. Alternatively, for the same delay, one is able to reduce the peak transmission rate by a factor of four, or possibly even improve the signal-to-noise ratio by several dB.
MPEG Video Buffering Verifier (VBV)
The MPEG video buffering verifier (VBV) can operate in two modes: constant bit rate (CBR) and variable bit rate (VBR). MPEG-1 only supports the CBR mode, while MPEG-2 supports both modes.
The VBV operates in CBR mode when the bit stream is contained in a leaky bucket model of parameters (R, B, F) and:
R=Rmax=the average bit rate of the stream.
The value of B is stored in the syntax parameter vbv_buffer_size using a special size unit (i.e., 16×1024 bit units).
The value of F/R is stored in the syntax element vbv_delay associated with the first video frame in the sequence using a special time unit (i.e., number of periods of a 90 kHz clock).
The decoder buffer fullness follows the following equations:
$$B_0 = F$$
$$B_{i+1} = B_i - b_i + R_{max}/M, \quad i = 0, 1, 2, \ldots \tag{2}$$
The encoder must ensure that Bi−bi is always greater than or equal to zero while Bi is always less than or equal to B. In other words, the encoder ensures that the decoder buffer does not underflow or overflow.
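The unit conversions and the CBR constraint of equation (2) can be sketched as follows. This is a simplified illustration, not a bit stream parser: the function names are hypothetical, the syntax values are assumed to be already extracted as integers, and only the two units stated above (16×1024 bits and a 90 kHz clock) are used.

```python
VBV_BUFFER_UNIT_BITS = 16 * 1024  # size unit of vbv_buffer_size, in bits
VBV_DELAY_CLOCK_HZ = 90_000       # vbv_delay counts periods of a 90 kHz clock


def vbv_cbr_parameters(vbv_buffer_size, vbv_delay, rate):
    """Convert syntax values to (B in bits, F in bits) for a CBR stream at rate R."""
    buffer_bits = vbv_buffer_size * VBV_BUFFER_UNIT_BITS
    # vbv_delay encodes F/R, so F = R * vbv_delay / 90000.
    initial_fullness = rate * vbv_delay / VBV_DELAY_CLOCK_HZ
    return buffer_bits, initial_fullness


def vbv_cbr_ok(frame_bits, rate, buffer_bits, initial_fullness, frame_rate):
    """Check equation (2): no underflow (B_i - b_i >= 0) and no overflow (B_i <= B)."""
    fullness = initial_fullness  # B_0 = F
    for bits in frame_bits:
        if fullness > buffer_bits or fullness - bits < 0:
            return False
        fullness = fullness - bits + rate / frame_rate  # no min: CBR never stalls
    return True


if __name__ == "__main__":
    b, f = vbv_cbr_parameters(vbv_buffer_size=2, vbv_delay=45000, rate=30000)
    print(b, f, vbv_cbr_ok([3000, 1000, 1000], 30000, b, f, 30))
```

Unlike equation (1), there is no min operator here, so the check must reject overflow explicitly.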
The VBV operates in VBR mode when the bit stream is constrained in a leaky bucket model of parameters (R, B, F) and:
R=Rmax=the peak or maximum rate. Rmax is higher than the average rate of the bit stream.
F=B, i.e., the buffer fills up initially.
The value of B is represented in the syntax parameter vbv_buffer_size, as in the CBR case.
The decoder buffer fullness follows the following equations:
$$B_0 = B$$
$$B_{i+1} = \min\left(B,\; B_i - b_i + R_{max}/M\right), \quad i = 0, 1, 2, \ldots \tag{3}$$
The encoder ensures that Bi−bi is always greater than or equal to zero. That is, the encoder must ensure that the decoder buffer does not underflow. However, in this VBR case the encoder does not need to ensure that the decoder buffer does not overflow. If the decoder buffer becomes full, then it is assumed that the encoder buffer is empty and hence no further bits are transmitted from the encoder buffer to the decoder buffer.
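The VBR behavior of equation (3), including the point at which a full decoder buffer stops receiving bits, can be traced with a short sketch. The function name vbr_fullness_trace is hypothetical, and signaling underflow by raising an exception is a choice made for this illustration only.

```python
def vbr_fullness_trace(frame_bits, rmax, buffer_size, frame_rate):
    """Return the fullness B_i just before each frame is removed, per equation (3).

    Raises ValueError if the decoder buffer would underflow.
    """
    trace = []
    fullness = float(buffer_size)  # B_0 = B: the buffer fills up initially
    for i, bits in enumerate(frame_bits):
        trace.append(fullness)
        if fullness - bits < 0:
            raise ValueError(f"decoder buffer underflow at frame {i}")
        # The min clamps at B: a full buffer means the encoder buffer is empty.
        fullness = min(float(buffer_size), fullness - bits + rmax / frame_rate)
    return trace


if __name__ == "__main__":
    print(vbr_fullness_trace([3000, 500, 500], 30000, 4000, 30))
    print(vbr_fullness_trace([500, 500], 30000, 4000, 30))
```

In the second call the buffer would reach 4500 bits without the clamp, so the trace stays at B, which is permitted in VBR mode.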
The VBR mode is useful for devices that can read data up to the peak rate Rmax. For example, a DVD includes VBR clips where Rmax is about 10 Mbits/sec, which corresponds to the maximum reading speed of the disk drive, even though the average rate of the DVD video stream is only about 4 Mbits/sec.
Referring to FIGS. 3A and 3B, plots of decoder buffer fullness for some bit streams operating in CBR and VBR modes, respectively, are shown.
Broadly speaking, the CBR mode can be considered a special case of VBR where Rmax happens to be the average rate of the clip.
H.263's Hypothetical Reference Decoder (HRD)
The hypothetical reference decoder for H.263 is similar to the CBR mode of MPEG's VBV previously discussed, except for the following:
The decoder inspects the buffer fullness at some time intervals and decodes a frame as soon as all the bits for the frame are available. This approach results in a couple of benefits: (a) the delay is minimized because F is usually just slightly larger than the number of bits for the first frame, and (b) if frame skipping is common, the decoder simply waits until the next available frame. The latter is enabled in the low-delay mode of MPEG's VBV as well.
The check for buffer overflow is done after the bits for a frame are removed from the buffer. This relaxes the constraint for sending large I frames once in a while, but there is a maximum value for the largest frame.
H.263's HRD can essentially be mapped to a type of low delay leaky bucket model.