The transmission of data is usually constrained by bandwidth and throughput limitations. One cannot send or receive an infinite amount of information in an infinitesimal amount of time. In order to maximize the amount and quality of information being transmitted, in some cases the information is compressed or coded for transmission and decompressed or decoded upon reception.
One area in which data compression is essential is in the transmission of video data. Ordinary text, unless voluminous, is easily and quickly transmitted. However, video data can include aspects of color, brightness, and often stereo audio information. A large amount of data is required to define even short video clips. The transmission and coding of such data must be as efficient as possible, i.e., it must require as little information as possible to be transmitted.
Video compression is a subset of the general technique of data compression, whereby a signal is squeezed or compressed into a smaller set of numbers. These numbers will then take up less space on a hard drive, or take less time to transmit over a network. Before the numbers are used again, a decompression algorithm is applied to expand the series of numbers to its original (or at least a similar) form.
Video compression utilizes the fact that the signal is known to originate as digitized video, in order to increase the compression ratio, or the amount of squeezing that can be applied to the series of numbers to be stored or transmitted. Algorithms that achieve significant compression of video and audio are considered lossy because they discard or lose some portion of the original information; the reconstructed number series does not exactly match the original. This is acceptable because the precision with which we perceive video and audio, compared to the resolution of the digitization process, is not perfect. While the video signal may become slightly distorted, it is still recognizable. The degree to which a compression algorithm faithfully reproduces the original signal with minimum distortion or loss is a measure of the success of the algorithm.
There are a number of good reasons to compress video and audio signals, including technical issues and cost of equipment. One overriding issue is the cost of transmitting data. As the Internet matures into the de facto data transport platform for the 21st century, analog media such as videotape, film, and broadcast will be supplanted by a digital media infrastructure built on the Internet and Internet-related technologies. This digital infrastructure will allow data to be transferred between any two computing machines on the planet, if so desired. However, the speed at which this data can be sent will depend on a number of factors. In the limiting case, copper wires laid down over a century ago and intended for analog voice communications are used with modem technology (modem stands for MOdulator/DEModulator) to transmit data at speeds as low as 9600 bits per second. Similar speeds are used to carry voice over wireless networks such as cellular. Recently, cable modem, DSL, and satellite technologies have brought six-figure data rates (100,000 to 1 million bits/second) to home users. For high-end applications, optical fiber enables data rates into the gigabit range (billions of bits per second) and beyond.
Whatever the data rate available for a given application, transmitting data costs money. At the present time, sending one megabyte (8 million bits) over the Internet usually costs anywhere from 5 cents at low volume down to as little as one cent at extremely high volume (this figure does not include the cost at the receiving end). Therefore, the cost of transporting a megabyte of data from one place to another is always more than a penny.
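The figures quoted above can be turned into a rough worked calculation. The following is a minimal sketch only: the function names and the 10-megabyte clip size are hypothetical, the megabyte is taken as 8 million bits as in the text, and the price is assumed to be a flat sender-side rate.

```python
# Illustrative arithmetic only. The rates and prices are the ones quoted in the
# text; the function names and the clip size are hypothetical.
BITS_PER_MEGABYTE = 8_000_000  # the text's convention: one megabyte = 8 million bits


def transmit_seconds(megabytes, bits_per_second):
    """Time needed to send the payload at a constant data rate."""
    return megabytes * BITS_PER_MEGABYTE / bits_per_second


def transmit_cost_cents(megabytes, cents_per_megabyte):
    """Sender-side cost at a flat per-megabyte price (receiving cost excluded)."""
    return megabytes * cents_per_megabyte


if __name__ == "__main__":
    clip_mb = 10  # hypothetical clip size in megabytes
    for rate in (9_600, 1_000_000):
        print(rate, "bits/s:", round(transmit_seconds(clip_mb, rate)), "seconds")
    for price in (1, 5):
        print(price, "cent(s)/MB:", transmit_cost_cents(clip_mb, price), "cents")
```

Both quantities scale linearly with the payload size, so any reduction achieved by compression reduces transmission time and cost in the same proportion.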
Much work has been done in the field of video data compression. Some of the features of video codecs in existence include Discrete Cosine Transform compression, entropy coding, and differential coding of motion vectors. Prior codecs also utilize reference frames so that if a data packet is lost or corrupted, the data can be retrieved by referring to a reference frame. All of these features and difficulties therewith will be discussed in greater detail below.
In DCT (Discrete Cosine Transform) based video compression systems, an 8 by 8 block of pixel or prediction error signal data is transformed into a set of 64 frequency coefficients (a DC value and 63 AC values), which are then quantized and converted into a set of tokens.
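The transform and quantization steps described above can be sketched as follows. This is an illustrative, unoptimized sketch rather than the method of any particular codec: it assumes the orthonormal form of the two-dimensional DCT-II and a single uniform quantizer step, whereas practical codecs use fast integer transforms and per-coefficient step sizes. The names `dct_2d` and `quantize` are hypothetical.

```python
import math

N = 8  # block size used in the text


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct, unoptimized form)."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, step):
    """Uniform quantization with one step size for every coefficient (an assumption)."""
    return [[int(round(c / step)) for c in row] for row in coeffs]


if __name__ == "__main__":
    flat = [[100] * N for _ in range(N)]   # a block with no detail at all
    q = quantize(dct_2d(flat), step=16)
    print("DC value:", q[0][0])
    print("non-zero AC values:", sum(1 for r in range(N) for c in range(N)
                                     if (r, c) != (0, 0) and q[r][c] != 0))
```

For a perfectly flat block all of the energy lands in the DC value at position (0, 0) and the 63 AC values quantize to zero, which is the extreme case of the behavior the text relies on.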
Typically the higher frequency AC coefficients are smaller in magnitude and hence less likely to be non-zero (i.e., more likely to be zero) following quantization. Consequently, prior to tokenization, the coefficients are often arranged in ascending order of frequency, starting with the lowest frequency coefficient (the DC value) and finishing with the highest frequency AC coefficient. This scan order, sometimes referred to as “zig-zag order”, tends to group together the non-zero values at the start and the zero values into runs at the end and by so doing facilitates more efficient compression.
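The effect of the scan order can be illustrated with a short sketch. It assumes the conventional JPEG-style zig-zag pattern, which walks the anti-diagonals of the block in alternating directions; the helper names are hypothetical, and the sample block is invented purely for illustration.

```python
def zigzag_order(n=8):
    """(row, col) positions of an n x n block in conventional zig-zag order."""
    positions = [(r, c) for r in range(n) for c in range(n)]
    # Walk anti-diagonals (constant r + c); the direction alternates per diagonal.
    return sorted(
        positions,
        key=lambda p: (p[0] + p[1], p[0] if (p[0] + p[1]) % 2 else -p[0]),
    )


def scan(block, order):
    """Read the block's values in the given scan order."""
    return [block[r][c] for r, c in order]


def trailing_zeros(values):
    """Length of the run of zeros at the end of the sequence."""
    count = 0
    for v in reversed(values):
        if v != 0:
            break
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical quantized block: only three low-frequency values survive.
    block = [[0] * 8 for _ in range(8)]
    block[0][0], block[0][1], block[1][0] = 50, -3, 2

    raster = [(r, c) for r in range(8) for c in range(8)]
    print("raster scan, trailing zeros: ", trailing_zeros(scan(block, raster)))
    print("zig-zag scan, trailing zeros:", trailing_zeros(scan(block, zigzag_order())))
```

In the zig-zag scan the three non-zero values occupy the first three positions, so the remaining 61 values form a single run of zeros; in a row-by-row scan the value at row 1 interrupts that run.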
However, this fixed scan order is seldom optimal. For example, when encoding interlaced video material, certain high frequency coefficients are much more prominent. This fact is reflected in the prior art, where there are examples of codecs (for example MPEG-2) that mandate an alternative scan order for use when coding interlaced video.
When optimizing a codec for a specific hardware device, it is important to make sure that full use is made of any facilities that the device may offer for performing multiple tasks in parallel and to limit the extent to which individual parts of the decode process become bottlenecks.
The instant invention's bitstream, in common with most other video codecs, can broadly speaking be described as comprising entropy coded tokens that can be divided into two main categories: predictor or P tokens and prediction error or E tokens. P tokens are tokens describing the method or mode used to code a block or region of an image and tokens describing motion between one frame and another. E tokens are used to code any residual error that results from an imperfect prediction.
Entropy coding is a process whereby the representation of a specific P or E token in the bitstream is optimized according to the frequency of that token in the bitstream or the likelihood that it will occur at a particular position. For example, a token that occurs very frequently will be represented using a smaller number of bits than a token that occurs infrequently.
Two of the most common entropy coding techniques are Huffman coding and arithmetic coding. In Huffman coding each token is represented by a variable length pattern of bits (or a code). Arithmetic coding is a more computationally complex technique but it removes the restriction of using a whole number of bits for each token. Using an arithmetic coder, it is perfectly possible to code a very common token at an average cost of 2% of a bit.
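The difference between the two techniques can be illustrated with a small sketch. It builds Huffman code lengths from token counts and compares them with the ideal cost of -log2(p) bits for a token of probability p, which is the cost an arithmetic coder can approach. The token names and counts are hypothetical, and the sketch computes code lengths only; it does not produce a bitstream.

```python
import heapq
import math


def huffman_code_lengths(freqs):
    """Return {token: code length in bits} for a Huffman code built from token counts."""
    if len(freqs) == 1:
        return {tok: 1 for tok in freqs}
    # Heap entries: (weight, tie-breaker, {token: depth so far}).
    heap = [(w, i, {tok: 0}) for i, (tok, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)   # the two least frequent subtrees
        w2, _, b = heapq.heappop(heap)
        merged = {tok: depth + 1 for tok, depth in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def ideal_bits(prob):
    """Information content of a token with the given probability, in bits."""
    return -math.log2(prob)


if __name__ == "__main__":
    counts = {"ZERO": 90, "ONE": 6, "TWO": 3, "EOB": 1}  # hypothetical token counts
    total = sum(counts.values())
    lengths = huffman_code_lengths(counts)
    for tok, n in counts.items():
        print(f"{tok}: Huffman {lengths[tok]} bit(s), ideal {ideal_bits(n / total):.3f} bits")
```

In this example the most common token accounts for 90% of the stream. A Huffman code cannot spend less than one whole bit on it, whereas its ideal cost is well under one bit, which is the gap that arithmetic coding is able to close.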
Many multimedia devices have a co-processor unit that is well suited to the task of entropy coding and a more versatile main processor. Consequently, for the purpose of parallelization, the process of encoding or decoding a bitstream is often divided into entropy related tasks and non entropy related tasks. However, for a given video clip, as the data rate increases, the number of tokens to encode/decode rises sharply and entropy coding may become a bottleneck.
With a conventional bitstream it is very difficult to re-distribute the computational load of entropy coding to eliminate this bottleneck. In particular, on the decode side, the tokens must normally be decoded one at a time and in the order in which they were encoded. It is also extremely difficult to mix methods of entropy encoding (for example Huffman and arithmetic coding) other than at the frame level.
By convention, most modern video codecs code the (x, y) components of a motion vector using a differential coding scheme. That is, each vector is coded relative to the previous vector. For example, consider two vectors (7,3) and (8,4). In this case the second vector would be encoded as (1,1), since (8,4) = (7+1, 3+1).
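The differential scheme can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that the first vector is coded relative to (0, 0); an actual codec may choose a different starting predictor.

```python
def encode_differential(vectors, start=(0, 0)):
    """Code each (x, y) vector as its difference from the previous vector."""
    prev = start
    deltas = []
    for x, y in vectors:
        deltas.append((x - prev[0], y - prev[1]))
        prev = (x, y)
    return deltas


def decode_differential(deltas, start=(0, 0)):
    """Rebuild the vectors by accumulating the differences in order."""
    prev = start
    vectors = []
    for dx, dy in deltas:
        prev = (prev[0] + dx, prev[1] + dy)
        vectors.append(prev)
    return vectors


if __name__ == "__main__":
    vectors = [(7, 3), (8, 4)]               # the example from the text
    deltas = encode_differential(vectors)
    print(deltas)                            # [(7, 3), (1, 1)]
    print(decode_differential(deltas))       # [(7, 3), (8, 4)]
```

When neighboring vectors are similar the differences are small and therefore cheap to entropy code. The decoder, however, must process the differences strictly in order, since each vector depends on the one before it.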
This scheme works well if most blocks or regions for which a motion vector is coded exhibit motion that is similar to that of their neighbors. This can often be shown to be the case, for example when panning. However, it works less well if the motion field is irregular or where there are frequent transitions between background and foreground regions which have different motion characteristics.
For most modern video codecs, motion prediction is an important part of the compression process. Motion prediction is a process whereby the motion of objects or regions of the image is modeled over one or more frames and one or more ‘motion vectors’ is transmitted in the bitstream to represent this motion. In most cases it is not possible to perfectly model the motion within an image, so it is necessary to code a residual error signal in addition to the motion information.
In essence, each motion vector points to a region in a previously encoded frame that is similar to the region in the current frame that is to be encoded. The residual error signal is obtained by subtracting the predicted value of each pixel from the actual value in the current frame.
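The relationship between the motion vector, the prediction, and the residual error signal can be sketched as follows. This is a simplified illustration: the function names are hypothetical, the motion vector is assumed to be a full-pixel (dx, dy) offset that stays inside the reference frame, and no bounds checking is performed.

```python
def predict_block(reference, top, left, mv, size):
    """Copy the size x size region of the reference frame that the vector points to."""
    dx, dy = mv
    return [[reference[top + dy + r][left + dx + c] for c in range(size)]
            for r in range(size)]


def residual(current_block, predicted_block):
    """Residual error: actual pixel value minus predicted value."""
    return [[a - p for a, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current_block, predicted_block)]


def reconstruct(predicted_block, residual_block):
    """Decoder side: prediction plus residual restores the block."""
    return [[p + e for p, e in zip(pred_row, res_row)]
            for pred_row, res_row in zip(predicted_block, residual_block)]


if __name__ == "__main__":
    # Hypothetical 4 x 4 reference frame and a 2 x 2 block of the current frame
    # located at row 1, column 1.
    reference = [[r * 4 + c for c in range(4)] for r in range(4)]
    current_block = [[4, 6], [8, 9]]
    mv = (-1, 0)  # points one pixel to the left in the reference frame

    predicted = predict_block(reference, top=1, left=1, mv=mv, size=2)
    error = residual(current_block, predicted)
    print(error)                                           # [[0, 1], [0, 0]]
    print(reconstruct(predicted, error) == current_block)  # True
```

A good prediction leaves a residual that is mostly zero, so the residual costs far fewer bits to code than the block itself.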
Many modern video codecs extend the process by providing support for prediction of motion to sub-pixel accuracy, e.g., half-pixel or quarter-pixel motion estimation. To create fractional pixel data points, it is necessary to use some form of interpolation function or filter applied to real (i.e., full-pixel aligned) data points.
Early codecs generally used simple bilinear interpolation as shown in Figure 1 attached hereto. In this example, A, B, C, and D are full-pixel aligned data points and x, y, and z are half-pixel aligned points. Point x is half-pixel aligned in the X direction and can be calculated using the equation:
x = (A + B) / 2.  (1)
Point y is half-pixel aligned in the Y direction and can be calculated using the equation:
y = (A + C) / 2.  (2)
Point z is half-pixel aligned in both X and Y and, being the average of the four surrounding full-pixel points, can be calculated using the equation:
z = (A + B + C + D) / 4.  (3)
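The bilinear case can be sketched directly. The sketch assumes the layout implied by the description, with A and B as horizontal neighbors and C and D directly below them, and it takes z as the mean of all four surrounding full-pixel points. Floating-point division is used for clarity; practical codecs typically use integer arithmetic with rounding. The function name is hypothetical.

```python
def half_pixel_bilinear(A, B, C, D):
    """Half-pixel points for the 2 x 2 neighborhood  A B / C D."""
    x = (A + B) / 2          # halfway between A and B (X direction)
    y = (A + C) / 2          # halfway between A and C (Y direction)
    z = (A + B + C + D) / 4  # center of the four points (both directions)
    return x, y, z


if __name__ == "__main__":
    print(half_pixel_bilinear(10, 20, 30, 40))  # (15.0, 20.0, 25.0)
```

Because each output is a plain average of its neighbors, the filter smooths fine detail, which is the blurring tendency that more complex interpolation filters seek to avoid.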
Later codecs have tended to move towards the use of more complex interpolation filters, such as bicubic filters, that are less inclined to blur the image. In the example shown in Figure 2, x is a half-pixel point that lies halfway between two full-pixel aligned points B and C. Using an integer approximation to a bicubic filter it can be calculated using the equation:
x = (−A + 9B + 9C − D) / 16.  (4)
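The four-tap filter can be sketched as follows. The tap weights (−1, 9, 9, −1)/16 come from the equation above; the rounding offset and the clamping to an 8-bit range are assumptions added for the sketch, since the negative taps allow the result to overshoot the valid pixel range. The function name is hypothetical.

```python
def half_pixel_bicubic(A, B, C, D, max_value=255):
    """Half-pixel point between B and C using the integer taps (-1, 9, 9, -1) / 16."""
    # The +8 rounds to nearest before the divide by 16 (an assumption of this sketch).
    value = (-A + 9 * B + 9 * C - D + 8) // 16
    # Negative taps can overshoot, so clamp to the valid pixel range (also an assumption).
    return max(0, min(max_value, value))


if __name__ == "__main__":
    print(half_pixel_bicubic(100, 100, 100, 100))  # 100: flat areas are preserved
    print(half_pixel_bicubic(0, 0, 255, 255))      # 128: midpoint of a sharp edge
    print(half_pixel_bicubic(0, 255, 255, 0))      # 255: overshoot clamped to the maximum
```

The last case shows the overshoot: the unclamped result exceeds 255 because the outer taps are negative. This sharpening behavior is what reduces blur relative to bilinear interpolation.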
Though filters such as the one illustrated above tend to produce sharper looking results, their repeated application over several frames can in some situations result in unpleasant artifacts such as false textures or false contouring.
When transmitting compressed video data over an unreliable or questionable data link, it is important that a mechanism exists for recovering when data is lost or corrupted, as video codecs are often extremely sensitive to errors in the bitstream.
Various techniques and protocols exist for the reliable transmission of data over such links, and these typically rely upon detection of the errors and either re-transmission or the use of additional data bits that allow certain types of error to be corrected. In many situations the existing techniques are adequate, but in the case of video conferencing over restricted bandwidth links neither of the above mentioned approaches is ideal. Re-transmission of lost data packets may not be practical because it is likely to cause an increased end-to-end lag, while the use of error correction bits or packets may not be acceptable in situations where bandwidth is already severely restricted.
An alternative approach is simply to detect the error at the decoder and report it to the encoder. The encoder can then transmit a recovery frame to the decoder. Note that this approach may not be appropriate if the error rate on the link is very high, e.g., more than one error in every 10-20 frames.
The simplest form of recovery frame is a key frame (or intra only frame). This is a frame that does not have any dependencies on previous frames or the data therein. The problem with key frames is that they are usually relatively large.