A video sequence consists of a series of still pictures or frames. Video compression methods are based on reducing the redundant and perceptually irrelevant parts of video sequences. The redundancy in video sequences can be categorised into spectral, spatial and temporal redundancy. Spectral redundancy refers to the similarity between the different colour components of the same picture. Spatial redundancy results from the similarity between neighbouring pixels in a picture. Temporal redundancy exists because objects appearing in a previous image are also likely to appear in the current image. Compression can be achieved by taking advantage of this temporal redundancy and predicting the current picture from another picture, termed an anchor or reference picture. In practice this is achieved by generating motion compensation data that describes the motion between the current picture and the previous picture.
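The generation of motion compensation data described above can be illustrated with a minimal sketch of exhaustive block matching. This is one common way of estimating motion and is an assumption here, not necessarily the method of any particular encoder. The function names, the 6x6 test pictures, and the use of the sum of absolute differences as the matching cost are all hypothetical choices made for illustration.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block of cur and a block of ref."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def best_motion_vector(cur, ref, cx, cy, size, search):
    """Exhaustively search ref for the displacement (dx, dy) with the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidate blocks that fall outside the reference picture.
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(cur, ref, cx, cy, rx, ry, size)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best


# Reference picture: a bright 2x2 object at column 1, row 1 on a dark background.
ref = [[0] * 6 for _ in range(6)]
for y in (1, 2):
    for x in (1, 2):
        ref[y][x] = 200

# Current picture: the same object moved two pixels right and one pixel down.
cur = [[0] * 6 for _ in range(6)]
for y in (2, 3):
    for x in (3, 4):
        cur[y][x] = 200

cost, dx, dy = best_motion_vector(cur, ref, cx=3, cy=2, size=2, search=2)
print(cost, dx, dy)  # 0 -2 -1: the block is found 2 left and 1 up in the reference
```

In this sketch the motion vector points from the block in the current picture to its best match in the reference picture, so an encoder would only need to transmit the vector rather than the pixel values of the block.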
However, sufficient compression cannot usually be achieved by only reducing the inherent redundancy of the sequence. Thus, video encoders also try to reduce the quality of those parts of the video sequence which are subjectively less important. In addition, the redundancy of the encoded bit-stream is reduced by means of efficient lossless coding of compression parameters and coefficients. The main technique is to use variable length codes.
Video compression methods typically differentiate between pictures that utilise temporal redundancy reduction and those that do not. Compressed pictures that do not utilise temporal redundancy reduction methods are usually called INTRA or I-frames or I-pictures. Temporally predicted images are usually forwardly predicted from a picture occurring before the current picture and are called INTER or P-frames. In the case of INTER frames, the predicted motion-compensated picture is rarely precise enough and therefore a spatially compressed prediction error frame is associated with each INTER frame. INTER pictures may contain INTRA-coded areas.
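The relationship between an INTER frame, its motion-compensated prediction, and its prediction error frame can be sketched as follows. The pixel values and function names are hypothetical, and the sketch leaves the prediction error uncompressed. A real encoder compresses the error frame spatially, so the decoder's reconstruction would generally be approximate rather than exact.

```python
def prediction_error(current, prediction):
    """Per-pixel difference between the current picture and its prediction."""
    return [
        [c - p for c, p in zip(cur_row, pred_row)]
        for cur_row, pred_row in zip(current, prediction)
    ]


def reconstruct(prediction, error):
    """Decoder side: add the prediction error back onto the prediction."""
    return [
        [p + e for p, e in zip(pred_row, err_row)]
        for pred_row, err_row in zip(prediction, error)
    ]


current = [[52, 55], [61, 59]]      # hypothetical 2x2 area of the current picture
prediction = [[50, 55], [60, 62]]   # hypothetical motion-compensated prediction

error = prediction_error(current, prediction)
print(error)                           # [[2, 0], [1, -3]]
print(reconstruct(prediction, error))  # [[52, 55], [61, 59]]
```

The error values are small compared with the pixel values themselves, which is why coding the prediction error is normally cheaper than coding the picture directly.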
Many video compression schemes also use temporally bi-directionally predicted frames, which are commonly referred to as B-pictures or B-frames. B-pictures are inserted between anchor picture pairs of I- and/or P-frames and are predicted from either one or both of these anchor pictures. B-pictures normally yield increased compression compared with forward-predicted pictures. B-pictures are not used as anchor pictures, i.e., other pictures are not predicted from them. Therefore they can be discarded (intentionally or unintentionally) without impacting the picture quality of future pictures. Whilst B-pictures may improve compression performance compared with P-pictures, their generation requires greater computational complexity and memory usage, and they introduce additional delays. This may not be a problem for non-real-time applications such as video streaming but may cause problems in real-time applications such as video-conferencing.
A compressed video clip typically consists of a sequence of pictures, which can be roughly categorised into temporally independent INTRA pictures and temporally differentially coded INTER pictures. Since the compression efficiency in INTRA pictures is normally lower than in INTER pictures, INTRA pictures are used sparingly, especially in low bit-rate applications.
A video sequence may consist of a number of scenes or shots. The picture contents may be remarkably different from one scene to another, and therefore the first picture of a scene is typically INTRA-coded. There are frequent scene changes in television and film material, whereas scene cuts are relatively rare in video conferencing. In addition, INTRA pictures are typically inserted to stop temporal propagation of transmission errors in a reconstructed video signal and to provide random access points to a video bit-stream.
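The way a lost picture affects later pictures, and the way an INTRA picture stops that propagation, can be sketched with a simplified dependency model. The model is an assumption made for illustration: pictures are listed in decoding order, each P-picture references the most recently decoded anchor, and each B-picture references the two most recently decoded anchors. The function name and the example sequence are hypothetical.

```python
def affected_pictures(types, lost):
    """Return the indices (in decoding order) of pictures degraded by losing one picture.

    types: list of "I", "P" or "B" in decoding order.
    lost:  index of the picture that was lost.
    """
    damaged = set()
    anchors = []  # indices of the I- and P-pictures decoded so far
    for i, t in enumerate(types):
        if t == "I":
            refs = []             # INTRA: no temporal reference
        elif t == "P":
            refs = anchors[-1:]   # previous anchor
        else:
            refs = anchors[-2:]   # B-picture: the two most recent anchors
        if i == lost or any(r in damaged for r in refs):
            damaged.add(i)
        if t in ("I", "P"):
            anchors.append(i)
    return sorted(damaged)


sequence = ["I", "P", "B", "P", "B", "I", "B", "P"]

print(affected_pictures(sequence, lost=2))  # [2]: a lost B-picture harms only itself
print(affected_pictures(sequence, lost=1))  # [1, 2, 3, 4, 6]: the error spreads until the INTRA picture
```

In the second case the final P-picture at index 7 is unaffected because it is predicted from the INTRA picture at index 5, whereas the B-picture at index 6 still depends on a damaged anchor.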
Compressed video is easily corrupted by transmission errors, mainly for two reasons. Firstly, due to utilisation of temporal predictive differential coding (INTER frames), an error is propagated both spatially and temporally. In practice this means that, once an error occurs, it is easily visible to the human eye for a relatively long time. Transmissions at low bit-rates are especially susceptible, because they contain only a few INTRA-coded frames and temporal error propagation is therefore not stopped for some time. Secondly, the use of variable length codes increases the susceptibility to errors. When a bit error alters a codeword, the decoder loses codeword synchronisation and also decodes subsequent error-free codewords (comprising several bits) incorrectly until the next synchronisation (or start) code. A synchronisation code is a bit pattern which cannot be generated from any legal combination of other codewords; such codes are added to the bit-stream at intervals to enable re-synchronisation. In addition, errors occur when data is lost during transmission. For example, in video applications using the unreliable UDP transport protocol in IP networks, network elements may discard parts of the encoded video bit-stream.
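The effect of a single bit error on variable length codes can be sketched with a toy prefix code. The four-symbol code table below is hypothetical and is not taken from any video coding standard.

```python
# Hypothetical prefix-free variable length code: frequent symbols get short codewords.
CODE = {"A": "0", "B": "10", "C": "110", "D": "111"}
DECODE = {bits: symbol for symbol, bits in CODE.items()}


def encode(symbols):
    """Concatenate the codewords of the given symbols into one bit string."""
    return "".join(CODE[s] for s in symbols)


def decode(bits):
    """Read bits one at a time and emit a symbol whenever a codeword is complete."""
    out, word = [], ""
    for b in bits:
        word += b
        if word in DECODE:
            out.append(DECODE[word])
            word = ""
    return "".join(out)


def flip(bits, index):
    """Simulate a transmission error by inverting one bit."""
    flipped = "1" if bits[index] == "0" else "0"
    return bits[:index] + flipped + bits[index + 1:]


message = "BACDAB"
bits = encode(message)            # "100110111010"
corrupted = flip(bits, 2)         # "101110111010"

print(decode(bits))               # BACDAB
print(decode(corrupted))          # BDADAB
```

A single inverted bit changes the codeword boundaries, so the error-free bits that follow it are also decoded as the wrong symbol. In this toy code the decoder happens to realign after two symbols, but nothing guarantees this in general and the decoder cannot tell that it has happened, which is why explicit synchronisation codes are inserted.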
There are many ways for the receiver to address the corruption introduced in the transmission path. In general, on receipt of a signal, transmission errors are first detected and then corrected or concealed by the receiver. Error correction refers to the process of recovering the erroneous data perfectly as if no errors had been introduced in the first place. Error concealment refers to the process of concealing the effects of transmission errors so that they are hardly visible in the reconstructed video sequence. Typically some amount of redundancy is added by the source or transport coding in order to help error detection, correction and concealment.
There are numerous known concealment algorithms, a review of which is given by Y. Wang and Q.-F. Zhu in “Error Control and Concealment for Video Communication: A Review”, Proceedings of the IEEE, Vol. 86, No. 5, May 1998, pp. 974-997, and in an article by P. Salama, N. B. Shroff, and E. J. Delp entitled “Error Concealment in MPEG Video Streams over ATM Networks”, IEEE Journal on Selected Areas in Communications, Vol. 18, No. 6, June 2000.
Current video coding standards define a syntax for a self-sufficient video bit-stream. The most popular standards at the time of writing are International Telecommunication Union (ITU-T) Recommendation H.263, “Video coding for low bit rate communication”, February 1998; International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) 14496-2, “Generic Coding of Audio-Visual Objects. Part 2: Visual”, 1999 (known as MPEG-4); and ITU-T Recommendation H.262 (ISO/IEC 13818-2) (known as MPEG-2). These standards define a hierarchy for bit-streams and correspondingly for image sequences and images. Development of further video coding standards is still ongoing. In particular, standardisation efforts in the development of a long-term successor for H.263, known as H.26L, and further developments of MPEG video coding are now being conducted jointly under the auspices of a standardisation body known as the Joint Video Team (JVT) of ISO/IEC MPEG (Moving Picture Experts Group) and ITU-T VCEG (Video Coding Experts Group).
By default, these standards use the temporally previous anchor (I, EI, P, or EP) picture as a reference for temporal prediction. Generally, this information is not transmitted, i.e. the bit-stream does not contain information relating to the identity of the reference picture. Consequently, a decoder that receives an encoded video bit-stream has no means of detecting whether a reference picture has been lost. Many transport coders packetise video data in such a way that they associate a sequence number with each of the data packets they produce. However, this kind of sequence number is not related to the video bit-stream. For example, a section of a video bit-stream may contain the data for P-picture P1, B-picture B2, P-picture P3, and P-picture P4 captured (and to be displayed) in this order. However, this section of the video bit-stream would be compressed, transmitted, and decoded in the following order: P1, P3, B2, P4, since B2 requires both P1 and P3 before it can be encoded or decoded. If the transport coder packetises each picture into a single packet having a sequence number and the packet carrying B2 is lost, the receiver can detect loss of the data packet from the packet sequence numbers. However, the receiver has no means to detect if it has lost a motion compensation reference picture for P4 or if it has lost a B-picture, in which case it could continue decoding normally.
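The reordering of pictures and the ambiguity of transport-level sequence numbers can be sketched as follows. The function names and the second example stream are hypothetical, and the sketch assumes one picture per packet with sequence numbers starting at zero.

```python
def decoding_order(display_order):
    """Move each B-picture after the next anchor picture, which it needs as a reference."""
    out, pending_b = [], []
    for picture in display_order:
        if picture.startswith("B"):
            pending_b.append(picture)   # wait for the following anchor
        else:
            out.append(picture)
            out.extend(pending_b)
            pending_b = []
    return out + pending_b


def missing_sequence_numbers(received):
    """Detect lost packets from gaps in the received sequence numbers."""
    expected = range(min(received), max(received) + 1)
    return sorted(set(expected) - set(received))


stream_a = decoding_order(["P1", "B2", "P3", "P4"])   # ['P1', 'P3', 'B2', 'P4']
stream_b = decoding_order(["P1", "P2", "P3", "P4"])   # ['P1', 'P2', 'P3', 'P4']

# In both cases the packet with sequence number 2 is lost in transit.
received = [0, 1, 3]
lost = missing_sequence_numbers(received)             # [2]

print(stream_a[lost[0]])  # B2: harmless, decoding of P4 could continue
print(stream_b[lost[0]])  # P3: the reference picture for P4 is gone
```

The receiver observes the same gap in both cases. Because the sequence number carries no information about the picture type or about which picture serves as a reference, the receiver cannot distinguish the harmless loss from the harmful one.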
The decoder therefore usually sends an INTRA request to the transmitter and freezes the picture on the display. However, the transmitter may not be able to respond to this request. For instance, in a non-real-time video streaming application, the transmitter cannot respond to an INTRA request from a decoder. Therefore the decoder freezes the picture until the next INTRA frame is received. In a real-time application such as video-conferencing, the transmitter may not be able to respond at all. For instance, in a multi-party conference, the encoder may not be able to respond to individual requests. Again the decoder freezes the picture until an INTRA frame is output by the transmitter.