The present invention relates to the field of video decoding. In particular, the present invention relates to the field of decoding video sequences transmitted over erasure-prone communication networks.
In 3rd Generation Partnership Project (3GPP) Packet Switched Conversational (PSC) services, as well as in many other packet-based video transmission systems, e.g. those complying with ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) Recommendation H.323 or the Session Initiation Protocol (SIP; see RFC 3261, published by the Network Working Group of the Internet Engineering Task Force (IETF), which is part of the Internet Society (ISOC)), compressed (that is, encoded) video is conveyed over an IP/UDP/RTP (Internet Protocol/User Datagram Protocol/Real-time Transport Protocol) transport environment. In this environment, RTP packets can get lost in transmission, especially when transmitted over erasure-prone communication networks such as cellular networks for public land mobile communication services. A single RTP packet carries part of an encoded video frame, one complete encoded video frame, or a multitude of complete encoded video frames. It should be noted that a frame can comprise either all pixels of a picture or only a subset known as a field. Hence, a picture comprises zero, one or two frames. The term “picture” will be used for all pixels to be reproduced at the same time, whereas the term “coded picture” shall be used for the compressed representation of a picture.
Encoded video is vulnerable to transmission errors and erasures. Since all modern video codecs are based on temporal prediction, missing information in the bit-stream leads to annoying artifacts not only in the reconstructed frame in which the error occurred, but also in the following reconstructed frames, which may be predicted from one or more earlier frames. In the absence of error correction mechanisms, and depending on the content, the deterioration in the frames may be amplified within a few seconds to a point where the reconstructed video is no longer useful.
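The propagation of a single loss through temporally predicted frames can be illustrated with a toy model. The following Python sketch is illustrative only and rests on simplifying assumptions that are not part of the foregoing description: each "P" frame is assumed to be predicted solely from the frame immediately before it, an "I" frame is assumed to be decodable without reference to earlier frames, and the function name `corrupted_frames` is hypothetical.

```python
def corrupted_frames(frame_types, lost_index):
    """Return the indices of frames damaged when frame `lost_index` is lost.

    Toy model (assumption): every 'P' frame is predicted only from the frame
    immediately before it, and an 'I' frame needs no earlier frame, so it
    stops the propagation of the damage.
    """
    damaged = []
    for i in range(lost_index, len(frame_types)):
        if i > lost_index and frame_types[i] == "I":
            break  # intra-coded frame: the prediction chain restarts here
        damaged.append(i)
    return damaged


frames = ["I", "P", "P", "P", "P", "I", "P", "P"]
print(corrupted_frames(frames, 2))  # prints [2, 3, 4]
```

In this model, the damage caused by losing frame 2 persists until the next intra-coded frame, which is why the spacing of refresh points determines how long an uncorrected loss remains visible.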
To combat this situation, many tools have been developed, which can be divided into three categories.
The first category aims at making the bit-stream itself less vulnerable by inserting redundant information. Examples of these source-coding-based tools include segmentation, instantaneous decoding refresh (IDR) macro-block (MB) refresh (known as intra macro-block refresh in older video compression standards), IDR frame/picture refresh (known as intra picture refresh in older video compression standards), flexible macro-block (MB) ordering, sub-sequences, and others (see for example Wang, Wenger et al., “Error Resilient Video Coding Techniques”, IEEE Signal Processing Magazine, Vol. 17, No. 4, July 2000, ISSN: 1053-5888). The key aspect of these source-coding mechanisms is that they do not add significant delay. However, they are not very efficient from a compression and bandwidth-usage point of view, which is an especially critical aspect when considering transmissions over mobile networks, whose physical resources are limited and shared among several communication service subscribers.
The second category employs application layer feedback to inform the sender (the transmitter or network source) about losses perceived by the receiver. The sender may be instructed, for instance, to react by packet retransmission, frame retransmission, reference frame selection, intra/IDR coding of known-as-corrupt areas at the time the video encoder learns about the loss situation at the receiver side (a technology known as error tracking), and other means. The use of feedback-based reliable transport protocols, e.g. TCP, may also fall into this category. Feedback-based mechanisms have the general disadvantage of requiring feedback channels, and hence are not applicable to some scenarios, e.g. unicast or highly asymmetric links and (point-to-)multipoint or broadcast multipoint communication. Furthermore, depending on the round-trip delay, many feedback-based mechanisms add too much delay for conversational applications. See, for example, Wang and Wenger for a more detailed discussion of feedback-based mechanisms.
The third category comprises mechanisms to reduce the erasure rate as perceived by the media receiver in the transport. Commonly used here are various forms of forward error correction (FEC), e.g. Audio Redundancy Coding (RFC 2198, published by the Network Working Group, IETF/ISOC) and packet-based forward error correction, an implementation of which is disclosed, for instance, in U.S. Pat. No. 6,141,788 by Rosenberg J. D. et al. and published as RFC 2733 (published by the Network Working Group, IETF/ISOC). A different scheme that is targeted towards conversational video communication is disclosed in U.S. Pat. No. 6,421,387 by Rhee I., who proposes a new forward error correction (FEC) technique based on an error recovery scheme called Recovery from Error Spread using Continuous Updates (RESCU). Yet another, more sophisticated FEC mechanism is part of the 3GPP Technical Specification TS 26.346, “Multimedia Broadcast/Multicast Service (MBMS); Protocols and codecs (Release 6)”, issued by the 3rd Generation Partnership Project (3GPP). Packet-based forward error correction works by generating one or more repair packets from a number of source packets, called a source block. Many algorithms have been studied in this field, from simple XOR, through Reed-Solomon, to modern complex codes. At the receiver side, the repair packets allow the reconstruction of missing source packets. It should be mentioned that the larger the FEC block is chosen, the more efficient FEC becomes and the better it can be adjusted to the actual error rates. Large FEC source blocks (encompassing data of many video packets and requiring many hundreds of milliseconds or even seconds to transmit) are also beneficial for overcoming bursts of packet losses, which are common in wireless networks.
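The simplest of the algorithms mentioned, XOR, can be sketched as follows. This Python sketch is illustrative only and is not the scheme of any of the cited documents: the function names `xor_repair_packet` and `recover_missing` are hypothetical, a single repair packet per source block is assumed, and packets are assumed to be zero-padded to the length of the longest packet in the block.

```python
def xor_repair_packet(source_block):
    """Build one repair packet as the byte-wise XOR of all source packets.

    Assumption: shorter packets are treated as zero-padded to the length of
    the longest packet in the block.
    """
    size = max(len(p) for p in source_block)
    repair = bytearray(size)
    for packet in source_block:
        for i, byte in enumerate(packet):
            repair[i] ^= byte
    return bytes(repair)


def recover_missing(received, repair):
    """Reconstruct the single missing packet (marked None) in `received`.

    XOR-ing the repair packet with every packet that did arrive leaves the
    missing packet, zero-padded to the repair-packet length.
    """
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("a single XOR repair packet recovers exactly one loss")
    present = [p for p in received if p is not None]
    return missing[0], xor_repair_packet(present + [repair])


block = [b"frame-0", b"frame-1!", b"frame-2"]
repair = xor_repair_packet(block)
index, data = recover_missing([block[0], None, block[2]], repair)
print(index, data)  # prints 1 b'frame-1!'
```

Because of the zero padding, a recovered packet that was shorter than the longest packet in its block carries trailing zero bytes, so a practical scheme must also convey the original packet length. The sketch further shows why recovery cannot start before the rest of the source block and the repair packet have arrived.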
As a general rule, transport and feedback-based repair mechanisms are more efficient than bit-stream-based repair mechanisms. The precise operation point varies with the protection mechanism, the content, the required quality, and the compression mechanism. However, as a rough estimate, to combat a 10% loss rate a typical FEC-based mechanism requires perhaps less than 15% additional bit rate (including overhead), whereas a source-coding-based mechanism could require at least 50% additional bit rate. On the other hand, the source-coding-based mechanisms are essentially neutral to the delay, whereas an FEC mechanism with 15% overhead (as assumed above) adds a delay of at least 7 frames/pictures, assuming a one-frame-one-packet strategy (which is common in 3GPP Packet Switched Conversational services). Such an added delay is unacceptable from an application point of view, especially when considering conversational applications.
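One plausible way to arrive at the figure of 7 frames is sketched below in Python. The sketch rests on assumptions not stated above: a single repair packet per source block, repair and source packets of equal size, and a hypothetical frame rate of 15 frames per second. The function names `min_source_block` and `block_duration_ms` are likewise hypothetical. Under these assumptions the overhead is 1/k for a source block of k packets, so an overhead of at most 15% requires k to be at least 7, and with one frame per packet the receiver has to wait for the whole block before a lost packet can be repaired.

```python
import math


def min_source_block(overhead, repair_packets=1):
    """Smallest source block size k such that repair_packets / k <= overhead."""
    return math.ceil(repair_packets / overhead)


def block_duration_ms(source_packets, frame_rate):
    """Time needed to send `source_packets` frames, one frame per packet."""
    return source_packets / frame_rate * 1000.0


k = min_source_block(0.15)            # 1 / 0.15 = 6.67, rounded up to 7
delay = block_duration_ms(k, 15.0)    # 15 frames per second is an assumption
print(k, round(delay))
```

At the assumed frame rate, a block of 7 frames already spans several hundred milliseconds, which illustrates why such a delay is problematic for conversational use.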