1. Field of the Invention
Embodiments of this invention relate generally to the implementation of a packet recovery mechanism for the robust transport of live or real-time media streams over packet-switched networks. Such media streams may consist of an audio and a video component or any combination of audio and video or other time-sensitive signals. The packet-switched network may include Internet connections and IP networks in general. More specifically, such embodiments relate to forward error correction (FEC) mechanisms optimized for robust, low-latency, and bandwidth-efficient transport of audio and video streams over packet-switched networks.
2. Description of the Related Art
Random congestion through packet-switched networks, such as the Internet, adds an unpredictable amount of jitter and packet loss to the transport of video and audio packet streams. Furthermore, the most efficient form of video compression, variable bit-rate (VBR) coding, produces large bursts of data that further add to network congestion, compounding potential router overflow and the resulting packet loss. Thus, the number of packets that a network might drop and the instantaneous packet rate may fluctuate greatly from one moment to the next.
In addition to contending with packet delivery problems, maintaining low latency is a critical constraint for video conferencing and other applications having interaction between the viewer and subject. Some examples of applications where low-latency is critical are: security, where an operator may desire to control the pan/tilt/zoom of a remote camera to follow suspicious activity; and telemedicine, to enable a doctor to remotely diagnose a patient.
Forward Error Correction (FEC) potentially provides a low-latency method for correcting packet loss. FEC adds a fixed percentage of additional packets, called checksum packets, to a block of data packets such that the loss of one or more data packets in the block, within some predetermined bound, can be recovered by combining the checksum packets with those data packets that had been successfully received in order to reconstruct the missing data packets.
Various forms of FEC have long been applied to digital audio-video streams, most notably for satellite transmission and most recently for Internet streaming, to help minimize the adverse impact of channel impairments on the audio-video signal. Advantages of FEC over other error correction mechanisms include scalability to large systems because of its inherent multicast compatibility, and the fact that latency and distance between source and destination do not have any intrinsic effect, since FEC does not require feedback.
Forward Error Correction
Variable packet loss rates and variable video bit rates, coupled with the need to minimize latency, present challenges to the implementation of FEC techniques for protecting packetized media streams. FEC augments a media stream with redundant data, called checksum packets, to help restore stream integrity based upon anticipated levels of packet loss. FEC groups data packets into an FEC block. The checksum packets generated from a given block are said to cover that block since missing data packets can be restored by combining the remaining checksum and data packets in that block. FEC coverage, the number of missing data packets that FEC can recover within the same block, is limited to the number of checksum packets within that block.
One of the strengths of FEC is that it has the potential to immediately reconstruct lost data upon receipt of the appropriate checksum packets, without the need to wait for retransmissions from the source. Thus for networks with long round-trip travel times, FEC may significantly reduce latency as compared with feedback-based error correction, such as Automatic Repeat reQuest (ARQ).
However, without knowledge of the actual packet loss at the receiver, a transmitter implementing FEC may not provide a sufficient number of checksum packets for packet recovery. Since FEC implementations generally transmit a fixed number of checksum packets, often calculated as a constant percentage of the number of data packets regardless of packet content, such FEC implementations cannot efficiently handle large instantaneous variations in packet loss rates. In such cases, either the FEC bandwidth overhead is excessive and inefficient, or else the FEC coverage is inadequate for complete packet recovery.
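The trade-off between overhead and coverage described above can be illustrated with a short calculation. The following is a minimal sketch, not part of any described embodiment, and it rests on two simplifying assumptions that the text does not make: packet losses are independent with a fixed probability, and the code is ideal in the sense that a block is recoverable whenever the number of lost packets does not exceed the number of checksum packets. The function name `block_failure_probability` is hypothetical.

```python
from math import comb


def block_failure_probability(n_data, n_checksum, loss_rate):
    """Probability that an FEC block cannot be fully recovered.

    Assumptions (illustrative only): independent packet loss and an
    ideal erasure code, so the block is recoverable exactly when the
    number of lost packets (data or checksum) is <= n_checksum.
    """
    total = n_data + n_checksum
    recoverable = sum(
        comb(total, lost) * loss_rate**lost * (1.0 - loss_rate) ** (total - lost)
        for lost in range(n_checksum + 1)
    )
    return 1.0 - recoverable


# Fixed 10% overhead (1 checksum per 10 data packets) at two loss rates,
# and the same loss rate with doubled overhead.
for checksums, loss in [(1, 0.01), (1, 0.05), (2, 0.05)]:
    failure = block_failure_probability(10, checksums, loss)
    print(f"checksums={checksums} loss={loss:.2f} failure={failure:.4f}")
```

Under these assumptions, a fixed overhead that is adequate at a low loss rate leaves a noticeable fraction of blocks unrecoverable when the loss rate rises, while the higher overhead that would cover the worse case is wasted bandwidth whenever the network is behaving well.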
Furthermore, without knowledge of the instantaneous bit rate, FEC processing at a receiver may wait an indeterminate amount of time for all checksum packets pertaining to a data block to arrive before recovering lost data from that data block. For example, if every 10 data packets generate one checksum packet to form an FEC block, and a network drops one data packet in delivering this block, then the receiver would have to wait for the arrival of the last packet of the block, the checksum packet, before it could recover the missing data packet. However, under VBR coding of the stream, the time required to receive the packets of one block can vary considerably. Waiting for the checksum packet to arrive delays the stream at the receiver and creates a burst of packets as the receiver accumulates each FEC block for FEC processing. Thus, the burstiness of FEC receiver processing adds jitter, and ultimately latency, to a recovered signal.
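The variable wait can be made concrete with a small timing sketch. It assumes, purely for illustration, a 30 frames-per-second stream in which all packets of a frame are emitted at the start of that frame's interval, one large 12-packet frame followed by single-packet frames, and an FEC block of 10 data packets. The function name `block_completion_times` and the frame sizes are hypothetical.

```python
FRAME_INTERVAL_MS = 1000 / 30  # about 33.3 ms between frames (assumed 30 fps)


def block_completion_times(frame_packet_counts, block_size):
    """Return the times (ms) at which each FEC block becomes complete.

    frame_packet_counts[i] is the number of data packets in frame i.
    Assumption: every packet of frame i is available at i * FRAME_INTERVAL_MS.
    """
    times = []
    pending = 0
    for index, count in enumerate(frame_packet_counts):
        pending += count
        while pending >= block_size:
            times.append(index * FRAME_INTERVAL_MS)
            pending -= block_size
    return times


# One 12-packet frame followed by 29 single-packet frames.
frames = [12] + [1] * 29
times = block_completion_times(frames, block_size=10)
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
print("block completion times (ms):", [round(t, 1) for t in times])
print("gaps between blocks (ms):", [round(g, 1) for g in gaps])
```

In this sketch the first block fills within a single frame period, whereas later blocks take roughly a third of a second to fill, so the moment at which the checksum packet can be produced or applied depends heavily on the instantaneous packet rate.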
FEC also generates burstiness and jitter at a transmitter. As mentioned, standard FEC implementations generate checksum packets as a fixed percentage of the number of outgoing data packets. These implementations wait for all packets in a data block to have been generated before creating and sending the associated checksum packets for the block. As with VBR streams at the receiver, a fixed-percentage FEC checksum generator at the transmitter would also have to wait an indeterminate amount of time for enough data packets to accumulate and fill the FEC block before the checksum packet generator could complete checksum generation for that block.
This variation in FEC processing delay produces jitter in the recovered data stream that must be smoothed out by adding input packet buffering at the receiver. However, such input buffering to recover from VBR FEC-induced jitter adds to the overall stream latency. Thus as a further limitation of fixed-percentage FEC, large variations in stream packet rate as a result of VBR encoding result in long system latency.
Even when the bit rate is constant, as with Constant Bit Rate (CBR) coding, conventional FEC can introduce large amounts of jitter and latency to a real-time audio or video stream. A common FEC technique for protecting against large contiguous burst drops is interleaving. An example of interleaving is found in the Pro-MPEG Forum's Code of Practice #3 standard for FEC for video over IP networks (Ref. #1). One implementation of interleaving writes packets sequentially along rows of a two-dimensional matrix. When a data packet fills the last data row of the matrix, the FEC engine computes a final checksum row, generating one parity packet for each column to fill the checksum row, and then sends the entire checksum row as a burst of parity packets. (Parity packets are computed by calculating the Exclusive-OR across corresponding bits of all packets of a block.)
At a transmitter, interleaving delays the generation of checksum packets until the interleaving matrix has been filled, at which point it creates a burst of checksum packets. The receiver inputs an incoming stream as blocks of data packets followed by this burst of checksum packets. At the receiver, interleaving introduces a processing delay equal to the time required to fill the receiver's entire matrix. The receiver waits for the last data and parity packet within a block to arrive before it applies the received parity packets to the received block of data packets to recover any missing packets. (If the last packet in a block was lost, then either a timeout, the appearance of a packet from a following block, or a combination of both may force immediate FEC processing for the current FEC block.) Thus, interleaving introduces processing jitter both at the transmitter and at the receiver as a result of the periodic processing time in waiting to fill the interleave matrix.
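The column-parity form of interleaving discussed above can be sketched as follows. This is a simplified illustration, not the Pro-MPEG Code of Practice #3 packet format: it assumes all packets have equal length, uses a small 3-row by 4-column matrix, and the helper names (`xor_packets`, `column_parity`, `recover_column`) are hypothetical.

```python
def xor_packets(packets):
    """Exclusive-OR corresponding bytes of equal-length packets."""
    result = bytearray(len(packets[0]))
    for packet in packets:
        for i, byte in enumerate(packet):
            result[i] ^= byte
    return bytes(result)


def column_parity(matrix):
    """Compute one parity packet per column of a row-filled matrix."""
    return [xor_packets(list(column)) for column in zip(*matrix)]


def recover_column(column, parity):
    """Rebuild the single missing packet (None) in a column."""
    missing = [i for i, packet in enumerate(column) if packet is None]
    if len(missing) != 1:
        raise ValueError("column parity can recover exactly one lost packet")
    present = [packet for packet in column if packet is not None]
    return xor_packets(present + [parity])


# Packets are written sequentially along rows: 3 rows of 4 packets each.
matrix = [[bytes([row * 4 + col] * 4) for col in range(4)] for row in range(3)]
parity_row = column_parity(matrix)

# Simulate a contiguous burst that drops the entire second row.
received = [list(row) for row in matrix]
received[1] = [None] * 4

for col in range(4):
    column = [received[row][col] for row in range(3)]
    received[1][col] = recover_column(column, parity_row[col])

print("burst recovered:", received[1] == matrix[1])
```

Because consecutive packets fall in different columns, a burst of up to one row's worth of packets leaves at most one gap per column, which a single parity packet per column can repair. The cost, as noted above, is that neither the parity row nor the recovery can be completed until the whole matrix has been filled.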
Video and Audio Stream Compression
The Moving Picture Experts Group (MPEG), a working group of the International Organization for Standardization (ISO), has defined a set of compression specifications for the efficient coding of audio and video digital streams. The generations of this video compression and encoding standard are known as MPEG-1, MPEG-2, and MPEG-4, with MPEG-4 being the latest member of this family of standards.
MPEG video compression encodes video as a sequence of two main types of frames: key interval snapshots, called I-Frames, and motion difference increments encoded in one of two ways, called B- or P-Frames, depending upon whether they encode motion differences using future and past frames (Bi-directional) or only past frames (Predictive). For efficient compression I-Frames typically occur infrequently, ranging from once every 500 milliseconds for professional broadcast applications to several seconds for Internet video conferencing applications, while a fixed pattern of B- and P-Frames (called Group of Pictures, or GOP) would fill the gap between I-Frames. This succession of I, B, and P video frames occurs at the video frame rate, typically having a constant 33.3 millisecond interval between frames. I-Frames are generally much larger than B- or P-Frames, often by orders of magnitude, as they have to encode all the detail of the basic compressed snapshot picture that the motion B- and P-Frames use as a base. For example, an I-Frame may typically consist of a dozen or more 1,500 byte Internet Protocol (IP) packets, while a typical B or P Frame often resides in a single packet or just a few packets.
The loss of a single packet of an I-Frame may invalidate an entire I-Frame, or at minimum result in severe macro blocking unless some form of error concealment is implemented. The invalidation of an entire I-Frame would be equivalent to the loss of all the packets comprising that I-Frame. Assuming that all packets have the same probability of being lost or corrupted during transport over a congested packet-switched medium, such as the Internet, the larger size of an I-Frame relative to B- and P-Frames makes I-Frames relatively more susceptible to loss or corruption. Conversely, since B- and P-Frames generally fit within relatively few packets, the probability of losing a B- or P-Frame is substantially lower. Even if the I-Frame were not completely invalidated, macro blocking may appear that could persist until the next I-Frame.
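The dependence of frame loss on frame size follows directly from the equal-probability assumption stated above. As a minimal sketch, if each packet is lost independently with probability p (independence is an added assumption), a frame spanning n packets is damaged with probability 1 - (1 - p)^n. The function name `frame_loss_probability` and the 1% loss rate used below are hypothetical.

```python
def frame_loss_probability(packets_in_frame, packet_loss_rate):
    """Probability that at least one packet of a frame is lost.

    Assumes each packet is lost independently with the same probability.
    """
    return 1.0 - (1.0 - packet_loss_rate) ** packets_in_frame


loss_rate = 0.01  # illustrative 1% packet loss
i_frame = frame_loss_probability(12, loss_rate)  # a dozen packets
bp_frame = frame_loss_probability(1, loss_rate)  # a single packet
print(f"I-Frame damaged:      {i_frame:.4f}")
print(f"B- or P-Frame damaged: {bp_frame:.4f}")
```

Under these assumptions a twelve-packet I-Frame is roughly eleven times more likely to be damaged than a single-packet B- or P-Frame at the same network loss rate.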
Furthermore, loss of a single I-Frame may disrupt a video stream for several seconds, until receipt of the next I-Frame. This occurs because the B- and P-Frames that follow an I-Frame must build upon that last I-Frame. In contrast, the loss of a single B-Frame may result in a disruption as short as a single frame period, about 33 milliseconds, since it may depend only on the P-Frames on either side of it. P-Frames rely only on the preceding I- or P-Frame. These facts further emphasize the importance of protecting I-Frames, relative to recovering lost B- or P-Frames.
Recent research in video forward error correction has validated the conclusion that I-Frames deserve the most FEC protection. A paper titled, “A Model for MPEG with Forward Error Correction and TCP Friendly Bandwidth,” published by the ACM in the NOSSDAV '03 Conference (Ref. #2), analyzed the effectiveness of varying the Group of Pictures (GOP) MPEG coding parameter and varying the number of FEC packets for each type of frame (I, P, & B) on the playable frame rate for the recovered video stream. They adopted an underlying constraint of an upper limit on throughput to make such video streams friendly to other network usage. Thus in their tests, increasing FEC overhead cut directly into bandwidth allotted for video, and therefore reduced the playable frame rate, just as lost packets would reduce the number of delivered frames and also lower playable frame rate at the receiver. Thus, they were able to compute an optimal level of FEC that maximized the playable received video frame rate under various packet loss levels.
After an exhaustive analysis of all reasonable combinations of GOP parameters and FEC overhead for the 3 frame types, they found that varying GOP had little effect on the playable frame rate. Not surprisingly, they also determined that FEC was most effective when I-Frames had the most FEC coverage, followed by the P-Frames. In their calculations of optimal FEC coverage for maximizing playable video frame rate, they provided no FEC coverage to B-Frames in simulations where the network packet loss rate was 5% and less, and only provided one FEC checksum packet for B-Frames at all higher network loss probabilities. In general, their FEC optimizations provided about half the FEC protection for P-Frames as provided for I-Frames.
Their analysis was not meant to provide, nor did it teach, an FEC implementation for general video streams, but rather to show that FEC can indeed improve the received playable video frame rate under the assumption of limited bandwidth. They only optimized FEC for a single high bit rate of video stream and allocated a fixed pattern of FEC coverage to that stream, based upon the ratio of I, P, and B-Frames of their high-bit-rate model stream. In fact, for video conferencing applications and other applications where a single packet may hold B or P Frames, their allocations would result in wasteful FEC allocations. For example, allocating one FEC packet for each B-Frame would result in 100% FEC overhead for B-Frames, even though, in their own analysis, the loss of B-Frames least affects the playable video rate.
Furthermore, their predetermination of FEC overhead would be extremely inefficient when using variable bit rate (VBR) video compression. As we previously mentioned, VBR produces the most efficient video compression, and is therefore the type of compression that all commercial DVDs use today. In VBR, the size of the various video frames changes significantly throughout the stream. Thus any pre-allocation of FEC for various compression frame types results in very inefficient and widely variable FEC coverage.
Neither this paper nor any other work with which we are familiar discusses FEC techniques that limit the latency under VBR streams, where the receiver would have to wait for a variable number of packets before it can apply FEC checksum packets to restore a stream. This work also does not address the addition of audio packets to the stream.
Audio uses a completely different encoding mechanism from video. For both video conferencing and video streaming applications, audio is often encoded with a high degree of compression. For speech, compressed bit rates typically range from 8,000 to 11,025 bits per second. Furthermore, audio packetized for IP networks often contains 1,000 to 1,500 bytes of compressed audio signal per packet. At a real-time streaming bit rate of 8,000 bps, the loss of a single audio packet represents a second or more of sound loss. MP3 compression of high-fidelity audio often produces audio streams as low as 56 Kbps. The loss of a single 1,500 byte MP3 audio packet would cause a playout gap of over 200 milliseconds. Thus, at these high compression rates and because of the relatively large amount of sound contained in each packet, even a single dropped audio packet can result in very pronounced audio disruptions at a receiver. Thus, like the loss of an I-Frame, the loss of even a single audio packet could be noticeable, and therefore audio packets also require a high degree of protection.
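The playout gaps quoted above follow from dividing the packet payload, in bits, by the stream bit rate. The short sketch below reproduces that arithmetic. The function name `playout_gap_ms` is hypothetical, and the calculation ignores packet headers, which is a simplification.

```python
def playout_gap_ms(packet_bytes, bit_rate_bps):
    """Duration of audio (in milliseconds) carried by one packet."""
    return packet_bytes * 8 * 1000 / bit_rate_bps


# Speech at 8,000 bps with 1,000 and 1,500 byte packets.
print(playout_gap_ms(1000, 8000))
print(playout_gap_ms(1500, 8000))

# MP3 at 56 Kbps with a 1,500 byte packet.
print(round(playout_gap_ms(1500, 56000), 1))
```

At 8,000 bps a single 1,000 to 1,500 byte packet carries between one and one and a half seconds of speech, and at 56 Kbps a 1,500 byte packet carries a little over 200 milliseconds of audio, which is consistent with the figures given above.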