The following invention relates to systems and methods for encoding and decoding data of all types, and more particularly to systems and methods for encoding and decoding data using chain reaction codes.
Transmission of data between a sender and a recipient over a communications channel has been the subject of much literature. Preferably, but not exclusively, a recipient desires to receive an exact copy of data transmitted over a channel by a sender with some level of certainty. Where the channel does not have perfect fidelity (which covers almost all physically realizable systems), one concern is how to deal with data lost or garbled in transmission. Lost data (erasures) are often easier to deal with than corrupted data (errors) because the recipient cannot always tell when corrupted data is data received in error. Many error-correcting codes have been developed to correct for erasures and/or for errors. Typically, the particular code used is chosen based on some information about the infidelities of the channel through which the data is being transmitted and the nature of the data being transmitted. For example, where the channel is known to have long periods of infidelity, a burst error code might be best suited for that application. Where only short, infrequent errors are expected, a simple parity code might be best.
Another consideration in selecting a code is the protocol used for transmission. In the case of the Internet, a packet protocol is used for data transport. That protocol is called the Internet Protocol or “IP” for short. When a file or other block of data is to be transmitted over an IP network, it is partitioned into equal-size input symbols and input symbols are placed into consecutive packets. The “size” of an input symbol can be measured in bits, whether or not the input symbol is actually broken into a bit stream, where an input symbol has a size of M bits when the input symbol is selected from an alphabet of 2^M symbols. In such a packet-based communication system, a packet-oriented coding scheme might be suitable.
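As a minimal sketch of the partitioning step, the following splits a block of bytes into equal-size input symbols. The function name `partition_into_symbols`, the choice to measure the symbol size in bytes, and the zero-padding of the final symbol are assumptions made for illustration and are not taken from the text; with this convention each symbol has a size of M = 8 * symbol_size bits.

```python
from typing import List


def partition_into_symbols(data: bytes, symbol_size: int) -> List[bytes]:
    """Split data into equal-size input symbols (hypothetical helper).

    The last symbol is zero-padded so that every symbol has exactly
    symbol_size bytes, i.e. M = 8 * symbol_size bits.
    """
    if symbol_size <= 0:
        raise ValueError("symbol_size must be positive")
    symbols = []
    for offset in range(0, len(data), symbol_size):
        chunk = data[offset:offset + symbol_size]
        # Pad a short final chunk with zero bytes (an assumed convention).
        symbols.append(chunk.ljust(symbol_size, b"\x00"))
    return symbols


input_symbols = partition_into_symbols(b"abcdefghij", 4)
```

A real system would also need to convey the original length so that the padding can be removed after decoding.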
A transmission is called reliable if it allows the intended recipient to recover an exact copy of the original file even in the face of erasures in the network. On the Internet, packet loss often occurs because sporadic congestion causes the buffering mechanism in a router to reach its capacity, forcing it to drop incoming packets. Protection against erasures during transport has been the subject of much study.
The Transmission Control Protocol (“TCP”) is a point-to-point packet control scheme in common use that has an acknowledgment mechanism. Using TCP, a sender transmits ordered packets and the recipient acknowledges receipt of each packet. If a packet is lost, no acknowledgment will be sent to the sender and the sender will resend the packet. With protocols such as TCP, the acknowledgment paradigm allows packets to be lost without total failure, since lost packets can just be retransmitted, either in response to a lack of acknowledgment or in response to an explicit request from the recipient.
Although acknowledgment-based protocols are generally suitable for many applications and are in fact widely used over the current Internet, they are inefficient, and sometimes completely infeasible, for certain applications as described in Luby I.
One solution that has been proposed to solve the transmission problem is to avoid the use of an acknowledgment-based protocol, and instead use Forward Error-Correction (FEC) codes, such as Reed-Solomon codes, Tornado codes, or chain reaction codes, to increase reliability. The basic idea is to send output symbols generated from the content instead of just the input symbols that constitute the content. Traditional erasure correcting codes, such as Reed-Solomon or Tornado codes, generate a fixed number of output symbols for a fixed length content. For example, for K input symbols, N output symbols might be generated. These N output symbols may comprise the K original input symbols and N-K redundant symbols. If storage permits, then the server can compute the set of output symbols for each content only once and transmit the output symbols using a carousel protocol.
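The idea of sending K original input symbols together with N-K redundant symbols can be illustrated with the simplest possible case, a single XOR parity symbol (N = K + 1). This toy code is not a Reed-Solomon or Tornado code and tolerates only one erasure; the function names are hypothetical and chosen for illustration.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length symbols."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_with_parity(input_symbols: List[bytes]) -> List[bytes]:
    """Return the K input symbols followed by one parity symbol (N = K + 1)."""
    parity = reduce(xor_bytes, input_symbols)
    return list(input_symbols) + [parity]


def recover_single_erasure(received: List[Optional[bytes]]) -> List[bytes]:
    """Recover the K input symbols when exactly one of the N symbols is None."""
    missing = [i for i, s in enumerate(received) if s is None]
    if len(missing) != 1:
        raise ValueError("this toy code handles exactly one erasure")
    present = [s for s in received if s is not None]
    repaired = list(received)
    # The XOR of all N symbols is zero, so the missing symbol equals
    # the XOR of the N - 1 symbols that did arrive.
    repaired[missing[0]] = reduce(xor_bytes, present)
    return repaired[:-1]  # drop the parity symbol


source = [b"ab", b"cd", b"ef"]
output_symbols = encode_with_parity(source)
output_symbols[1] = None  # simulate an erased packet
recovered = recover_single_erasure(output_symbols)
```

Because N is fixed before transmission, a second erasure cannot be repaired here, which is the kind of limitation that motivates codes whose redundancy is not fixed in advance.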
One problem with some FEC codes is that they require excessive computing power or memory to operate. Another problem is that the number of output symbols must be determined in advance of the coding process. This can lead to inefficiencies if the loss rate of packets is overestimated, and can lead to failure if the loss rate of packets is underestimated.
For traditional FEC codes, the number of possible output symbols that can be generated is of the same order of magnitude as the number of input symbols the content is partitioned into. Typically, but not exclusively, most or all of these output symbols are generated in a preprocessing step before the sending step. These output symbols have the property that all the input symbols can be regenerated from any subset of the output symbols equal in length to the original content or slightly longer in length than the original content.
“Chain Reaction Coding” as described in U.S. Pat. No. 6,307,487 entitled “Information Additive Code Generator and Decoder for Communication Systems” (hereinafter “Luby I”) and in U.S. patent application Ser. No. 10/032,156 entitled “Multi-Stage Code Generator and Decoder for Communication Systems” (hereinafter “Raptor”) represents a different form of forward error-correction that addresses the above issues. For chain reaction codes, the pool of possible output symbols that can be generated is orders of magnitude larger than the number of the input symbols, and a random output symbol from the pool of possibilities can be generated very quickly. For chain reaction codes, the output symbols can be generated on the fly on an as needed basis concurrent with the sending step. Chain reaction codes have the property that all input symbols of the content can be regenerated from any subset of a set of randomly generated output symbols slightly longer in length than the original content.
Other descriptions of various chain reaction coding systems can be found in documents such as U.S. patent application Ser. No. 09/668,452, filed Sep. 22, 2000 and entitled “On Demand Encoding With a Window” and U.S. patent application Ser. No. 09/691,735, filed Oct. 18, 2000 and entitled “Generating High Weight Output symbols Using a Basis.”
Some embodiments of a chain reaction coding system consist of an encoder and a decoder. Data may be presented to the encoder in the form of a block or a stream, and the encoder may generate output symbols from the block or the stream on the fly. In some embodiments, for example those described in Raptor, data may be pre-encoded off-line using a static encoder, and the output symbols may be generated from the plurality of the original data symbols and the static output symbols.
In some embodiments of a chain reaction coding system, the encoding and the decoding process rely on a weight table. The weight table describes a probability distribution on the set of source symbols. That is, for any number W between 1 and the number of source symbols, the weight table indicates a unique probability P(W). It is possible that P(W) is zero for substantially many values of W, in which case it may be desirable to include only those weights W for which P(W) is not zero.
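One way to represent such a weight table is to store only the weights W with nonzero P(W), together with their cumulative probabilities, and to look up a weight from a number u in [0, 1). The sketch below makes that assumption; the class name `WeightTable`, its method `weight_for`, and the example distribution are hypothetical and are not the distributions used in Luby I or Raptor.

```python
import bisect
from itertools import accumulate
from typing import Dict


class WeightTable:
    """Sparse weight table: keeps only weights W with P(W) > 0."""

    def __init__(self, probabilities: Dict[int, float]):
        items = sorted((w, p) for w, p in probabilities.items() if p > 0)
        total = sum(p for _, p in items)
        if abs(total - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")
        self.weights = [w for w, _ in items]
        # cumulative[i] = P(weights[0]) + ... + P(weights[i])
        self.cumulative = list(accumulate(p for _, p in items))

    def weight_for(self, u: float) -> int:
        """Map u in [0, 1) to a weight via the cumulative distribution."""
        index = bisect.bisect_right(self.cumulative, u)
        # Guard against rounding leaving the last cumulative value below 1.
        return self.weights[min(index, len(self.weights) - 1)]


table = WeightTable({1: 0.25, 2: 0.5, 3: 0.0, 4: 0.25})
```

With this example table, values of u below 0.25 map to weight 1, values from 0.25 up to 0.75 map to weight 2, and the rest map to weight 4; the zero-probability weight 3 is never stored.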
In some embodiments of a chain reaction coding system the output symbols are generated as follows: for every output symbol a key is randomly generated. Based on the key, a weight W is computed from the weight table. Then a random subset of W source symbols is chosen. The output symbol will then be the XOR of these source symbols. These source symbols are called the neighbors or associates of the output symbol hereinafter. Various modifications and extensions of this basic scheme are possible and have been discussed in the above-mentioned patents and patent applications.
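The generation steps just described (key, weight, random subset, XOR) can be sketched as follows. Using the key as the seed of Python's pseudo-random generator is an assumption made for illustration; actual embodiments define their own mapping from key to weight and neighbors. The function names and the example weight table are likewise hypothetical.

```python
import random
from typing import Dict, List, Tuple


def neighbors_for_key(key: int, num_source: int,
                      weight_table: Dict[int, float]) -> List[int]:
    """Derive the neighbor indices of an output symbol from its key."""
    rng = random.Random(key)  # assumed: the key seeds a PRNG
    weights = list(weight_table)
    probs = [weight_table[w] for w in weights]
    weight = rng.choices(weights, weights=probs, k=1)[0]
    weight = min(weight, num_source)  # cannot pick more neighbors than exist
    return sorted(rng.sample(range(num_source), weight))


def encode_output_symbol(key: int, source_symbols: List[bytes],
                         weight_table: Dict[int, float]) -> Tuple[bytes, List[int]]:
    """Return (output symbol, neighbors): the XOR of the chosen source symbols."""
    neighbors = neighbors_for_key(key, len(source_symbols), weight_table)
    value = bytes(len(source_symbols[0]))  # all-zero symbol
    for i in neighbors:
        value = bytes(a ^ b for a, b in zip(value, source_symbols[i]))
    return value, neighbors


source = [b"\x01\x02", b"\x10\x20", b"\xff\x00", b"\x0f\xf0"]
table = {1: 0.25, 2: 0.5, 4: 0.25}
value, neighbors = encode_output_symbol(7, source, table)
```

Because the neighbors depend only on the key, the number of source symbols, and the weight table, a receiver that knows these can recompute the same neighbor set by calling `neighbors_for_key` with the received key.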
Once an output symbol has been generated, it may be sent to the intended recipients along with its key, or an indication of how the key may be regenerated. In some embodiments, many output symbols may make up one transmission packet, as for example described in the U.S. patent application Ser. No. 09/792,364, filed Feb. 22, 2001 and entitled “Scheduling of multiple files for serving on a server.”
In certain applications it may be preferable to transmit the source symbols first, and then to continue transmission by sending output symbols. Such a coding system is referred to herein as a systematic coding system. On the receiving side, the receiver may try to receive as many original input symbols as possible, replace the input symbols not received by one or more output symbols and use them to recover the missing input symbols. The transmission of output symbols may be done proactively, without an explicit request of the receiver, or reactively, i.e., in response to an explicit request by the receiver. For example, for applications where no loss or a very small amount of loss is anticipated, it might be advantageous to send the original input symbols first, and to send additional output symbols only in case of erasures. This way, no decoding needs to be performed if there were no losses. As another application, consider the transmission of a live video stream to one or more recipients. Where there is expectation of some loss, it may be advantageous to protect the data using chain reaction coding. Because of the nature of a live transmission, the receiver may be able to buffer a specific part of the data only for at most a predetermined amount of time. If the number of symbols received after this amount of time is not sufficient for complete reconstruction of data, it may be advantageous in certain applications to forward the parts of the data received so far to the video player. In certain applications, and where appropriate source coding methods are used, the video player may be able to play back the data in a degraded quality. In general, where applications may be able to utilize even partially recovered data, it may be advantageous to use a systematic coding system.
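The systematic transmission order can be sketched as a sender that emits the source symbols first and then output symbols, and a receiver-side check that reports which source symbols are still missing. The tuple layout, the function names, and the `make_repair_symbol` callback are hypothetical; how the repair symbols are actually computed is left to the coding system.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Packet = Tuple[str, int, bytes]  # (kind, index or key, symbol)


def systematic_stream(source_symbols: List[bytes],
                      make_repair_symbol: Callable[[int], bytes],
                      num_repair: int) -> Iterator[Packet]:
    """Yield the source symbols first, then num_repair output symbols."""
    for index, symbol in enumerate(source_symbols):
        yield ("source", index, symbol)
    for key in range(num_repair):
        yield ("repair", key, make_repair_symbol(key))


def missing_source_indices(received: Iterable[Packet],
                           num_source: int) -> List[int]:
    """List the source symbols that did not arrive; empty means no decoding."""
    got = {index for kind, index, _ in received if kind == "source"}
    return [i for i in range(num_source) if i not in got]


source = [b"a", b"b", b"c"]
packets = list(systematic_stream(source, lambda key: b"r", 2))
lossy = [p for p in packets if p != ("source", 1, b"b")]  # simulate one loss
```

If `missing_source_indices` returns an empty list, the receiver can use the source symbols directly; otherwise only the listed positions need to be recovered from the output symbols.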
Straightforward modifications of embodiments of chain reaction coding systems as described in Luby I or Raptor to produce systematic coding systems generally lead to inefficiencies. For example, if in a chain reaction coding system the first transmitted symbols comprise the original symbols, then it may be necessary to receive a number of pure output symbols which is of the same order of magnitude as the original symbols in order to be able to recover the original data. In other words, reception of the original symbols may only minimally help the decoding process, so that the decoding process has to rely almost entirely on the other received symbols. This leads to an unnecessarily high reception overhead.
What is therefore needed is a systematic version of a chain reaction coding system that has efficient encoding and decoding algorithms and a reception overhead similar to that of a chain reaction coding system.