This invention relates to distribution of data files and other data objects using IP multicast techniques in conjunction with forward error correction and data carousel techniques. In particular, the invention relates to methods of receiving, buffering, and decoding data objects distributed in this manner.
The existence and popularity of the Internet has created a new medium for software distribution. As this distribution method becomes more widely used, it will place more and more demands on Internet bandwidth. Thus, it will be important to distribute files and other data objects as efficiently as possible.
Currently, data objects are distributed to individual network clients upon request. When a data object is requested, it is packaged in a plurality of IP (Internet Protocol) packets and transmitted to the requesting client. If another client requests the same data object, the IP packets are re-transmitted to that client. Thus, each request results in a full re-transmission of the entire data object over the network.
This type of data distribution is very inefficient. The inefficiencies become serious in certain situations where there is a rush to obtain a particular data object that has only recently become available. This situation has been dubbed the Midnight Madness problem because the mad dash for files often takes place late at night or in the early morning when files are first made available. Spikes in Internet activity have been caused by a range of phenomena: popular product releases; important software updates; security bug fixes; the NASA Pathfinder vehicle landing on Mars; the Kasparov vs. Deep Blue chess match; and the Starr report. The danger of such traffic spikes lies not in the data type, but rather in the data distribution mechanism.
The Midnight Madness problem is caused by the Internet's current unicast "pull" model. A TCP (Transmission Control Protocol) connection is established between a single sender and each receiver, then the sender transmits a full copy of the data once over each connection. The sender must send each packet many times, and each copy must traverse many of the same network links. Naturally, the sender and links closest to the sender can become heavily saturated. Nonetheless, such a transmission can create bottlenecks anywhere in the network where over-subscription occurs. Furthermore, congestion may be compounded by long data transfers, either because of large files or slow links.
These problems can be alleviated through the use of IP multicast protocols. IP multicast is a method of distributing data in which the data is sent once from a data server and routed simultaneously to all requesting clients. Using this method, the sender sends each packet only once, and the data traverses each network link only once. Multicast has been commonly used for so-called "streaming" data such as data representing audio or video. Typically, multicast is used to transmit live events such as news conferences or audio from broadcast radio stations.
FIG. 1 shows a network system utilizing IP multicasting. The system includes a data server 10 and a plurality of clients 12 and 13. The system also includes a plurality of routers 14 that route data along different communications links to the receiving clients. In this case, only the five clients referenced by numeral 12 have requested the data stream, while the clients referenced by numeral 13 have not requested the data stream. The data stream is forwarded to the requesting clients 12, as indicated by the shaded arrows. However, the data stream is not forwarded to non-requesting clients 13, thus preserving bandwidth on the links to those clients.
IP multicast provides a powerful and efficient means to transmit data to multiple parties. However, IP multicast is problematic for transfers of data objects which must be transmitted reliably, such as files. IP multicast provides a datagram service: "best-effort" packet delivery. It does not guarantee that packets sent will be received, nor does it ensure that packets will arrive in the order they were sent.
Many reliable file transfer protocols have been built on top of multicast. However, since scalability was not a primary concern for most of these protocols, they are not useful for the Midnight Madness problem. The primary barrier to scalability is that most of these protocols require feedback from the receivers in the form of acknowledgements (ACKs) or negative acknowledgements (NACKs). If many receivers generate feedback, they may overload the source or intermediate data links with these acknowledgements.
A so-called data carousel protocol can be used to provide scalable file distribution using multicast protocols. A data carousel is a simple protocol that avoids feedback from receivers. Using this protocol, a data server repeatedly sends the same data file using IP multicast. If a receiver does not correctly receive part of the file, the receiver simply waits for that portion of the file to be transmitted again.
Although a data carousel is workable, it often imposes a significant delay as the receiver waits for the next iteration of the file transmission. Forward Error Correction (FEC) can be utilized in conjunction with a data carousel to reduce the re-transmission wait time. Using FEC, error correction packets are included in the data stream. The error correction packets allow reconstruction of lost packets without requiring a wait for the next file transmission.
Using IP multicast, corrupted packets are automatically detected (using checksums) and discarded by the IP protocol. Accordingly, it is only necessary to replace lost packets. Therefore, the FEC protocol described herein deals only with erasure correction rather than with error correction, even though the broader terms "error correction" and "FEC" are used throughout the description.
Using forward error correction, a data object is broken into data blocks for transmission in respective IP packets. Assuming that there are k source blocks, these source blocks are encoded into n erasure-encoded blocks of the same size, where n > k, in a way that allows the original k source blocks to be reconstructed from any k of the erasure-encoded blocks. This is referred to as (n,k) encoding. Many (n,k) encoding techniques are based on Reed-Solomon codes and are efficient enough to be used by personal computers. See Rizzo, L., and Vicisano, L., "Effective Erasure Codes for Reliable Computer Communication Protocols", ACM SIGCOMM Computer Communication Review, Vol. 27, No. 2, pp. 24-36, April 1997, and Rizzo, L., and Vicisano, L., "Reliable Multicast Data Distribution Protocol-Based on Software FEC Techniques", Proceedings of the Fourth IEEE Workshop on the Architecture and Implementation of High Performance Communication Systems, HPCS'97, Chalkidiki, Greece, June 1997, for examples of an (n,k) encoding method. So-called Tornado codes are viable alternatives to Reed-Solomon codes.
It is desirable in many situations to utilize systematic (n,k) encoding, in which the first k of the n encoded blocks are the original data blocks themselves. If no blocks are lost during transmission, a receiver does not incur any processing overhead when decoding the k blocks of a systematic code. The methods described herein work with, but do not require, systematic encoding.
FIG. 2 shows how this scheme works. A data file in this example contains k blocks, indicated by reference numeral 20. These k blocks are encoded in a step 21 using a Reed-Solomon encoding algorithm, resulting in n erasure-encoded blocks 22, which are sent repeatedly in a step 23 using IP multicast. Each of the n erasure-encoded blocks is the same size as one of the original k blocks. The receiver waits until it has received any k of the erasure-encoded blocks (indicated by reference numeral 24), and then decodes them in a step 25 to recreate the original k source blocks 26.
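The encode and decode steps of FIG. 2 can be illustrated with a minimal sketch of a systematic (n,k) erasure code. It assumes a toy construction based on polynomial interpolation over the integers modulo 257, in the spirit of Reed-Solomon codes, rather than the optimized codes cited above. Practical implementations typically work in a field of size 2^8 so that every encoded symbol fits in one byte, which this sketch does not guarantee. The names `interpolate`, `encode`, and `decode` are hypothetical, and blocks are represented as lists of integers in the range 0 to 255.

```python
P = 257  # smallest prime above the byte range; all arithmetic is modulo P


def interpolate(points, x):
    """Value at x of the unique polynomial of degree < k through k (xi, yi) points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(source_blocks, n):
    """Systematic (n, k) encoding: encoded blocks 0..k-1 equal the source blocks."""
    k = len(source_blocks)
    size = len(source_blocks[0])
    return [[interpolate([(i, source_blocks[i][pos]) for i in range(k)], x)
             for pos in range(size)]
            for x in range(n)]


def decode(received, k):
    """Rebuild the k source blocks from any k distinct (index, block) pairs."""
    pairs = received[:k]
    size = len(pairs[0][1])
    return [[interpolate([(idx, blk[pos]) for idx, blk in pairs], i)
             for pos in range(size)]
            for i in range(k)]
```

Because the code is systematic, a receiver that obtains blocks 0 through k-1 intact can skip `decode` entirely. Any other combination of k distinct blocks requires the interpolation step.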
In practice, k and n are limited when using Reed-Solomon-based codes, because encoding and decoding with large values becomes prohibitively complex. Typical limits are k=64 and n=255.
Because most files are larger than k blocks (assuming k has been limited to some pre-defined maximum), such files are broken into erasure correction (EC) groups, each group representing k blocks of the original data file. Erasure correction is performed independently for each group. Thus, the k blocks of each group are encoded into n erasure-encoded blocks. Each erasure-encoded block is identified by an index relative to its group, specifying which of the n encoded blocks it is, as well as a group identifier associating it with a particular EC group. The index and group identifier are packaged with the block in a header that is prepended to the data itself. The data and header are packaged in an IP packet and transmitted using the multicast and data carousel techniques already described.
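The grouping and header packaging can be sketched as follows. The header layout used here (a 32-bit group identifier followed by an 8-bit block index, in network byte order) is an assumption made for illustration and is not specified by the description. The helper names are likewise hypothetical.

```python
import struct

# Hypothetical header layout: 32-bit group identifier, 8-bit block index.
HEADER = struct.Struct("!IB")


def split_into_groups(data, block_size, k):
    """Cut data into fixed-size blocks (zero-padding the last), then into groups of k."""
    blocks = [data[i:i + block_size].ljust(block_size, b"\0")
              for i in range(0, len(data), block_size)]
    return [blocks[i:i + k] for i in range(0, len(blocks), k)]


def pack_block(group_id, index, payload):
    """Prepend the header to one erasure-encoded block."""
    return HEADER.pack(group_id, index) + payload


def unpack_block(packet):
    """Split a received packet back into (group_id, index, payload)."""
    group_id, index = HEADER.unpack_from(packet)
    return group_id, index, packet[HEADER.size:]
```

In this sketch the final group may hold fewer than k blocks. A real implementation would have to pad that group or signal its size to the receiver.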
When using EC groups in this manner, the order of block transmission affects the time required to reconstruct a data object. Suppose, for example, that all n erasure-encoded blocks are sent from one group before sending any from the next group. Receivers with few losses are forced to receive more blocks than they actually need. To avoid this, the data server sends the first block (having index=1) from every group, then the next block (having index=2) from every group, and so on.
This is illustrated in FIG. 3, in which each group 30 is shown as a row of erasure-encoded blocks 32. The arrows show the order of block transmission, from left to right. Upon transmission of block n of the last group, transmission begins again with the first block of the first group.
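The transmission order of FIG. 3 can be expressed as a short generator. This is a sketch only: groups are numbered from 0 and block indices from 1, and the name `carousel_order` is hypothetical.

```python
from itertools import islice


def carousel_order(num_groups, n):
    """Yield (group, index) pairs forever: index 1 of every group, then index 2, ..."""
    while True:
        for index in range(1, n + 1):
            for group in range(num_groups):
                yield group, index


# First seven transmissions for 2 groups of n = 3 encoded blocks each;
# the seventh pair shows the carousel wrapping around.
first = list(islice(carousel_order(2, 3), 7))
```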
To complete the reception, a receiver must receive k distinct erasure-encoded blocks (i.e., with different index values) from each group. For some groups, more than k blocks may be received, in which case the redundant blocks are discarded. These redundant blocks are a source of inefficiency, as they increase the overall reception time. Supposing that only one additional block is needed to complete the reception, it is possible that a receiver may have to wait an entire cycle of G blocks (receiving blocks from all other groups) before obtaining another block from the desired group. Thus, the inefficiency is related to the number of groups G, which is equal to the number of blocks in the file divided by k.
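The receiver-side bookkeeping can be sketched as below. The class and function names are hypothetical, and the sketch assumes every group carries exactly k source blocks, so that G is the block count divided by k and rounded up.

```python
import math


class ReceptionTracker:
    """Tracks which distinct block indices have arrived for each group."""

    def __init__(self, num_groups, k):
        self.k = k
        self.seen = [set() for _ in range(num_groups)]

    def accept(self, group, index):
        """Return True if the block is needed, False if it is redundant."""
        indices = self.seen[group]
        if len(indices) >= self.k or index in indices:
            return False
        indices.add(index)
        return True

    def complete(self):
        """Reception is complete once every group holds k distinct blocks."""
        return all(len(s) == self.k for s in self.seen)


def group_count(total_blocks, k):
    """Number of groups G for a file of total_blocks source blocks."""
    return math.ceil(total_blocks / k)
```

For example, a file of 1000 blocks with k=64 yields G=16, so a receiver missing a single block may wait for up to 16 block transmissions before another block from the needed group arrives.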
One danger with this transmission order is that a pattern of periodic network losses may become synchronized with the transmission so as to always impact blocks from certain groups; in the worst case, a single group is always impacted. One solution to this potential problem is to randomly permute the order of groups sent for each index value, thereby spreading periodic losses randomly among groups.
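A minimal sketch of the permuted order follows, assuming Python's standard pseudo-random shuffle as the source of randomness. The description does not specify how the permutation is generated, and the name `permuted_round` is hypothetical.

```python
import random


def permuted_round(num_groups, n, rng=random):
    """One carousel round; the group order is reshuffled for every index value."""
    order = []
    for index in range(1, n + 1):
        groups = list(range(num_groups))
        rng.shuffle(groups)  # a fresh random permutation for this index value
        order.extend((g, index) for g in groups)
    return order
```

Each index value still visits every group exactly once per round, so the expected reception time is unchanged. Only the position of each group within the pass varies.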
During the reception process, a client buffers incoming blocks as they are received. If enough RAM is available, the blocks are received, sorted, and decoded in main memory before being written to disk. For larger files, a client simply writes all blocks to disk in the order they are received, discarding any blocks over k that are received for a particular group. When reception is complete (i.e., k blocks have been received for each group), the blocks are sorted into groups and then decoded. This method of writing to disk imposes a delay as the file is sorted and decoded. This delay can be minimized to some extent by partial sorting of the blocks before writing them to disk. However, disk I/O can quickly become a bottleneck under this approach. Because there is no mechanism to slow down the sender, allowing the transmission rate to outpace disk writes results in wasted network bandwidth. With next generation networks running at 100 Mbps, and disks running much slower, this can be a serious problem. Furthermore, random disk writes can be ten times slower than sequential disk writes.
The prior art methods described above provide workable solutions to the challenge of distributing popular data objects to a plurality of network clients, while making efficient use of available bandwidth. However, the prior art does not describe an actual embodiment of a system in which these methods are used. In developing such an embodiment, the inventors have developed certain improvements which increase the efficiency and usefulness of multicast file distribution using data carousel and erasure correction techniques.
The invention embodiments described below include new methods of receiving, buffering, and decoding erasure-encoded blocks such as those described above that are received from a data carousel. In one embodiment, received blocks are written directly to disk as they are received. However, they are segregated by group as they are stored. After reception of the entire data object is complete, each group is read into RAM, sorted, decoded, and then written back to disk.
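One way to picture the group-segregated layout is sketched below. It assumes each group owns a contiguous region of k block slots in the file, and that the receiver keeps the arrival-order indices in memory rather than on disk. Both are illustrative assumptions, as is the class name. Any seekable binary file object can serve as `disk`, and each payload is assumed to be exactly `block_size` bytes.

```python
import io


class GroupSegregatedStore:
    """Each group owns a contiguous region of k slots; blocks land in arrival order."""

    def __init__(self, disk, num_groups, k, block_size):
        self.disk, self.k, self.block_size = disk, k, block_size
        self.indices = [[] for _ in range(num_groups)]  # arrival-order indices

    def store(self, group, index, payload):
        got = self.indices[group]
        if len(got) >= self.k or index in got:
            return False  # redundant block: discard
        slot = group * self.k + len(got)  # next free slot in this group's region
        self.disk.seek(slot * self.block_size)
        self.disk.write(payload)
        got.append(index)
        return True

    def read_group(self, group):
        """Return the group's (index, payload) pairs sorted by index, ready to decode."""
        bs = self.block_size
        self.disk.seek(group * self.k * bs)
        raw = self.disk.read(len(self.indices[group]) * bs)
        blocks = [raw[i:i + bs] for i in range(0, len(raw), bs)]
        return sorted(zip(self.indices[group], blocks))
```

After reception completes, the decode pass needs one sequential read per group, and the sort happens in RAM on at most k blocks at a time.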
In another embodiment, erasure-encoded blocks are segregated into sets of contiguous groups as the blocks are written to disk. After reception is complete, each set is read into RAM, sorted, decoded, and written back to disk. In this embodiment, a buffer can be used to buffer incoming erasure-encoded blocks. Received blocks are buffered as long as they are from the same set of groups. When a new block is received from a different set of groups, the buffer is flushed to disk prior to buffering the new block. The blocks are segregated by set as they are written to disk. However, no other sorting takes place at this time. Alternatively, two buffers can be used so that the new block can be written to the second buffer while the first buffer is flushed to disk.
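The single-buffer variant can be sketched as follows. The sketch assumes a set consists of c consecutive groups, so that a block's set is its group number divided by c, and it takes the disk write as a caller-supplied callback. The class name and the `write_set` callback are hypothetical.

```python
class SetBufferedWriter:
    """Buffers blocks from one set of groups; flushes when a different set arrives."""

    def __init__(self, groups_per_set, write_set):
        self.c = groups_per_set
        self.write_set = write_set  # hypothetical callback: write_set(set_id, blocks)
        self.current_set = None
        self.buffer = []

    def receive(self, group, index, payload):
        set_id = group // self.c  # contiguous groups share a set
        if self.current_set is not None and set_id != self.current_set:
            self.drain()  # a block from a new set arrived: flush first
        self.current_set = set_id
        self.buffer.append((group, index, payload))

    def drain(self):
        """Write the buffered blocks, unsorted, to their set's area on disk."""
        if self.buffer:
            self.write_set(self.current_set, self.buffer)
            self.buffer = []
```

When the sender cycles through the groups in order for each index value, consecutive blocks tend to belong to the same set, so each flush writes a run of up to c blocks in a single operation.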
In another embodiment, a receiver maintains a buffer for every set of groups. Incoming blocks are buffered in the appropriate buffer, and each buffer is flushed to disk when the buffer becomes full.
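A sketch of the per-set buffering follows, again assuming that a set is c consecutive groups and that the disk write is supplied as a callback. The `capacity` parameter, the `close` method for flushing leftovers at the end of reception, and all names are assumptions of the sketch.

```python
from collections import defaultdict


class PerSetBuffers:
    """One buffer per set of groups; a buffer is flushed to disk when it fills."""

    def __init__(self, groups_per_set, capacity, write_set):
        self.c = groups_per_set
        self.capacity = capacity    # blocks per buffer (hypothetical parameter)
        self.write_set = write_set  # hypothetical callback: write_set(set_id, blocks)
        self.buffers = defaultdict(list)

    def receive(self, group, index, payload):
        set_id = group // self.c
        buf = self.buffers[set_id]
        buf.append((group, index, payload))
        if len(buf) >= self.capacity:
            self.flush(set_id)

    def flush(self, set_id):
        blocks = self.buffers.pop(set_id, [])
        if blocks:
            self.write_set(set_id, blocks)

    def close(self):
        """Flush whatever remains once reception is complete."""
        for set_id in list(self.buffers):
            self.flush(set_id)
```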
In yet another embodiment, the receiver maintains a single buffer and repeatedly flushes certain blocks of the buffer corresponding to sets of groups. Prior to each write to disk, the system selects a set of groups whose blocks will be flushed from the primary memory buffer. If any set has at least b blocks in the buffer, that set is selected. Otherwise, any other set is selected. The value b is chosen so that the size of the memory buffer is bc+b−c+1, where c is the number of groups in each set of groups.
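The selection rule and the buffer-size expression can be sketched as below. The description leaves the fallback choice open ("any other set"), so the sketch assumes the fullest set is chosen. The function names and the per-set `counts` mapping are likewise hypothetical.

```python
def buffer_size(b, c):
    """Buffer size in blocks, per the expression in the text: bc + b - c + 1."""
    return b * c + b - c + 1


def select_set_to_flush(counts, b):
    """Pick the set to flush; counts maps set id -> number of buffered blocks."""
    if not counts:
        return None  # nothing buffered
    for set_id, count in counts.items():
        if count >= b:
            return set_id  # a set with at least b blocks is preferred
    # Assumption: when no set has b blocks, fall back to the fullest set.
    return max(counts, key=counts.get)
```

With b=4 and c=3, for instance, the expression gives a buffer of 14 blocks.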