1. The Structure of Compressed Archived Files
Electronic data files are often downloaded in an archived and compressed format. That is, one or more files are first combined into a single file called an “archive.” The archive is then compressed into a smaller, compressed file. Compressing an archive file is often more effective (i.e., it results in a smaller amount of data) than compressing each of the unarchived files individually, because redundancy shared between the files can be exploited.
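The size advantage of compressing an archive can be sketched as follows. This is only an illustration under stated assumptions: the three file contents are hypothetical, simple concatenation stands in for a real archive format, and zlib is used as an example compressor that the text above does not name.

```python
import zlib

# Three hypothetical small files that share most of their content.
files = [
    b"name=alpha\n" + b"shared configuration line\n" * 20,
    b"name=beta\n" + b"shared configuration line\n" * 20,
    b"name=gamma\n" + b"shared configuration line\n" * 20,
]

# Compress each file on its own and add up the compressed sizes.
separate_size = sum(len(zlib.compress(f)) for f in files)

# Combine first (a stand-in for archiving), then compress once.
archive = b"".join(files)
compressed_archive = zlib.compress(archive)
combined_size = len(compressed_archive)

print("compressed separately:", separate_size, "bytes")
print("archived, then compressed:", combined_size, "bytes")
```

For files that have content in common, as in this example, the second figure is smaller, because the compressor can reuse what it saw in earlier files and pays its fixed per-stream overhead only once.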
Compressed data is, in general, “sequential access” data. This means that it is not possible to select an arbitrary position in the compressed data and decompress the information stored at that position. To make use of the data at such a position, it is necessary to read the compressed data from the beginning. If a compressed file is particularly large, it is often compressed in units called “blocks.” The compressed file is sequential access within each block, but a computer can begin reading at the beginning of any one of the blocks. This is useful for error recovery: if a compressed file is partly damaged, any undamaged blocks are still readable.
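Block-wise compression and the error recovery it allows can be sketched as follows. This is a minimal illustration, not the format of any particular compressor: the block size, the function names, and the use of zlib for each block are assumptions made for the example.

```python
import zlib

BLOCK_SIZE = 1024  # hypothetical uncompressed block size


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress each block independently so any block can be read alone."""
    return [
        zlib.compress(data[i:i + block_size])
        for i in range(0, len(data), block_size)
    ]


def read_block(blocks, index):
    """Decompress one block without reading the blocks before it."""
    return zlib.decompress(blocks[index])


def recover(blocks):
    """Return the contents of every readable block, None for damaged ones."""
    recovered = []
    for block in blocks:
        try:
            recovered.append(zlib.decompress(block))
        except zlib.error:
            recovered.append(None)
    return recovered


data = bytes(range(256)) * 16      # 4096 bytes, i.e. four blocks
blocks = compress_blocks(data)
third_block = read_block(blocks, 2)  # starts reading at a block boundary

blocks[1] = b"\x00damaged"         # simulate damage to the second block
recovered = recover(blocks)        # the other three blocks stay readable
```

Within a block the data must still be read from the start of that block, which matches the description above: random access is available only at block boundaries.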
Like compressed files, archive files also store data in blocks. Each file in the archive starts at the beginning of a new block, but the file may take up several blocks. In practice, each archived file is stored together with a “header” that contains information about the file, such as the name of the file and when the file was created. The header is usually stored at the beginning of the file, at the start of an archive block.
Structuring archive files in blocks simplifies processing when files are removed from the archive. By starting new files at the beginning of a block, de-archiving software does not need to look for the start of a new file in the middle of a block.
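The block alignment described above can be observed in the tar format, which is used here only as a familiar example of a block-structured archive; the text above does not name a specific format. The sketch uses Python's standard tarfile module, and the member names and contents are hypothetical.

```python
import io
import tarfile


def build_archive(members):
    """Build an in-memory tar archive from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w",
                      format=tarfile.USTAR_FORMAT) as tar:
        for name, payload in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def member_offsets(archive_bytes):
    """List (name, header offset, data offset, size) for each member."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return [(m.name, m.offset, m.offset_data, m.size)
                for m in tar.getmembers()]


archive_bytes = build_archive({"a.txt": b"x" * 10, "b.txt": b"y" * 700})
offsets = member_offsets(archive_bytes)
for name, header_at, data_at, size in offsets:
    # Each header starts on a block boundary; the data follows the header.
    print(name, header_at, data_at, size, header_at % tarfile.BLOCKSIZE)
```

In this example each member's header begins at a multiple of the archive block size (`tarfile.BLOCKSIZE`), so a 10-byte file still occupies a full block and the next file never starts in the middle of a block.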
The structure of a conventional compressed archive is illustrated in FIG. 1. Data files 10, 12, and 14 are combined into an archive file 16. The archive file 16 is in turn compressed to a compressed file 18.
The archive file 16 is divided into archive blocks, such as archive blocks 20 and 22. (Archive blocks are delineated in FIG. 1 with dotted lines.) File boundaries 24 and 26 align with the beginning of archive blocks. (File boundaries are illustrated by solid bolded lines in FIG. 1.) The compressed archive 18 is divided into compression blocks, such as compression blocks 28 and 30. (Compression blocks are delineated by solid unbolded lines in FIG. 1.)
As is seen in the illustration of the compressed archive 18, the compression blocks (such as blocks 28 and 30) do not necessarily align with file boundaries or with archive blocks (such as blocks 20′ and 22′, the compressed equivalent of archive blocks 20 and 22).
2. Error Recovery with Compressed Archived Files
Compressed archived files are useful for downloading data files for several reasons. First, it is simpler and more convenient for a user or a software program to request a single archive file than it is to request several different files for download. Second, downloading an archive, even if it is not compressed, can be faster than downloading individual files because it is not necessary to establish a new connection between computers for each different file. Third, compression of the data files means that fewer bytes of data need to be transmitted over the connection, resulting in a faster download time. Finally, the archiving of the data files before compression often results in more effective compression and, as a result, still shorter download times.
Nevertheless, there are disadvantages to the use of compressed archived files for data transfer. Compressed archives may be rather large. As a result, a compressed archive can be difficult to manage if there is an interruption, such as an error or a disconnection, during download. One known way of recovering from an interrupted download is to request the entire file again in the hope that the second download attempt will not fail. Of course, this route is particularly inefficient. A more effective alternative is to save the compressed archive as it is downloaded. If the download is interrupted, the downloading computer can request that the download resume at or before the point of failure. The “range retrieval request” feature of HTTP 1.1, as described in section 14.36.2 of RFC 2068 (January 1997), allows a computer to request a download starting at a position within a file. Although this technique is more efficient because the file is not downloaded twice, the sequential access nature of compressed data requires the downloading computer to store large amounts of the compressed archive in local data storage before decompression. This technique can be ineffective if there is limited storage space on the downloading computer.
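A resumed download using a range retrieval request can be sketched as follows. This is a minimal illustration that only constructs the request and does not contact a server; the function name, the URL, and the byte count are hypothetical.

```python
import urllib.request


def resume_request(url, bytes_already_saved):
    """Build a request for the part of the file that is still missing."""
    request = urllib.request.Request(url)
    if bytes_already_saved > 0:
        # Ask the server to start at the first byte not yet saved locally.
        request.add_header("Range", f"bytes={bytes_already_saved}-")
    return request


# Hypothetical example: 1024 bytes were saved before the interruption.
request = resume_request("http://example.com/archive.tar.gz", 1024)
print(request.get_header("Range"))
```

A server that honors the range replies with status 206 (Partial Content) and only the requested bytes, which the downloading computer appends to the saved portion. A server may also ignore the header and send the whole file, so a client should check the response status before appending.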