There is an exponentially increasing disparity between CPU (central processing unit) speeds and disk bandwidth. Moore's law predicts a doubling of processor speed every 18 months, whereas disk bandwidth has been doubling only every 2.2 years. The result is an I/O (input/output) bottleneck that undermines many of the advances in processing speed and memory capacity. The process of simply getting data into and out of core memory takes too long. In cases where data does not even fit in main memory, paradigms like external memory and streaming algorithms have been explored as alternatives to the RAM model for designing algorithms. Often, though, increases in memory capacity obviate the need to favor I/O over RAM complexity. Still, simply getting the input from disk to the algorithm comprises a significant portion of the time spent by an application.
Lossless compression has long been used to reduce storage requirements and network transmission costs. Compressing data can help reduce the amount of data that must be accessed from main memory and therefore may be useful in mitigating the I/O bottleneck. Consider transferring b bytes from disk to memory. If the nominal disk bandwidth is d bytes/second, it requires
$$\frac{b}{d}$$

time to effectuate the transfer. If the data can be compressed by some compressor with compression ratio $r$ (the ratio of the size of the compressed data to that of the original), however, and the uncompression speed is $u_r$ bytes/second (compression and uncompression speeds typically depend on the resulting compression ratio, which tends to be similar for different files from the same domain or source), then it takes

$$r\left(\frac{b}{d}\right) + r\left(\frac{b}{u_r}\right)$$

time to read and uncompress the compressed data. Storing the compressed data therefore speeds data transfer whenever

$$rb\left(\frac{1}{d} + \frac{1}{u_r}\right) < \frac{b}{d},$$

or equivalently whenever

$$u_r > d\,\frac{r}{1-r}. \qquad (1)$$
Equation (1) yields several useful observations. First, the benefit of compression is independent of the amount of data being transferred, assuming sufficient data is available to realize the assumed compression ratio. Second, for any fixed compression ratio, the benefit of compression increases proportionately to CPU speed, assuming that uncompression is CPU bound, as it is for compression schemes like Huffman, Lempel-Ziv, and Burrows-Wheeler. This mitigates the I/O bottleneck because increasing CPU speed directly speeds the transfer of data to applications when data is compressed. Third, for a given CPU, the benefit of compression depends on the compression ratio r. As r improves (i.e., gets smaller), the factor r/(1 - r) shrinks as well, so that for compression to be worthwhile in terms of overall data transfer, the demand on uncompression speed relative to the disk bandwidth becomes less onerous.
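By way of illustration only, the break-even condition of Equation (1) may be sketched in a few lines of Python. The function names and the numeric values (a disk bandwidth of 100 MB/second, a compression ratio of 0.25, an uncompression speed of 50 MB/second) are assumptions chosen for the example and do not come from any measured system.

```python
def transfer_time(b, d):
    """Seconds to read b uncompressed bytes at disk bandwidth d (bytes/second)."""
    return b / d


def compressed_transfer_time(b, d, r, u_r):
    """Seconds to read and then uncompress the data: r*(b/d) + r*(b/u_r)."""
    return r * b / d + r * b / u_r


def break_even_uncompression_speed(d, r):
    """Right-hand side of Equation (1): d * r / (1 - r), defined for 0 < r < 1."""
    if not 0 < r < 1:
        raise ValueError("compression ratio r must lie strictly between 0 and 1")
    return d * r / (1 - r)


def compression_helps(d, r, u_r):
    """True when the uncompression speed u_r satisfies Equation (1)."""
    return u_r > break_even_uncompression_speed(d, r)


# Hypothetical figures, chosen only to exercise the formulas.
d = 100e6    # assumed disk bandwidth, bytes/second
r = 0.25     # assumed compression ratio (compressed size / original size)
u_r = 50e6   # assumed uncompression speed, bytes/second
b = 1e9      # amount of data; Equation (1) does not depend on it

threshold = break_even_uncompression_speed(d, r)
plain = transfer_time(b, d)
packed = compressed_transfer_time(b, d, r, u_r)
```

Because b cancels out of the inequality, `compression_helps` takes no size argument, which mirrors the first observation above; lowering `r` lowers `threshold`, which mirrors the third.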
Compression schemes used in practice (e.g., Huffman coding used in pack, Lempel-Ziv coding used in compress, gzip, and zlib, and the Burrows-Wheeler transform used in bzip) all share the characteristic that uncompression must start from the beginning of the compressed data. That is, retrieving any byte requires uncompressing the entire text up to the desired access point. This complicates any application that requires arbitrary access into the data. While some theoretical advances have been made in the area of string matching in compressed data, general-purpose computation over compressed data remains elusive.
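The sequential-access limitation described above may be illustrated with a minimal sketch using Python's standard `zlib` module. The helper `byte_at` and its `step` parameter are hypothetical names introduced only for this example; the sketch simply counts how many bytes must be uncompressed before a requested offset becomes available.

```python
import zlib


def byte_at(compressed, offset, step=64):
    """Return (value, uncompressed_count) for the byte at `offset` of the original data.

    The stream is uncompressed from its beginning, `step` compressed bytes at a
    time, because a zlib stream offers no way to start in the middle.
    """
    decomp = zlib.decompressobj()
    produced = 0  # uncompressed bytes produced before the current piece
    for i in range(0, len(compressed), step):
        out = decomp.decompress(compressed[i:i + step])
        if produced + len(out) > offset:
            return out[offset - produced], produced + len(out)
        produced += len(out)
    out = decomp.flush()  # any output still buffered by the decompressor
    if produced + len(out) > offset:
        return out[offset - produced], produced + len(out)
    raise IndexError("offset lies beyond the end of the data")


# Example data, assumed only for illustration.
data = bytes(range(256)) * 1000
compressed = zlib.compress(data)
value, work = byte_at(compressed, 200_000)
```

In this sketch `work` is always at least `offset + 1`: every byte preceding the requested one is uncompressed even though only a single byte is wanted.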
This access problem may be generalized to situations having the following characteristics. First, data is stored after being transformed in some manner (e.g., compression, encryption, etc.). Second, upon retrieving the data, the transformation must be reversed (e.g., uncompression, decryption, etc.) before an application can act on the retrieved data. Third, after retrieving the data and reversing the transform, if the data is then altered, the data must be re-transformed (e.g., compressed, encrypted, etc.) prior to writing the data back to some form of slow memory, such as a disk drive, tape, CD-ROM, DVD or the like. Given the existing disparity between CPU speed and I/O bandwidth, it would be preferable when retrieving data not to have to reverse the transformation from the beginning of the file all the way to the point for which access is desired. Further, when writing altered data back to slow memory, it would be preferable not to have to re-transform the entire file from the beginning all the way up to the portion of the file that is being altered by the writing process. Rather, it would be more advantageous to be able to read and write randomly at any point within the transformed file.
Some attempts have been made in the past to provide more random access to transformed data stored in slow memory. Typically, the file is partitioned into smaller components, and these components are individually transformed/untransformed (e.g., compressed/uncompressed, encrypted/decrypted, etc.) such that access can be made to a smaller component containing the requested data rather than having to transform and/or untransform the entire file up to the requested data. Although these techniques have provided improved random access to a transformed file, they do not necessarily provide a means by which the segmented components of the transformed file can be indexed and manipulated without significantly burdening the improved performance sought through random access.
One such technique, as applied to compression of files, partitions the original file into segments, compresses each segment individually, and stores each compressed segment starting at the exact location in slow memory (usually disk memory) at which the original uncompressed segment was stored. Thus, while a more random access into the transformed file is facilitated without the need for additional indexing, the disk space is fragmented, disk space is wasted, and access to disk is less than optimal. Another approach partitions the file into segments and then applies the transform (e.g., compression, encryption, etc.) to each segment. The resulting “chunks” (i.e., transformed segments) are then stored contiguously and packed tightly to avoid wasting space. However, if a particular segment is written to and data within that segment is thereby altered, its resulting chunk may increase in size as a result. In this case, the entire layout of the compressed file must be rearranged to accommodate the larger chunk. While it has been proposed to instead store a chunk that has grown larger than its original size at the end of the file (i.e., out of order), this solution will impact the efficiency of disk access where optimal access requires that files be in order.
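The layout problem of the tightly packed approach may be illustrated with the following minimal sketch, offered under stated assumptions: `zlib` stands in for the transform, the 4096-byte segment size is arbitrary, and the names `compress_segments`, `layout`, and `rewrite_segment` are hypothetical helpers rather than any prior-art implementation.

```python
import os
import zlib

SEGMENT_SIZE = 4096  # assumed segment size, chosen only for the example


def compress_segments(data, segment_size=SEGMENT_SIZE):
    """Partition data into fixed-size segments and compress each one separately."""
    return [zlib.compress(data[i:i + segment_size])
            for i in range(0, len(data), segment_size)]


def layout(chunks):
    """Starting offset of each chunk when the chunks are packed back to back."""
    offsets, position = [], 0
    for chunk in chunks:
        offsets.append(position)
        position += len(chunk)
    return offsets


def rewrite_segment(chunks, index, new_segment):
    """Return a new chunk list in which segment `index` holds `new_segment`."""
    updated = list(chunks)
    updated[index] = zlib.compress(new_segment)
    return updated


# Four highly compressible segments, then one is overwritten with random bytes.
original = b"a" * (4 * SEGMENT_SIZE)
chunks = compress_segments(original)
before = layout(chunks)

new_segment = os.urandom(SEGMENT_SIZE)  # random data does not compress
chunks_after = rewrite_segment(chunks, 1, new_segment)
after = layout(chunks_after)
```

Comparing `before` and `after` shows that the chunk for segment 1 has grown and that every later chunk starts at a larger offset, so all of them would have to be moved to keep the file packed and in order.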
The foregoing techniques have been implemented as part of operating system (OS) file systems. As a result, every file stored on the system is treated in the same manner, regardless of whether the data truly benefits from the transform. For example, random data does not compress well, and segmenting it for purposes of compression may actually degrade access time to these files. Because the segmenting process is inaccessible to the user of the computer system when it is performed as part of a file system, there is no way to easily disable the segmenting process for files that do not benefit from the transform. Nor is there any way to fine-tune the segmenting process to optimize the performance advantages with respect to the files on the system. The segmenting process is fixed and applied to all files in the same manner, and the parameters of the process are inaccessible to a user at the file system level.