There is considerable interest among those in the fields of information storage and communication in reducing the capacity requirements of information so that the information can be stored on storage media and/or transmitted through communication channels having lower capacity than otherwise required. Information represented in forms having reduced capacity requirements can be stored in less space and can be transmitted over communication channels having, for example, lower bandwidth or lower bit rates.
Data "compression" is one technique sometimes used to reduce information capacity requirements. As used herein, the term data "compression" refers to a process of generating an output representation of information in response to an input information stream where the output representation requires fewer data elements than the input stream. The output representation is said to be a "compressed" representation. Data compression is well known and a number of techniques are reviewed by Williams, Adaptive Data Compression, Kluwer Academic Publishers, 1991, pp. 1-104, by Bell, Cleary and Witten, Text Compression, Prentice-Hall, 1990, and by Storer, Data Compression, Computer Science Press, 1988.
Data "decompression" refers to the inverse process used to recover the information stream from a compressed representation. A compression technique is "lossless" if the inverse decompression technique can perfectly recover the input information stream from the compressed representation.
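The lossless round trip described above may be sketched with a general-purpose compressor from the Python standard library; the choice of zlib here is purely illustrative and does not represent any particular method discussed in this section.

```python
import zlib

def compress(stream: bytes) -> bytes:
    """Generate an output representation requiring fewer data elements."""
    return zlib.compress(stream, 9)

def decompress(representation: bytes) -> bytes:
    """Inverse process: recover the input information stream."""
    return zlib.decompress(representation)

text = b"the quick brown fox jumps over the lazy dog " * 100
rep = compress(text)
assert len(rep) < len(text)      # compressed representation is smaller
assert decompress(rep) == text   # lossless: input perfectly recovered
```

The second assertion is the lossless property itself: decompression recovers the input information stream exactly from the compressed representation.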
Lempel-Ziv or LZ methods constitute a well-known class of lossless compression techniques which parse an input stream into "packets" of information and generate a "token" to represent a group of packets having the same contents. The term "packet" as used herein refers to any convenient grouping of information. Such techniques are referred to as "substitutional" techniques because a token is "substituted" for the contents of a packet. To the extent the token imposes lower information capacity requirements than the packet information it represents, the resulting representation is compressed. "Compression ratios" in excess of 3:1 are not unusual for normal English text; that is, the compressed representation imposes an information capacity requirement one-third of that imposed by the input information stream.
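Token substitution can be sketched minimally as follows. Fixed-size packets and bare integer tokens are simplifying assumptions for illustration only; actual LZ methods parse variable-length packets and encode tokens compactly.

```python
def substitutional_compress(stream: bytes, packet_size: int = 4):
    """Parse the stream into fixed-size packets and substitute one
    integer token for every packet having the same contents."""
    dictionary = {}   # packet contents -> token
    tokens = []
    for i in range(0, len(stream), packet_size):
        packet = stream[i:i + packet_size]
        if packet not in dictionary:
            dictionary[packet] = len(dictionary)   # defining occurrence
        tokens.append(dictionary[packet])
    return tokens, dictionary

def substitutional_decompress(tokens, dictionary) -> bytes:
    """Inverse process: substitute packet contents back for each token."""
    by_token = {t: p for p, t in dictionary.items()}
    return b"".join(by_token[t] for t in tokens)

data = b"abcdabcdabcdxyz!"
tokens, d = substitutional_compress(data)
assert tokens == [0, 0, 0, 1]   # one token reused for three equal packets
assert substitutional_decompress(tokens, d) == data
```

The reuse of token 0 for three identical packets is the source of the compression; the technique remains lossless because each token is defined by exactly one packet's contents.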
Substitutional compression methods normally use either a "dictionary" or "history" structure to improve the efficiency of token substitution. A history is a particular type of dictionary which is constructed during a compression process and contains tokens representing packets in a portion of an information stream currently held in a buffer. As packets occurring later in the stream are received into the buffer, older packets in the buffer must be discarded. The corresponding tokens in the history can also be discarded. Other dictionary schemes may use more sophisticated buffering techniques which, for example, discard the least-recently-used packets and tokens as later packets are received into the buffer. The term "dictionary" is used herein to refer to a structure which defines a token in terms of the information it represents and which indexes each occurrence of token substitution in the compressed representation. The "defining packet" contains the information which defines the meaning of the token in the compressed representation.
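A history of this kind can be sketched as follows. The class name, method name, and first-in-first-out discard policy are hypothetical choices for illustration; as noted above, other schemes discard least-recently-used entries instead.

```python
from collections import OrderedDict

class History:
    """A dictionary built during compression whose entries are
    discarded as older packets leave the buffer (FIFO policy here;
    an LRU variant would re-order entries on each use instead)."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.next_token = 0
        self.entries = OrderedDict()   # defining packet -> token

    def lookup_or_define(self, packet: bytes):
        """Return (token, found).  Defines a new token for an unseen
        packet, discarding the oldest entry when the buffer is full."""
        if packet in self.entries:
            return self.entries[packet], True
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # oldest packet and token go
        token = self.next_token
        self.next_token += 1
        self.entries[packet] = token
        return token, False

h = History(capacity=2)
h.lookup_or_define(b"abcd")   # defines token 0
h.lookup_or_define(b"efgh")   # defines token 1
h.lookup_or_define(b"ijkl")   # buffer full: b"abcd" is discarded
assert b"abcd" not in h.entries
assert b"ijkl" in h.entries
```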
Known compression techniques such as LZ methods attempt to optimize the compression of an input information stream, such as text from a document, by attempting to achieve the highest possible compression ratio. Compression can be enhanced by increasing the packet-to-token ratio or packet-token ratio, which is the number of times a token can be used to represent different instances of packets containing the same information. Compression can also be enhanced by decreasing the information capacity requirements of a token relative to the capacity requirements of the packets it represents. The relative information capacity requirements of tokens as compared to packets can generally be improved by increasing the size or information capacity of each packet; however, this tends to reduce the packet-token ratio in many prior art compression methods. As a result, attempts to optimize the compression ratio generally must balance this ratio against packet size.
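The balance described above can be made concrete with illustrative numbers; the packet sizes, token sizes, and reuse counts below are assumptions chosen for the arithmetic, not figures from any particular method.

```python
def compression_ratio(packet_bytes: int, token_bytes: int, uses: int) -> float:
    """Ratio of input capacity to compressed capacity when `uses`
    occurrences of a packet are replaced by tokens, counting the
    defining packet once (illustrative accounting only)."""
    input_size = uses * packet_bytes
    compressed_size = packet_bytes + uses * token_bytes
    return input_size / compressed_size

# Larger packets improve the token's relative capacity advantage
# when the reuse count is held equal...
assert compression_ratio(64, 4, 10) > compression_ratio(16, 4, 10)

# ...but if enlarging packets cuts reuse from 10 to 2, the larger
# packet yields the worse overall ratio:
assert compression_ratio(64, 4, 2) < compression_ratio(16, 4, 10)
```

With these numbers, the 64-byte packet reused 10 times yields a ratio of about 6.2:1, but reused only twice it falls to about 1.8:1, below the 16-byte packet's roughly 2.9:1; this is the tradeoff between packet size and the packet-token ratio.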
The packet-token ratio is especially significant for the compression of large-volume information streams and the compression of multiple information streams, particularly multiple generations of information where each generation is an altered version of the previous generation. Known compression techniques are unable to achieve effective packet-token ratios because parsing is "context sensitive." Even minor changes in the contents of an information stream can radically alter what packets are parsed from the stream.
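The effect of context-sensitive parsing can be seen with position-based parsing, used here only because it makes the boundary shift easy to observe; LZ methods parse variable-length packets, but their parsing decisions likewise depend on the preceding contents of the stream.

```python
def parse_fixed(stream: bytes, packet_size: int = 4):
    """Position-based parsing: packet boundaries fall at fixed offsets,
    so every boundary depends on how much data precedes it."""
    return [stream[i:i + packet_size]
            for i in range(0, len(stream), packet_size)]

original = b"The quick brown fox jumps over the lazy dog."
edited   = b"A" + original   # a single byte inserted at the front
a, b = parse_fixed(original), parse_fixed(edited)

# One minor change shifts every later boundary, so no packets coincide
# and no token can be shared between the two parses:
assert len(set(a) & set(b)) == 0
```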
In contrast to the parsing of LZ and other methods, "context insensitive" parsing will parse similar information streams into identical packets except for the packets parsed from portions of the streams near the dissimilarities. As a result, the packet-token ratio can be much greater. It should be appreciated that each of the information streams may represent a distinct document or different sections from one document, for example.
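Context-insensitive parsing can be sketched by placing packet boundaries wherever the local content satisfies a fixed predicate. A space byte serves as the predicate here purely for illustration; the point is that each boundary depends only on nearby bytes, never on absolute position in the stream.

```python
def parse_content_defined(stream: bytes):
    """Context-insensitive parsing sketch: a packet ends wherever the
    local content matches a fixed predicate (a space byte here), so
    boundaries are unaffected by how much data precedes them."""
    packets, start = [], 0
    for i, byte in enumerate(stream):
        if byte == 0x20:                 # content-defined cut point
            packets.append(stream[start:i + 1])
            start = i + 1
    if start < len(stream):
        packets.append(stream[start:])
    return packets

original = b"The quick brown fox jumps over the lazy dog."
edited   = b"One EDIT here: " + original   # dissimilarity at the front
a, b = parse_content_defined(original), parse_content_defined(edited)

# Only the packets near the dissimilarity differ; every packet of the
# original stream reappears identically in the edited stream's parse:
assert set(a) <= set(b)
```

Because the two similar streams share most of their packets, a single token can represent each shared packet in both compressed representations, which is the source of the much greater packet-token ratio.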