1. Field of the Invention
This invention relates generally to compression and decompression of data and, more particularly, to the construction of compression dictionaries for use with computer data compression and decompression systems.
2. Description of the Related Art
Data compression refers to transforming a set of data characters from an original state to a compressed state in which the data is represented by a reduced number of characters. More particularly, a string of characters in the original data is replaced with a reduced number of compression symbols in the compressed data. The original string of characters can include, for example, letters, numerals, punctuation marks, spaces or blanks, and special characters. The compressed data can be decompressed to restore the data back to the original state by replacing the compression symbols with the original substrings. Data compression has particular utility in computer systems because it can significantly reduce the storage requirements of large databases, thereby reducing the cost of data processing. For example, databases with large amounts of textual data such as names, locations, job titles, and the like or databases with large amounts of numerical data such as account statements are especially suited to compression techniques.
Computer-implemented data compression is carried out in systems that are typically characterized as providing either hardware compression or software compression. Hardware compression generally refers to compression and decompression that take place at a hardware storage device interface and are implemented by a dedicated hardware processor, such as one or more integrated circuit chips mounted on a circuit board. Generally, the hardware compression can be implemented by processors located at each end of a communications channel so that compression and decompression can take place as data is stored or retrieved. For example, data may be compressed as it is processed for storage on a magnetic disk and may be decompressed as it is retrieved from the disk. Software compression generally refers to data compression that occurs over an entire data set already stored in a storage device. That is, a software compression process is called by a controlling process and receives a block of data to be compressed, performs the data compression, and returns the compressed data to the controlling process. Software compression in a computer facility is generally implemented by software processes that are part of the system installation.
The configuration of a computer data compression system is the result of a trade-off between the reduction in storage space required for compressed data and the computational effort required to compress and decompress the data. The reduction in storage space is generally measured in terms of compression ratio, which relates the number of original symbols or characters, also called the data input size, to the number and length of the compressed symbols, also called the output size.
Computer data compression systems can carry out data compression by scanning an input string of characters to be compressed with a string table, also called a compression dictionary, to generate a string of compression symbols. More particularly, the dictionary is used to parse the input string so that substrings of the input characters are replaced with the dictionary symbols to which they are matched. In this way, long runs of characters or recurring word patterns should be replaced with reduced-length symbols, thereby resulting in data compression. Data decompression is easily carried out using a reverse process to match compression symbols with corresponding original input strings. Dictionaries that provide very efficient compression can be formed by software compression techniques in which the input data to be compressed is analyzed and then the dictionary is created as a result of character occurrence information obtained from the analysis. Alternatively, relatively fast hardware compression can be carried out by selecting a dictionary that was previously created based on the data anticipated to be received in the data channel and using the predetermined dictionary in the compression process.
A compression process can parse a data string, or split it into phrases for dictionary coding. For example, consider the case where a predetermined compression dictionary contains the parsing alphabet M, where
M={a, b, ba, bb, abb}
and the dictionary alphabet is mapped to compression output symbols C, where
C(a)=00, C(b)=010, C(ba)=0110, C(bb)=0111, and C(abb)=1.
Next, consider the data string "babb". If each character of the data string b, a, and so forth requires eight bits for representation, then the uncompressed data string having four characters requires thirty-two bits to be stored in memory. In contrast, using the compression dictionary M, the data string would be encoded as C(ba).C(bb)=0110.0111, which requires eight bits, or, ideally, as C(b).C(abb)=010.1, which requires only four bits. In either case, it can be seen that there is a net reduction in the number of bits necessary to store the data string.
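The two parsings of "babb" discussed above can be reproduced with a minimal sketch. The function names greedy_parse, optimal_parse, and encode are hypothetical and are not taken from any described system; the sketch assumes a longest-match-first parser for the first encoding and a simple dynamic program over bit cost for the "ideal" encoding.

```python
# Dictionary M and code C from the example above.
CODES = {"a": "00", "b": "010", "ba": "0110", "bb": "0111", "abb": "1"}


def greedy_parse(text, codes=CODES):
    """Split text by always taking the longest dictionary phrase (assumed strategy)."""
    longest = max(len(phrase) for phrase in codes)
    phrases = []
    i = 0
    while i < len(text):
        for n in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + n] in codes:
                phrases.append(text[i:i + n])
                i += n
                break
        else:
            raise ValueError("no dictionary phrase matches at position %d" % i)
    return phrases


def optimal_parse(text, codes=CODES):
    """Split text so that the total number of output bits is minimal."""
    n = len(text)
    best = [None] * (n + 1)  # best[i] = (bits, phrases) for the suffix text[i:]
    best[n] = (0, [])
    for i in range(n - 1, -1, -1):
        for phrase, code in codes.items():
            rest = best[i + len(phrase)] if text.startswith(phrase, i) else None
            if rest is None:
                continue
            bits = rest[0] + len(code)
            if best[i] is None or bits < best[i][0]:
                best[i] = (bits, [phrase] + rest[1])
    if best[0] is None:
        raise ValueError("text cannot be parsed with this dictionary")
    return best[0][1]


def encode(phrases, codes=CODES):
    """Concatenate the output symbols for a list of parsed phrases."""
    return "".join(codes[phrase] for phrase in phrases)
```

Under these assumptions, greedy_parse("babb") yields the phrases ba and bb (eight output bits), while optimal_parse("babb") yields b and abb (four output bits), matching the two encodings given in the text.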
Many data compression systems are based on techniques first described by J. Ziv and A. Lempel in 1977. See, for example, U.S. Pat. No. 4,464,650 to Eastman et al. for Apparatus and Method for Compressing Data Signals and Restoring the Compressed Data Signals, issued Aug. 7, 1984. In accordance with the Ziv-Lempel techniques (also referred to by the abbreviation L-Z), an input string of characters to be compressed is scanned such that repeated sequences of characters are replaced with pointers to earlier occurrences of the sequence in the input string.
For example, one form of pointer is an ordered pair (m, n) where m identifies the longest matching previously seen phrase in the input string and n represents the next character in the input string. Each input phrase is encoded as an index to the previously seen phrase (a phrase prefix) followed by the next character. The new phrase is then added to the list of phrases that may be referenced, which together make up a dictionary. The encoded phrases thereby represent the output of the L-Z process. Thus, the data string "aaabbabaabaaabab" would be parsed into seven phrases as follows: a, aa, b, ba, baa, baaa, and bab. Each of the parsed phrases comprises a previously seen phrase followed by the next character. The phrases would be encoded as a pointer to the previously seen phrase, followed by the character. Thus, the first phrase would be encoded as (0, a), meaning no previously seen phrase, followed by the character "a". The "a" is therefore added to the phrases to be referenced. The second phrase, "aa", would be encoded as (1, a), meaning the first phrase "a" followed by the character "a". The third phrase, "b", would be encoded as (0, b). The fourth phrase, "ba", would be encoded as (3, a), meaning the third phrase "b" followed by "a". The fifth phrase, "baa", would be the fourth phrase "ba" followed by "a" and therefore would be encoded as (4, a). The sixth phrase, "baaa", would be encoded as (5, a) and, finally, the seventh phrase, "bab", would be encoded as (4, b). The encoded output string would therefore appear as (0, a), (1, a), (0, b), (3, a), (4, a), (5, a), (4, b). It should be readily apparent that the output string can be easily decoded to reproduce the original data string.
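The parsing and decoding just described can be illustrated with a short sketch. The names lz_encode and lz_decode are hypothetical, and the handling of a trailing partial phrase at the end of the input is an assumption, since the text does not address that case.

```python
def lz_encode(text):
    """Encode text as (index, character) pairs; index 0 means no earlier phrase."""
    phrase_index = {"": 0}  # maps each seen phrase to its phrase number
    output = []
    current = ""
    for ch in text:
        if current + ch in phrase_index:
            current += ch  # keep extending the longest previously seen match
        else:
            output.append((phrase_index[current], ch))
            phrase_index[current + ch] = len(phrase_index)
            current = ""
    if current:
        # Assumption: leftover input that matches an existing phrase is
        # emitted as its own prefix plus its final character.
        output.append((phrase_index[current[:-1]], current[-1]))
    return output


def lz_decode(pairs):
    """Rebuild the original text from (index, character) pairs."""
    phrases = [""]  # phrase 0 is the empty phrase
    for index, ch in pairs:
        phrases.append(phrases[index] + ch)
    return "".join(phrases[1:])
```

For the string "aaabbabaabaaabab", lz_encode returns the seven pairs listed above, and lz_decode applied to those pairs reproduces the original string.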
As known to those skilled in the art, the encoding described above can be implemented by inserting each parsed phrase into what is referred to as a trie data structure. A trie data structure is a tree where each phrase defines a path that begins at a root (the prefix) and ends at a node, which is then associated with that phrase. Each node of the trie structure contains the number of the phrase it represents. For example, FIG. 1 shows the data structure generated while parsing the string described in the previous paragraph. As shown in the drawing, the last phrase to be inserted was "bab", which identified the fourth node as the longest previous match and caused the creation of the seventh node with the additional "b" character. It should be noted that tracing the path from node 0 to node 7 produces the "bab" phrase that was encoded.
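A trie of this kind can be sketched as follows. The class TrieNode and the functions build_trie and phrase_of are hypothetical names chosen for illustration, and the sketch is not intended to reproduce FIG. 1 exactly. It assumes each node stores its phrase number, a link to its parent, and the character on the edge leading to it; a trailing partial match at the end of the input is ignored for brevity.

```python
class TrieNode:
    """One node of the parsing trie; node 0 is the root (empty phrase)."""

    def __init__(self, number, parent=None, char=""):
        self.number = number    # phrase number represented by this node
        self.parent = parent    # node of the phrase prefix, None for the root
        self.char = char        # character on the edge from the parent
        self.children = {}      # maps next character -> child node


def build_trie(text):
    """Parse text into phrases, inserting each new phrase as a trie node."""
    root = TrieNode(0)
    nodes = [root]
    pairs = []
    node = root
    for ch in text:
        child = node.children.get(ch)
        if child is not None:
            node = child  # follow the existing path (longest previous match)
            continue
        new_node = TrieNode(len(nodes), node, ch)
        node.children[ch] = new_node
        nodes.append(new_node)
        pairs.append((node.number, ch))
        node = root  # start matching the next phrase at the root
    return nodes, pairs


def phrase_of(node):
    """Recover a phrase by tracing the path between the root and the node."""
    chars = []
    while node.parent is not None:
        chars.append(node.char)
        node = node.parent
    return "".join(reversed(chars))
```

Building the trie for "aaabbabaabaaabab" under these assumptions creates nodes 0 through 7; node 7 has node 4 as its parent, and phrase_of applied to node 7 returns "bab".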
The input string for an L-Z process is scanned according to a scanning window of predetermined character length. Initially encountered substrings are added to a dictionary/compressed data table. Thus, an L-Z dictionary changes as input substrings are encoded as references to earlier occurrences, and it includes a list of previously parsed input strings. In this way, the dictionary can grow as the compression process proceeds and can adapt somewhat to the input string being compressed. Some limitation must be placed on the growth of the dictionary in an L-Z process because larger dictionaries require larger output symbol sizes, and this may result in less efficient compression rather than the improved compression one might expect from having more dictionary entries to choose from. For example, U.S. Pat. No. 5,151,697 to Bunton for Data Structure Management Tagging System, issued Sep. 29, 1992, describes a method of controlling dictionary size and growth. The performance of L-Z systems is dictated by the size of the scanning window over which substrings are compared to earlier occurrences of substrings, by the substring pointer, and by the dictionary construction criteria.
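The link between dictionary size and output symbol size can be made concrete with a small sketch. It assumes that each output symbol is a fixed-width index wide enough to address every dictionary entry, followed by one eight-bit character; actual systems may use other symbol layouts. The names index_bits and encoded_bits are hypothetical.

```python
def index_bits(dictionary_size):
    """Bits in a fixed-width index able to address every dictionary entry."""
    if dictionary_size < 1:
        raise ValueError("dictionary must contain at least one entry")
    # (n - 1).bit_length() is the ceiling of log2(n) for n >= 2.
    return max(1, (dictionary_size - 1).bit_length())


def encoded_bits(num_phrases, dictionary_size, char_bits=8):
    """Total output size when each phrase is one index plus one character."""
    return num_phrases * (index_bits(dictionary_size) + char_bits)
```

Under this assumption, growing a dictionary from 256 to 257 entries raises every index from eight to nine bits, so each additional entry must improve matching enough to offset the wider symbols.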
Those skilled in the art will appreciate that data compression with predetermined dictionaries can be relatively fast as compared with L-Z techniques, due largely to the computational effort required by L-Z techniques in building and scanning a parsing tree structure. Generally speaking, L-Z techniques operate more quickly when decoding, or decompressing data, than they do when encoding, or compressing data. In addition, predetermined dictionaries are of a known size in terms of the memory needed for storage. As noted above, L-Z techniques can require excessive storage for the dictionary as it is built and grows during encoding. Those skilled in the art, however, will appreciate that different predetermined compression dictionaries will provide very different compression ratios depending on the match between dictionary entries and input substrings. On the other hand, L-Z techniques provide a process that can adapt to the actual data being compressed.
Some computer systems that support software data compression provide a compression processor and permit a calling routine to specify a predetermined compression dictionary based on a user estimate of the type of data contained in the input string. This permits relatively fast encoding with minimal computational overhead. The compression dictionary is selected from a reservoir or library of generic dictionaries that are adapted to compress a variety of data. Unfortunately, designation of predetermined dictionaries that are not particularly suited to the input string that is received can have negative consequences on the data compression process and result in a minimally compressed data set.
From the discussion above, it should be apparent that there is a need for a method and computer system that permit effective selection of compression dictionaries from among a set of predetermined dictionaries and also permit the encoding process to adapt to the data actually being compressed, thereby permitting data compression to occur quickly and efficiently without needing excessive storage space for compression dictionaries. The present invention satisfies this need.