1. Field of the Invention
The present invention generally relates to a data compression method and a data compression apparatus, which are used for compressing data by omitting redundant parts contained in various kinds of data, such as character codes and image data-and, more particularly, to a data compression method and a data compression apparatus which employ a dictionary coding scheme utilizing similarity of data sequences.
2. Description of the Related Art
Various kinds of data, such as character codes and image data, are being handled by information processors, such as computers. Consequently, the quantity of data handled by information processors is increasing. Such a large quantity of data usually contains redundant data strings. Storage capacity for storing data can be reduced in an information processor by performing compression processing for omitting such redundant parts. Further, data transmission capacity can be reduced in the information processor by using compressed data. Thus, a data transmission time can be shortened.
The LZ77 compression method and the LZ78 compression method have been known as typical data compression methods using the dictionary coding scheme. The LZ77 compression method can obtain a sufficient compression ratio by performing a simpler process, as compared with the LZ78 compression method. Consequently, the LZ77 compression method is mainly employed for practical use. Therefore, the LZ77 compression method is described hereinbelow.
Incidentally, the present invention can be applied not only to compression of character codes but to that of various kinds of data. However, hereunder, each of data represented in word units will be referred to as a “character” (of an alphabet), and data of an arbitrary number of consecutive words will be referred to as a “character string or sequence”, based on information theory.
According to the LZ77 compression method, as illustrated in FIG. 1, a sliding or search buffer 1 of predetermined capacity (incidentally, 16 characters in this example illustrated in this figure) is provided. A character string “defabqaaaacabcde” 2a having already been coded and compressed is stored in this buffer. Subsequently, an encoder searches the sliding buffer for a character string “abcd” 2c which is the maximum or longest match between the stored character string 2a and an input character string 2b “abcdaaaq . . . ” to be encoded. The relative address of the position of the character string 2c found as the longest match is 5 (incidentally, this indicates that the character string 2c starts 5 characters back from the start of the input character string 2b). Further, the match length, namely, the length of the found character string is 4. Then, the relative address and the match length are encoded. Moreover, in the input character string 2b, the character string “abcd” 2c, which is the last longest match, is replaced with a codeword or token (5,4) and thus compressed.
Subsequently, the sliding buffer 1 is shifted four characters to the right. The character string 2a set in the sliding buffer is now “bqaaaacabcdeabcd”. Then, the encoder searches the next input character string “aaaq . . . ” 2b for a match, similarly as in the aforementioned case. Consequently, the current character string “aaa” 2c is found as the longest match. Thus, the occurrence or match position of this character sequence “aaa” in the buffer 1 is 13 (incidentally, this indicates that the current character string “aaa” 2c starts 13 characters back from the first character “a” of the character sequence “aaa” in the input character string 2b). Further, the match length of this longest match “aaa” is 3. Then, such an occurrence position and the match length of this character string “aaa” is encoded in the form of a codeword (13, 3). Moreover, this character string “aaa” is replaced with this codeword (13, 3).
According to LZ77 compression method, as the coding of the input character string proceeds, the encoder shifts the sliding buffer in this way. Therefore, LZ77 compression method is also referred to as a sliding dictionary method.
If the capacity of the sliding buffer used in such an LZ77 compression method is increased, the length of a character string found as the longest match increases. Consequently, the compression ratio is enhanced. However, as a result of the increase in the capacity of the sliding buffer, the encoder should search for an enormous number of combinations of character strings. Thus, in the case of sequentially searching the sliding buffer, the search requires a great deal of effort and time. Therefore, the LZ77 compression method is performed by actually adopting the following process. Namely, a character string (namely, a prefix) consisting of first two to four characters of an input character string and the occurrence position of the prefix are entered into a table as occasion demands, and then the prefix of the input character string is collated with the character list entered into the table, instead of collating all kinds of character strings of the sliding buffer with the input character string. The time required for such search is significantly reduced by employing this process.
The tables used for such a search are a look-up table and a hash table. A method of using a look-up table is to make a character string 2d to be searched for have a one-to-one correspondence to an address in a look-up table 3, as illustrated in FIG. 2. The past occurrence position (namely, the relative address) of the character string is stored at a corresponding address in the look-up table 3. Thus, according to this method, the past occurrence position of the character string “ab” 2d is known by looking up the character string “ab” 2d in the table 3 once to search for this string. Therefore, this method has an advantage in that the search is achieved at a very high speed.
However, in the case that the character string to be searched for is long, the number of combinations of character strings is raised to a higher power. Thus, the look-up table should have an enormous number of addresses. Therefore, this method has a drawback in that a very large amount of memory is needed so as to allocate such an enormous addresses to the look-up table. For example, in the case that the number of characters is 1 (incidentally, 1 character consists of 8 bits), 28*1 (=256 bits) of memory are needed. Further, in the case that the number of characters is 2, 28*2 (≈64 kbits) of memory are needed. Moreover, in the case that the number of characters is 3, 28*3 (≈16 Mbits) of memory are needed. Therefore, the actual limit to the number of characters is 2. Additionally, this method has another drawback in that, when a character string to be searched for is long, only a small part of the look-up table is actually used (namely, only a small part of memory area assigned to the look-up table is used for entering the past occurrence positions of the character string into this table) and that thus, the look-up table is in a sparse state, and the efficiency of use of the look-up table is low.
In the case of a method of searching for a character string by using a hash table, as illustrated in FIG. 3, masking processing is performed on a codeword, which corresponds to a character string to be searched for, in such a way as to decrease the number of bits of the codeword (namely, the degeneration of the codeword is performed). Thus, a hash code 6 is generated (see 4) so that a plurality of character strings having a common degenerated state share an area of the hash table 5. Thus, this method features that, as compared with the method of searching for a character string using a look-up table, a longer character string can be searched for, when a search area, in which the character string is searched for, is equivalent to that used in the method using the look-up table.
However, in the case of the method using the hash table obtained in this way, the degeneration is performed on character strings to be searched for. Thus, there occurs a problem (what is called a collision or conflict problem) that a character string “abc” 2d and another character string, which have a common degenerated state, may be entered into the same area 7 provided in the hash table.
To solve this collision problem, this method further requires an additional operation of collating a character string found by the search with each of characters of a character string to be searched for and checking whether or not the searched or found character string is the character string to be actually searched for.
As described above, in case of the aforementioned method of searching for a character string by using the look-up table according to the LZ77 compression scheme, while a character string to be searched for can be found at a high speed by looking up the character string to be searched for, in the look-up table only once, the table size of the look-up table is increased or raised to a higher power with an increase in the number of characters of the character string to be searched for. Thus, the look-up table has an enormous table size, and the number of characters to be used for the search cannot be large (actually, the limit to the number of characters is 2). Consequently, this method using the look-up table has a drawback in that the compression ratio cannot have a very high value.
On the other hand, in the case of the aforementioned method of searching for a character string by using a hash table, the table size of a necessary table is smaller, as compared with the method of searching for a character string by using the look-up table. Thus, although the search is achieved at a high speed by using the table of a reasonable table size, this method using the hash table has a drawback in that an additional collating operation for solving the collision problem is needed still more. Incidentally, this collating operation should be performed on each of characters of the character string to be searched for and requires a great deal of effort.