1. Field of the Invention
The present invention generally relates to a data compression method and a data compression apparatus, which are used for compressing data by omitting redundant parts contained in various kinds of data, such as character codes and image data and, more particularly, to a data compression method and a data compression apparatus which employ a dictionary coding scheme utilizing similarity of data sequences.
2. Description of the Related Art
Various kinds of data, such as character codes and image data, are being handled by information processors, such as computers. Consequently, the quantity of data handled by information processors is increasing. Such a large quantity of data usually contains redundant data strings. Storage capacity for storing data can be reduced in an information processor by performing compression processing for omitting such redundant parts. Further, data transmission capacity can be reduced in the information processor by using compressed data. Thus, a data transmission time can be shortened.
The LZ77 compression method and the LZ78 compression method are known as typical data compression methods using the dictionary coding scheme. The LZ77 compression method can obtain a sufficient compression ratio by performing a simpler process, as compared with the LZ78 compression method. Consequently, the LZ77 compression method is mainly employed for practical use. Therefore, the LZ77 compression method is described hereinbelow.
Incidentally, the present invention can be applied not only to compression of character codes but also to that of various kinds of data. However, hereunder, each item of data represented in word units will be referred to as a “character” (of an alphabet), and data of an arbitrary number of consecutive words will be referred to as a “character string or sequence”, based on information theory.
According to the LZ77 compression method, as illustrated in FIG. 1, a sliding or search buffer 1 of predetermined capacity (incidentally, 16 characters in the example illustrated in this figure) is provided. A character string “defabqaaaacabcde” 2a having already been coded and compressed is stored in this buffer. Subsequently, an encoder searches the sliding buffer for a character string “abcd” 2c which is the maximum or longest match between the stored character string 2a and an input character string 2b “abcdaaaq . . . ” to be encoded. The relative address of the position of the character string 2c found as the longest match is 5 (incidentally, this indicates that the character string 2c starts 5 characters back from the start of the input character string 2b). Further, the match length, namely, the length of the found character string is 4. Then, the relative address and the match length are encoded. Moreover, in the input character string 2b, the character string “abcd” 2c, which is the longest match, is replaced with a codeword or token (5,4) and thus compressed.
Subsequently, the sliding buffer 1 is shifted four characters to the right. The character string 2a set in the sliding buffer is now “bqaaaacabcdeabcd”. Then, the encoder searches the sliding buffer for a match with the next input character string “aaaq . . . ” 2b, similarly as in the aforementioned case. Consequently, the current character string “aaa” 2c is found as the longest match. Thus, the occurrence or match position of this character sequence “aaa” in the buffer 1 is 13 (incidentally, this indicates that the current character string “aaa” 2c starts 13 characters back from the first character “a” of the character sequence “aaa” in the input character string 2b). Further, the match length of this longest match “aaa” is 3. Then, such an occurrence position and the match length of this character string “aaa” are encoded in the form of a codeword (13, 3). Moreover, this character string “aaa” is replaced with this codeword (13, 3).
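The two search steps described above can be reproduced with a minimal Python sketch. The function names `longest_match` and `slide` are hypothetical, and the sketch assumes that the match is confined to the sliding buffer and that, among equally long matches, the one nearest the input is preferred; neither assumption is stated in the text above.

```python
def longest_match(window: str, lookahead: str) -> tuple:
    """Return (relative address, match length) of the longest match.

    The relative address counts characters back from the start of the
    input string. The scan runs from the newest buffer position to the
    oldest and keeps a candidate only if it is strictly longer, so the
    nearest of several equally long matches is reported (an assumption).
    """
    best_dist, best_len = 0, 0
    for start in range(len(window) - 1, -1, -1):
        length = 0
        while (start + length < len(window)
               and length < len(lookahead)
               and window[start + length] == lookahead[length]):
            length += 1
        if length > best_len:
            best_dist, best_len = len(window) - start, length
    return best_dist, best_len


def slide(window: str, consumed: str) -> str:
    """Shift the buffer by appending the coded characters and dropping the oldest."""
    return (window + consumed)[-len(window):]


window = "defabqaaaacabcde"      # character string 2a of FIG. 1
text = "abcdaaaq"                # start of the input character string 2b

token1 = longest_match(window, text)
window2 = slide(window, text[:token1[1]])
token2 = longest_match(window2, text[token1[1]:])
print(token1, window2, token2)
```

Under these assumptions the sketch yields the codewords (5, 4) and (13, 3) and the shifted buffer content “bqaaaacabcdeabcd”, in agreement with the description of FIG. 1.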
According to the LZ77 compression method, as the coding of the input character string proceeds, the encoder shifts the sliding buffer in this way. Therefore, the LZ77 compression method is also referred to as a sliding dictionary method.
If the capacity of the sliding buffer used in such an LZ77 compression method is increased, the length of a character string found as the longest match increases. Consequently, the compression ratio is enhanced. However, as a result of the increase in the capacity of the sliding buffer, the encoder should search for an enormous number of combinations of character strings. Thus, in the case of sequentially searching the sliding buffer, the search requires a great deal of effort and time. Therefore, the LZ77 compression method is performed by actually adopting the following process. Namely, a character string (namely, a prefix) consisting of first two to four characters of an input character string and the occurrence position of the prefix are entered into a table as occasion demands, and then the prefix of the input character string is collated with the character list entered into the table, instead of collating all kinds of character strings of the sliding buffer with the input character string. The time required for such search is significantly reduced by employing this process.
The tables used for such a search are a look-up table and a hash table. A method of using a look-up table is to make a character string 2d to be searched for have a one-to-one correspondence to an address in a look-up table 3, as illustrated in FIG. 2. The past occurrence position (namely, the relative address) of the character string is stored at a corresponding address in the look-up table 3. Thus, according to this method, the past occurrence position of the character string “ab” 2d is known by looking up the character string “ab” 2d in the table 3 once to search for this string. Therefore, this method has an advantage in that the search is achieved at a very high speed.
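The look-up table method can be illustrated with a minimal Python sketch for two-character strings. The names `address` and `build_table` are hypothetical, and the sketch stores the absolute position of the most recent occurrence rather than a relative address, which is a simplification assumed here for brevity.

```python
PREFIX_LEN = 2                    # characters per searched string
TABLE_SIZE = 256 ** PREFIX_LEN    # one address per possible two-character string
EMPTY = -1                        # marks an address with no recorded occurrence


def address(data: bytes, pos: int) -> int:
    """Use the 16 bits of the two characters directly as the table address."""
    return (data[pos] << 8) | data[pos + 1]


def build_table(data: bytes) -> list:
    """Record, for every two-character string, its most recent position."""
    table = [EMPTY] * TABLE_SIZE
    for pos in range(len(data) - PREFIX_LEN + 1):
        table[address(data, pos)] = pos
    return table


data = b"defabqaaaacabcde"
table = build_table(data)

# A single table access answers the query; no character comparison is needed.
last_ab = table[address(b"ab", 0)]
last_zz = table[address(b"zz", 0)]
print(last_ab, last_zz)
```

Because every string has its own address, a hit never has to be verified against the buffer, which is the speed advantage noted above.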
However, in the case that the character string to be searched for is long, the number of combinations of character strings increases exponentially with the number of characters. Thus, the look-up table should have an enormous number of addresses. Therefore, this method has a drawback in that a very large amount of memory is needed so as to allocate such an enormous number of addresses to the look-up table. For example, in the case that the number of characters is 1 (incidentally, 1 character consists of 8 bits), 2^(8*1) (=256 bits) of memory are needed. Further, in the case that the number of characters is 2, 2^(8*2) (=64 kbits) of memory are needed. Moreover, in the case that the number of characters is 3, 2^(8*3) (=16 Mbits) of memory are needed. Therefore, the actual limit to the number of characters is 2. Additionally, this method has another drawback in that, when a character string to be searched for is long, only a small part of the look-up table is actually used (namely, only a small part of the memory area assigned to the look-up table is used for entering the past occurrence positions of the character string into this table), so that the look-up table is in a sparse state and the efficiency of use of the look-up table is low.
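The growth of the look-up table can be checked with a short Python calculation. The function name `lookup_table_addresses` is hypothetical; the sketch only counts addresses and makes no assumption about the width of each table entry.

```python
def lookup_table_addresses(num_chars: int, bits_per_char: int = 8) -> int:
    """Number of addresses needed when the whole bit string is the address."""
    return 2 ** (bits_per_char * num_chars)


sizes = {n: lookup_table_addresses(n) for n in (1, 2, 3, 4)}
for n, size in sizes.items():
    print(f"{n} character(s): {size} addresses")

# A 16-character sliding buffer contains at most 14 distinct
# three-character strings, so almost all addresses stay unused.
max_used = 16 - 3 + 1
print(f"occupancy for 3 characters: at most {max_used} of {sizes[3]}")
```

The counts 256, 65,536 (64 k) and 16,777,216 (16 M) correspond to the figures given above for one, two and three characters, and the occupancy line illustrates the sparse state of the table.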
In the case of a method of searching for a character string by using a hash table, as illustrated in FIG. 3, masking processing is performed on a codeword, which corresponds to a character string to be searched for, in such a way as to decrease the number of bits of the codeword (namely, the degeneration of the codeword is performed). Thus, a hash code 6 is generated (see 4) so that a plurality of character strings having a common degenerated state share an area of the hash table 5. Consequently, as compared with the method of searching for a character string using a look-up table, this method can search for a longer character string with a table of the same size.
However, in the case of the method using the hash table obtained in this way, the degeneration is performed on character strings to be searched for. Thus, there occurs a problem (what is called a collision or conflict problem) that a character string “abc” 2d and another character string, which have a common degenerated state, may be entered into the same area 7 provided in the hash table.
To solve this collision problem, this method further requires an additional operation of collating a character string found by the search with each of characters of a character string to be searched for and checking whether or not the searched or found character string is the character string to be actually searched for.
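The hash table method and its collating operation can be sketched in Python as follows. The particular mask (keeping the lower 4 bits of each of three characters to form a 12-bit hash code) and the names `hash_code`, `register` and `find` are assumptions made for illustration; FIG. 3 is not reproduced here and may use a different degeneration.

```python
EMPTY = -1
HASH_BITS = 12
HASH_SIZE = 1 << HASH_BITS


def hash_code(s: bytes) -> int:
    """Degenerate a 24-bit string into 12 bits by masking each character."""
    code = 0
    for byte in s[:3]:
        code = (code << 4) | (byte & 0x0F)   # keep only the lower 4 bits
    return code


def register(table: list, data: bytes, pos: int) -> None:
    """Enter the position of the three-character string starting at pos."""
    table[hash_code(data[pos:pos + 3])] = pos


def find(table: list, data: bytes, target: bytes) -> int:
    """Look up target, then collate character by character to reject collisions."""
    pos = table[hash_code(target)]
    if pos == EMPTY:
        return EMPTY
    # Additional collating operation required by the hash table method.
    if data[pos:pos + 3] == target:
        return pos
    return EMPTY


data = b"xxqbcxx"
table = [EMPTY] * HASH_SIZE
register(table, data, 2)                      # enters "qbc"

collide = hash_code(b"abc") == hash_code(b"qbc")
found_qbc = find(table, data, b"qbc")
found_abc = find(table, data, b"abc")
print(collide, found_qbc, found_abc)
```

With this mask, “abc” and “qbc” share one area of the table, so only the character-by-character comparison inside `find` prevents “abc” from being reported at the position of “qbc”.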
As described above, in the case of the aforementioned method of searching for a character string by using the look-up table according to the LZ77 compression scheme, a character string to be searched for can be found at a high speed by looking it up in the look-up table only once. However, the table size of the look-up table increases exponentially with an increase in the number of characters of the character string to be searched for. Thus, the look-up table has an enormous table size, and the number of characters to be used for the search cannot be large (actually, the limit to the number of characters is 2). Consequently, this method using the look-up table has a drawback in that the compression ratio cannot have a very high value.
On the other hand, in the case of the aforementioned method of searching for a character string by using a hash table, the table size of a necessary table is smaller, as compared with the method of searching for a character string by using the look-up table. Thus, although the search is achieved at a high speed by using the table of a reasonable table size, this method using the hash table has a drawback in that an additional collating operation is still needed to solve the collision problem. Incidentally, this collating operation should be performed on each of the characters of the character string to be searched for and requires a great deal of effort.
The present invention aims at solving such drawbacks of the prior art. Accordingly, an object of the present invention is to provide a data compression method, and a data compression apparatus, which can search for a character string by using a table, whose table size is substantially equal to a table size of the method of searching for a character string by using a hash table, even when the character string is a long character string consisting of three or four characters, without performing a collating operation on a character string to be searched for, so as to prevent an occurrence of a collision problem.
To achieve the foregoing object, according to a first aspect of the present invention, there is provided a data compression method for generating compressed data by performing a compression process on an uncompressed data string, which comprises the steps of setting a plurality of consecutive characters, which are contained in the uncompressed data string, as a character string to be searched, allocating bits of a bit string representing the aforesaid character string to at least two codewords to thereby generate first and second searching codewords, obtaining first and second array contents from first and second array tables, in which information on past occurrence positions of character strings is previously stored, by using the aforesaid first and second searching codewords as array addresses, collating the obtained first and second array contents with each other, and obtaining past occurrence position information corresponding to the aforesaid character string according to the aforesaid first and second array contents when the first and second array contents match with each other.
According to a second aspect of the present invention, there is provided a data compression method for generating compressed data by performing a compression process on an uncompressed data string, which comprises the steps of setting a plurality of consecutive characters, which are contained in the uncompressed data string, as a character string to be searched, allocating bits of a bit string representing the aforesaid character string to two codewords to thereby generate a first searching codeword and a second searching codeword that is a complementary codeword to the aforesaid first searching codeword, obtaining an array content from a first array table, in which the aforesaid second codeword relating to past occurred character strings is previously stored, by using the aforesaid first searching codeword relating to the set character string to be searched at present as an array address, collating the obtained array content with the aforesaid second codeword, and obtaining information on past occurrence positions of the aforesaid set character string from a second array table, in which past occurrence positions of character strings are previously entered, by using the aforesaid first codeword as an array address when the array content matches with the aforesaid second codeword.
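The second aspect can likewise be sketched in Python under stated assumptions: a three-character string is split into an upper 12-bit first codeword and a lower 12-bit complementary second codeword, the first array table (`cw_table`, a hypothetical name) stores the second codeword of the most recent string registered at each address, and the second array table (`pos_table`, also hypothetical) stores its absolute position.

```python
EMPTY = -1
HALF_BITS = 12
HALF_SIZE = 1 << HALF_BITS


def split_codewords(s: bytes) -> tuple:
    """Split 24 bits into a first codeword and its complementary second codeword."""
    bits = int.from_bytes(s[:3], "big")
    return bits >> HALF_BITS, bits & (HALF_SIZE - 1)


def register(cw_table: list, pos_table: list, data: bytes, pos: int) -> None:
    """At the address given by the first codeword, store the second codeword and the position."""
    first, second = split_codewords(data[pos:pos + 3])
    cw_table[first] = second
    pos_table[first] = pos


def search(cw_table: list, pos_table: list, s: bytes) -> int:
    """Collate the stored second codeword with the current one, then read the position."""
    first, second = split_codewords(s)
    if cw_table[first] == second:
        return pos_table[first]
    return EMPTY


data = b"defabqaaaacabcde"
cw_table = [EMPTY] * HALF_SIZE
pos_table = [EMPTY] * HALF_SIZE
for pos in range(len(data) - 2):
    register(cw_table, pos_table, data, pos)

hit_abc = search(cw_table, pos_table, b"abc")
hit_cde = search(cw_table, pos_table, b"cde")
lost_aaa = search(cw_table, pos_table, b"aaa")
print(hit_abc, hit_cde, lost_aaa)
```

Since the address supplies the first codeword and the array content supplies the second, a match of one 12-bit value confirms all 24 bits of the string. In this sketch “aaa” is not found although it occurred, because “abc” shares its first codeword and was registered later at the same address.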
According to a third aspect of the present invention, there is provided a data compression method for generating compressed data by performing a compression process on an uncompressed data string, which comprises the steps of setting a plurality of consecutive characters, which are contained in the uncompressed data string, as a character string to be searched, allocating bits of a bit string representing the aforesaid character string to two codewords to thereby generate a first searching codeword and a second searching codeword that is a complementary codeword to the aforesaid first searching codeword, obtaining a plurality of codewords, whose starting point is the aforesaid first codeword, by performing an operation on the aforesaid first codeword, obtaining a plurality of array contents from a first array table, in which the aforesaid second codeword relating to past occurred character strings is previously stored, by using the aforesaid obtained plurality of codewords relating to the set character string to be searched at present as array addresses, collating the obtained plurality of array contents with the aforesaid second codeword, and obtaining information on past occurrence positions of the aforesaid set character string from a second array table, in which past occurrence positions of character strings are previously entered, by using the aforesaid codewords obtained by the aforesaid operation as an array address when the array content matches with the aforesaid second codeword.
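The third aspect does not specify here which operation derives the plurality of codewords from the first codeword. The following Python sketch assumes one simple choice: adding a fixed offset of one table region per derived codeword, so that each first codeword owns `WAYS` addresses and the newest entry displaces older ones. The names `derived_addresses`, `register` and `search`, the value of `WAYS`, and the replacement order are all hypothetical.

```python
EMPTY = -1
HALF_BITS = 12
HALF_SIZE = 1 << HALF_BITS
WAYS = 2                          # assumed number of derived codewords


def split_codewords(s: bytes) -> tuple:
    """Split 24 bits into a first codeword and its complementary second codeword."""
    bits = int.from_bytes(s[:3], "big")
    return bits >> HALF_BITS, bits & (HALF_SIZE - 1)


def derived_addresses(first: int) -> list:
    """Assumed operation: the first codeword plus a fixed offset per derived codeword."""
    return [first + k * HALF_SIZE for k in range(WAYS)]


def register(cw_table: list, pos_table: list, data: bytes, pos: int) -> None:
    """Put the newest entry at the first address and push older entries onward."""
    first, second = split_codewords(data[pos:pos + 3])
    addrs = derived_addresses(first)
    for k in range(WAYS - 1, 0, -1):
        cw_table[addrs[k]] = cw_table[addrs[k - 1]]
        pos_table[addrs[k]] = pos_table[addrs[k - 1]]
    cw_table[addrs[0]] = second
    pos_table[addrs[0]] = pos


def search(cw_table: list, pos_table: list, s: bytes) -> int:
    """Collate each derived address's content with the second codeword."""
    first, second = split_codewords(s)
    for addr in derived_addresses(first):
        if cw_table[addr] == second:
            return pos_table[addr]
    return EMPTY


data = b"defabqaaaacabcde"
cw_table = [EMPTY] * (WAYS * HALF_SIZE)
pos_table = [EMPTY] * (WAYS * HALF_SIZE)
for pos in range(len(data) - 2):
    register(cw_table, pos_table, data, pos)

hit_abc = search(cw_table, pos_table, b"abc")
hit_aca = search(cw_table, pos_table, b"aca")
print(hit_abc, hit_aca)
```

In this sketch “abc” and “aca” share a first codeword, yet both are found because each occupies one of the derived addresses. The fixed-offset operation was chosen so that every derived address still maps back to exactly one first codeword, which keeps a match on the second codeword unambiguous; other operations may require additional care.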
In the case of the data compression method according to each of the aforementioned aspects of the present invention, the bits of a bit string representing the character string to be searched are allocated to at least two codewords, and the table is looked up correspondingly to each of the codewords. Further, results of the look-up of the tables are collated with each other. Thus, it is checked whether or not the character string occurred in the past and whether or not information on the past occurrence position is entered in the table. Therefore, as compared with the case that the bit string representing the character string itself is used as the addresses (namely, the case of using the look-up table), the size of the table, which is necessary for looking up the table, is significantly reduced by allocating the character string to at least two codewords and constituting addresses.
Hence, even when a relatively long character string (for instance, a character string having 3 or 4 characters) is set, necessary memory does not increase very much, as compared with the conventional methods. Moreover, only an operation of checking the contents of the arrays is employed as the operation of checking the match after the look-up of the character string to be searched for. Thus, the amount of work is considerably reduced, as compared with the method using a hash table, which requires checking a match corresponding to each of the characters of the character string to be searched for.