1. Field of the Invention
The present invention relates to an apparatus for compressing data strings to be compressed and a method thereof using a dictionary made by the data strings. Data to be compressed is handled by a unit. The unit may be bits, characters, words, pixels, gray levels, or some other appropriate unit. The present invention can be applied to the compression of various types of data and is not limited to the compression of a character code. In the following, however, we call the unit as a character based on information theory.
2. Description of the Related Art
The data amount being handled has been increasing as various types of data such as a character code, image data, etc., is being handled by computer in recent years. When such a great amount of data is handled, the required storage capacity can be decreased or the data can be transmitted to a remote place at high speed by omitting a redundant part of the data and compressing the remaining data.
As a conventional data compression technology, two technologies are known such as a dictionary model encoding technology using the repetition of a data string and a probability statistical model encoding technology using the appearance frequency of a data string. As the representative method of the dictionary model, LZ77 encoding and LZ78 encoding technologies are well known (xe2x80x9cDocument data compression algorithm introductionxe2x80x9d, by Tomohiko Uematsu, CQ publication, pp. 131-208, 1995). Since the LZ77 encoding technology can obtain a sufficient compression ratio with an easy process compared with the LZ78 encoding technology, the LZ77 encoding technology has become a mainstream when these technologies are actually used.
As shown in FIG. 1A, a slide buffer 1 with a regular size is provided in the LZ77 encoding technology. The buffer 1 retrieves a character string that matches an input character string in the longest length in the buffer 1, and encodes the input character string using the location and length. This encoding method is called as slide dictionary law, since the buffer 1 is sliding, as the encoding process proceeds.
When an input character string xe2x80x9cabcdaaaq . . . xe2x80x9d right next to the buffer 1 is encoded in FIG. 1A, xe2x80x9cabcdxe2x80x9d is the longest matching character string among the character strings matched in the buffer 1. Thereupon, a relative address xe2x80x9c5 (bytes)xe2x80x9d of the nearest location of the longest matching character string and the input character string (the matching location A against the input location B=the fifth character to the left of the input location B) is set as a matching location. Then, a code such as (matching location, matching length)=(5,4) is generated while setting a length xe2x80x9c4 (bytes)xe2x80x9d of the longest matching character string as a matching length. In this way, xe2x80x9cabcdxe2x80x9d of the first part of the input character string is replaced with (5,4). Similarly, the next character string xe2x80x9caaaxe2x80x9d is replaced with a code (13,3).
However, a slide buffer that is actually used is much longer, and when character strings inside the buffer are sequentially retrieved, a long time is required in order to detect the longest matching character string. In fact, all the character strings inside the buffer are not compared with the input character string, but the appearance locations of the prefixes (about 2 to 4 characters) of character strings are registered in a table. Then, the input string is compared with a character string positioned at the location of the table where the character string is stored. A Look-Up Table (LUT) and a Hash Table are used for such a retrieval process.
FIG. 1B shows a character string retrieval process using the LUT. LUT2 of FIG. 1B sets the prefix of a character string inside the buffer 1 as an address, and stores the location (address or pointer) of the character string in the buffer 1. At the time of the retrieval process, the area of the LUT2 is accessed by setting the prefix of an input character string as an address, and the location of the corresponding character string is obtained.
In the case that a plurality of character strings with the same prefixes exist in the buffer 1, a plurality of appearance locations are stored in the form of a linked list 3. Therefore, locations of all the corresponding character strings stored in the buffer 1 can be obtained only by accessing the LUT2 one time. Here, the prefix of two characters is used, and the area of the LUT2 corresponding to the prefix xe2x80x9cabxe2x80x9d of the input character string stores two appearance locations, utilizing the linked list 3.
Thus, since the LUT corresponds to one character string to be retrieved, to one area of the table, and can obtain required information only by looking up a table one time, the LUT can perform a retrieval process at high speed. However, in the case of the retrieval process of a long character string, the number of the areas required for the table increases with n-th power of character strings which appear. Accordingly, the area required becomes large. If the number of characters which appear is 28=256, 256n areas are required for the prefixes of n characters.
When the character string to be retrieved becomes long, however, an actually used (registered) area remains only part of the prepared areas, and the inside of the table becomes thinned. Therefore, in the case of the retrieval process of a long character, the memory efficiency deteriorates.
Thereupon, a character string to be retrieved is degenerated in the hash table, so that a plurality of character strings are made to share one area. Therefore, after the table is looked up, it should be checked whether the thus-obtained character string is actually retrieved. In comparison with the LUT, the hash table can retrieve a longer character string using the same table area as that of the LUT.
FIG. 1C shows a character string retrieval process using the hash table. A hash code generation unit 4 of FIG. 1C generates a hash code 5 from a prefix xe2x80x9cabcxe2x80x9d of the input character string, and it accesses a hash table 6 while setting the code as an address. In the hash table 6, the location inside the buffer 1 corresponding to the hash code 5 is stored in the hash table 6. By comparing a character string xe2x80x9cabcdexe2x80x9d at the location with the input character string, it is checked whether the prefixes of the two strings match each other. When the two strings match each other, it is judged that a character string in accord with the input character string exists in the buffer 1.
In the case of the hash table, too, a plurality of appearance locations are stored in the form of a linked list for a plurality of character strings with the same prefix in the buffer 1. In either case, the liked list is used to retrieve a longest matching character string.
However, there is the following problem in the above-mentioned conventional data compression technology.
In the case that a long matching character string is retrieved by using the LUT, the inside of the table becomes thinned as mentioned above, since only part of the table is used even if a table with a large area is prepared. In a hash table, the table size becomes smaller relative to that of the LUT. However, the inside of the table becomes thinned, if there is little input data. Therefore, as for the hash table, there is the problem that a memory is not used effectively.
Further, there is another problem that since a plurality of appearance locations stored in a linked list should be traced one by one in the case of a retrieval process of the longest matching character string, a longer retrieval time is required as the number of character strings with the same prefix increases.
The object of the present invention is to offer a data compression apparatus and a string retrieval apparatus with a reasonable memory capacity corresponding to an input data amount, thereby effectively performing the longest match retrieval process in a data compression process based on the dictionary model encoding technology.
A compression apparatus of the present invention is provided with a data storage device, sort device, appearance location storage device, detection device, and encoding device.
The data storage device stores character string data to be compressed. The sort device rearranges each sorted character string of which a start point is each of addresses stored in the data storage device, on the basis of the contents of each character string. The appearance location storage device stores address information that shows the address of each character string in the order of the rearranged character strings. The detection device detects a repetitive character string on the basis of the address information stored in the appearance location storage device. The encoding device encodes the detected repetitive character string and outputs the encoded character string.