1. Field of the Invention
The present invention relates to a data compression method utilizing a slide dictionary, and more particularly to a data compression method suited to compressing a text document, a program and the like, and to a compressed data transmitting method using the same.
2. Description of the Related Art
Data compression methods include lossless methods and lossy methods. A lossless method is a reversible compression method capable of completely restoring the original data and is mainly used to compress a text document, a program and the like. A lossy method is an irreversible compression method and is used to compress an image, voice and a moving image. Data compression is used in order to reduce the amount of data to be transmitted in data communication.
As one of the lossless compression methods, a data compression method using a slide dictionary is known. This data compression method searches a previously appeared data series for the longest partial string that matches the data to be compressed and outputs the location of the partial string and the matched length as codes. The previously appeared data series is stored in a dictionary. Since the detection range of this dictionary slides during compression, the dictionary is generally called a slide dictionary.
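The slide dictionary search described above can be sketched as follows. This is a minimal illustration only, not the encoder of any cited patent: the function names `compress` and `decompress`, the 0-based positions, the window size and the minimum match length are all assumptions made for the example.

```python
def compress(data: str, window: int = 4096, min_len: int = 3):
    """Return tokens: a literal character or a (position, length) pair.

    Positions are 0-based offsets into the data (an assumption of this sketch).
    """
    out = []
    i = 0
    while i < len(data):
        best_len, best_pos = 0, 0
        # Only the last `window` characters are searched; this range
        # slides forward as i advances.
        for j in range(max(0, i - window), i):
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len >= min_len:
            out.append((best_pos, best_len))  # code: location and matched length
            i += best_len
        else:
            out.append(data[i])               # too short to be worth a code
            i += 1
    return out


def decompress(tokens):
    """Rebuild the original string from the tokens produced by compress()."""
    buf = []
    for t in tokens:
        if isinstance(t, tuple):
            pos, length = t
            for k in range(length):           # copy one character at a time so
                buf.append(buf[pos + k])      # overlapping matches still work
        else:
            buf.append(t)
    return "".join(buf)


tokens = compress("abcabcabc")
print(tokens)              # ['a', 'b', 'c', (0, 6)]
print(decompress(tokens))  # abcabcabc
```

The brute-force search is quadratic and is used here only for clarity; practical encoders index the window, for example with a hash table.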
The recent spread of mobile terminals is remarkable. In services for mobile terminals, small amounts of data are frequently communicated. For example, the amount of data of an HTTP request from a mobile terminal to a server (upstream data) is only approximately 1 kilobyte (KB). In data exchange between a mobile terminal and a server, and in wireless communication by an RFID tag or the like, there is a strong tendency for data with similar contents, such as header information, to be repeated frequently in a series of data exchanges.
In the conventional data compression method utilizing a slide dictionary, partial strings that previously appeared are registered in a dictionary (that is, learned by the dictionary). Generally, approximately 8 KB of data must be read in order to complete a dictionary. Accordingly, if the amount of data is small, registration (learning) sufficient for compression cannot be performed, and therefore a sufficient compression ratio cannot be obtained.
In order to solve this problem, this applicant has proposed a data compression method for improving the compression ratio by registering frequently appearing character strings in a dictionary in advance as an initial value prior to compression and matching data to be compressed against the initial value in the dictionary (Japanese Patent Application No. H5-241777). According to this data compression method, the compression ratio of a character string registered in the dictionary as an initial value can be improved, since the character string can be compressed even when it first appears.
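The effect of an initial value on small data can be illustrated with the preset dictionary feature of DEFLATE, which is an analogous but different mechanism from the method of the cited application. The sketch below uses the `zdict` parameter of Python's standard `zlib` module; the header text, the host name and the helper names `deflate` and `inflate` are hypothetical and chosen only for the example.

```python
import zlib

# Hypothetical initial value: header text expected to recur in requests.
PRESET = (b"GET / HTTP/1.1\r\nHost: example.com\r\n"
          b"User-Agent: ExampleMobile/1.0\r\nAccept: text/html\r\n"
          b"Connection: keep-alive\r\n\r\n")

# Hypothetical small upstream request, far below 8 KB.
REQUEST = (b"GET /news HTTP/1.1\r\nHost: example.com\r\n"
           b"User-Agent: ExampleMobile/1.0\r\nAccept: text/html\r\n"
           b"Connection: keep-alive\r\n\r\n")


def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress data, optionally priming the window with a preset dictionary."""
    c = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return c.compress(data) + c.flush()


def inflate(blob: bytes, zdict: bytes = b"") -> bytes:
    """Decompress data; the same preset dictionary must be supplied."""
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()


plain = deflate(REQUEST)
primed = deflate(REQUEST, PRESET)
print(len(REQUEST), len(plain), len(primed))
assert inflate(primed, PRESET) == REQUEST
```

Because both sides must hold the same initial value, the dictionary has to be agreed on before any data is exchanged.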
FIGS. 1A, 1B and 1C explain the method for registering an initial value in a dictionary which is disclosed by the Japanese Patent Application No. H5-241777.
FIG. 1A shows, using a tree structure, the types of character strings existing in sample data for generating an initial value. The characters “a”, “b”, “c” and “d” at the nodes of the tree structure shown in FIG. 1A indicate characters in the sample data, and the number in the rectangle under each character indicates the appearance frequency of that character in each character string.
When character strings whose appearance frequency is equal to or more than a prescribed threshold of 2 are extracted with reference to the tree structure of FIG. 1A, five character strings, “aaa”, “abc”, “bb”, “cc” and “d”, are obtained as shown in FIG. 1B. These five character strings are registered in the dictionary 1000 as initial values (see FIG. 1C).
In this way, by registering in advance character strings with high appearance frequency in a dictionary, based on sample data, a data compression ratio can be improved.
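The threshold-based extraction can be sketched as follows. The contents of FIG. 1A are not reproduced here: the sample strings below are hypothetical and were chosen only so that the result matches the five character strings named above. The counting rule (each prefix of a sample counts once) and the rule of keeping only the longest prefix that still meets the threshold are likewise assumptions of this sketch.

```python
from collections import defaultdict


def frequent_strings(samples, threshold=2):
    """Return the longest prefixes whose appearance frequency meets the threshold."""
    counts = defaultdict(int)
    for s in samples:
        for n in range(1, len(s) + 1):
            counts[s[:n]] += 1        # each prefix corresponds to one tree node

    result = []
    for prefix, count in counts.items():
        if count < threshold:
            continue
        # Keep the node only if no child node also meets the threshold.
        has_frequent_child = any(
            len(other) == len(prefix) + 1
            and other.startswith(prefix)
            and other_count >= threshold
            for other, other_count in counts.items()
        )
        if not has_frequent_child:
            result.append(prefix)
    return sorted(result)


# Hypothetical sample data, not taken from FIG. 1A.
samples = ["aaa", "aaa", "abc", "abc", "abd",
           "bb", "bb", "bc", "cc", "cc", "d", "d", "da"]
print(frequent_strings(samples))  # ['aaa', 'abc', 'bb', 'cc', 'd']
```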
This applicant has also proposed the super lossless data compression (SLC) method shown in FIG. 2 (Japanese Patent No. 3541930 and U.S. Pat. No. 6,320,522 B1).
In the SLC method, a hash table is used as a dictionary 2001. An arbitrary number of characters (three characters in this case) at the top of an already appeared character string in data to be compressed 2000 are converted into a hash value by a hash function 2002, and the hash value and the length of the already appeared character string (character string length) are registered in the dictionary 2001. A serial number, starting from 1 at the top, is assigned to the character string of the data to be compressed 2000 as an appearance position. Repeatedly appearing character strings are checked by sliding a sliding window 2005, and a character string that coincides with an already appeared character string is encoded into a code (appearance position, length). In this case, the appearance position is that of the already appeared character string registered in the dictionary 2001, and it is read from the dictionary 2001 using the hash value as a key.
FIG. 2 shows an example where the data to be compressed 2000 is “compression&decompression . . . ” and the character string “compression”, which appears twice in this data, is encoded into a code (1, 11). The hash value of the leading three characters “com” of “compression” is i, and the appearance position (=1) corresponding to the hash value i is read from the dictionary 2001.
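The example of FIG. 2 can be approximated with the following sketch of a hash-indexed dictionary. It is not the SLC implementation of the cited patents: the hash function `hash3`, the table size, the minimum match length and the policy of keeping the first appearance position for each hash value are all assumptions made for illustration. In particular, the table here stores only the appearance position, and the match length is found by direct comparison.

```python
def hash3(s: str, size: int = 4096) -> int:
    """Hypothetical hash of the three leading characters."""
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % size
    return h


def slc_encode(data: str, min_len: int = 3):
    """Encode data as literals and (appearance position, length) codes.

    Appearance positions are 1-based, as in the description of FIG. 2.
    """
    table = {}   # hash value -> appearance position (first occurrence kept)
    out = []
    i, n = 0, len(data)
    while i < n:
        pos, length = None, 0
        if i + 3 <= n:
            pos = table.get(hash3(data[i:i + 3]))
            if pos is not None:
                j = pos - 1
                # Verify by comparison; a hash hit alone may be a collision.
                while i + length < n and data[j + length] == data[i + length]:
                    length += 1
        if pos is not None and length >= min_len:
            out.append((pos, length))
            step = length
        else:
            out.append(data[i])
            step = 1
        # Register every position just consumed in the dictionary.
        for p in range(i, i + step):
            if p + 3 <= n:
                table.setdefault(hash3(data[p:p + 3]), p + 1)
        i += step
    return out


codes = slc_encode("compression&decompression")
print(codes[-1])  # (1, 11)
```

The second “compression” starts at appearance position 15, where the lookup of “com” returns position 1 and the comparison extends the match to 11 characters.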
The prior art shown in FIGS. 1A-1C improves data compression efficiency by registering frequently appearing character strings as initial values before data is compressed. However, since each character string (short sentence) whose appearance frequency is equal to or more than a prescribed threshold is registered in the dictionary as is, without further processing, the size of the initial value becomes large.
The prior art shown in FIG. 2 converts an arbitrary number of the leading characters of an already appeared character string into a hash value in order to detect the character string in the dictionary, and registers the hash value together with the appearance position of the already appeared character string in the dictionary. However, in the dictionary (hash table), only one piece of appearance position information can be registered for one hash value. The initial value character strings may include different character strings whose hash values happen to be the same. If such a hash value collision occurs, an initial value registered in the dictionary is overwritten by a later initial value with the same hash value, and the previously registered initial value can no longer be used.
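The overwrite problem can be reproduced with a deliberately weak hash. In the sketch below, the hash function `toy_hash`, the table size and the two initial value strings are hypothetical; the character string is stored next to its appearance position only so that the lost entry is visible.

```python
def toy_hash(s: str, size: int = 16) -> int:
    """Hypothetical hash: sum of the three leading character codes."""
    return sum(ord(c) for c in s[:3]) % size


def register_initial_values(strings, size: int = 16):
    """Register initial values in a table that holds one entry per hash value."""
    table = [None] * size
    for position, s in enumerate(strings, start=1):
        table[toy_hash(s, size)] = (position, s)   # a later entry overwrites
    return table


# "abc" and "cab" contain the same characters, so their hash values collide.
table = register_initial_values(["abc", "cab"])
slot = toy_hash("abc")
print(toy_hash("abc") == toy_hash("cab"))  # True
print(table[slot])                          # (2, 'cab')
```

After registration, a lookup for “abc” finds only the entry for “cab”, so the first initial value contributes nothing to compression.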