The present invention relates to a technique for increasing the speed of data compression. More specifically, it relates to a technique for performing data compression by applying a hash function to a selected part of a character string to calculate a hash value, using the hash value to search the entries of the bucket chain previously registered in a hash table under that hash value, and finding a longest matching character string.
Files have hitherto been compressed using, for example, zip, LHA, gzip, bzip2, or LZMA (Lempel-Ziv-Markov chain Algorithm). bzip2 uses a method called block sorting to achieve a high compression rate. In contrast, zip, LHA, and gzip combine LZ77 coding with Huffman coding. LZ77 coding is a dictionary-based coding method, in which an input character string (also called a symbol string) is registered in a dictionary and encoding is performed using the dictionary.
Dictionary-based coding methods include a static dictionary method and an adaptive dictionary method (also called a dynamic dictionary method). In the static dictionary method, a dictionary is compiled prior to encoding, and encoding is performed based on that dictionary. The static dictionary method requires the same dictionary to be available for both encoding and decoding. If the dictionary for decoding is instead attached to the file, a significant decrease in the compression rate is inevitable.
On the other hand, in the adaptive dictionary method, a dictionary is not prepared beforehand and instead is compiled while a file (input stream) is being read. Then, when a character string already registered in the dictionary appears, the character string is converted into a position index in the dictionary for compression. In the adaptive dictionary method, the dictionary is empty in the beginning and thus a character string cannot be compressed at an initial stage. However, as file reading proceeds, a sufficient number of character strings are registered in the dictionary, and therefore a high compression rate of the file can be achieved.
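The behavior described above can be illustrated with a minimal sketch. It assumes an LZ78-style scheme as one possible adaptive dictionary method; the function names `adaptive_encode` and `adaptive_decode` and the (index, character) output format are hypothetical choices made for illustration and are not taken from any of the cited documents.

```python
def adaptive_encode(text):
    """Encode text as (dictionary index, next character) pairs.

    Assumption: LZ78-style scheme. Index 0 means "no dictionary entry",
    which is what every pair looks like while the dictionary is still empty.
    """
    dictionary = {}   # phrase -> 1-based index; empty at the beginning
    output = []
    phrase = ""
    for ch in text:
        candidate = phrase + ch
        if candidate in dictionary:
            # Already registered: keep extending the current phrase.
            phrase = candidate
        else:
            # Emit the index of the known prefix plus the new character,
            # then register the extended phrase in the dictionary.
            output.append((dictionary.get(phrase, 0), ch))
            dictionary[candidate] = len(dictionary) + 1
            phrase = ""
    if phrase:
        # Input ended inside a registered phrase: emit its index alone.
        output.append((dictionary[phrase], ""))
    return output


def adaptive_decode(pairs):
    """Rebuild the text; the decoder compiles the same dictionary on the fly."""
    phrases = [""]    # index 0 is the empty phrase
    pieces = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        pieces.append(phrase)
        phrases.append(phrase)
    return "".join(pieces)


if __name__ == "__main__":
    pairs = adaptive_encode("abababab")
    print(pairs)
    print(adaptive_decode(pairs))
```

In this sketch the first pairs carry index 0 because nothing has been registered yet, while later pairs refer to progressively longer registered phrases. No dictionary needs to be attached to the output, since the decoder reconstructs it from the pairs themselves.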
Known adaptive dictionary methods include, for example, RLE, BPE, Deflate, and LZ coding (Ziv-Lempel coding). Known forms of LZ coding include, for example, LZ77, LZ78, LZSS, LZW, LZML, LZO, LZMA, LZX, LZRW, LZJB, LZT, and ROLZ.
Among the above-mentioned adaptive dictionary methods, LZ coding is the best known. LZ coding is roughly categorized into LZ77 coding (developed in 1977) and LZ78 coding (developed in 1978), which differ in the way the dictionary is compiled. In LZ77 coding, the dictionary is compiled in accordance with a sliding dictionary method, while in LZ78 coding, the dictionary is compiled in accordance with a dynamic dictionary method.
LZ77 coding has many variations, among which LZSS coding is widely used. LZSS coding uses a sliding window and a longest matching method. An implementation of LZSS coding searches the reference part of the sliding window for a longest matching sequence, and uses a hash table to reduce the time required for this search. A character string is registered in the hash table by applying a hash function to a predetermined number of characters from the beginning of the input character string to obtain a hash value, and then storing the input character string (more precisely, a pointer to the character string) in the hash table under that hash value. Thus, in LZSS coding, a dictionary is compiled by calculating a hash value for each character string while sliding along the input character string, and at the same time, a longest matching sequence that matches a character string previously registered in the dictionary is identified.
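The hash-based search described above can be sketched as follows. This is an illustrative sketch only, not the implementation of any cited document: the constants `MIN_MATCH`, `WINDOW_SIZE`, and `MAX_MATCH`, the function `hash_prefix`, and the token format (a literal byte value, or a (distance, length) pair) are all assumptions introduced for the example.

```python
from collections import defaultdict

MIN_MATCH = 3        # assumed number of leading characters that are hashed
WINDOW_SIZE = 4096   # assumed size of the sliding window
MAX_MATCH = 18       # assumed upper limit on the match length


def hash_prefix(data, pos):
    """Hypothetical hash function over the MIN_MATCH bytes starting at pos."""
    h = 0
    for b in data[pos:pos + MIN_MATCH]:
        h = (h * 31 + b) & 0xFFF
    return h


def lzss_tokens(data):
    """Return literals (int) and matches ((distance, length)) for bytes data."""
    table = defaultdict(list)   # hash value -> chain of positions, oldest first
    tokens = []
    pos, n = 0, len(data)
    while pos < n:
        best_len, best_dist = 0, 0
        if pos + MIN_MATCH <= n:
            # Walk the chain for this hash value, newest entry first.
            for cand in reversed(table[hash_prefix(data, pos)]):
                if pos - cand > WINDOW_SIZE:
                    break       # older entries lie outside the window
                length = 0
                while (length < MAX_MATCH and pos + length < n
                       and data[cand + length] == data[pos + length]):
                    length += 1
                if length > best_len:
                    best_len, best_dist = length, pos - cand
        if best_len >= MIN_MATCH:
            tokens.append((best_dist, best_len))
            step = best_len
        else:
            tokens.append(data[pos])
            step = 1
        # Register every position that was passed over (stored as an index,
        # which plays the role of the pointer to the character string).
        for p in range(pos, pos + step):
            if p + MIN_MATCH <= n:
                table[hash_prefix(data, p)].append(p)
        pos += step
    return tokens


def lzss_decode(tokens):
    """Inverse of lzss_tokens; copies byte by byte so overlaps work."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(tok)
    return bytes(out)
```

Because strings with the same hash value share one chain, the search compares the input only against candidates on that chain rather than against every position in the window. In this sketch the chains are never pruned, so a long chain (many positions sharing a hash value) still makes the search slow.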
In file compression, various methods have been proposed with the aim of increasing the compression rate, the compression speed, and the decoding speed, and of reducing memory requirements.
JP6-83573 describes a process of the LZW coding which utilizes a list structure of an external hash method for a dictionary search (Claim 1).
JP2009-296131 describes a method for selecting a hash function (Summary).
JP11-85771 describes algorithm-selection means for selecting one of multiple hash value calculation means (Summary).
JP2011-138230 describes achieving a reduction in the size of a data file and a reduction in search noise (Summary).
JP5-61910 describes performing a search by inputting a search character string including multiple characters into hash function generating means, detecting, using a generated hash value, appearance position information of the corresponding characters stored in the above full index, and determining whether or not the detected appearance position information of the individual characters corresponds relatively to the order of position of the search character string (Summary).
JP2010-515114 describes a method and system regarding efficient processing for purposes such as data hashing and/or elimination of data redundancy (paragraph 0001).
JP2000-57151 describes a technique that makes it possible to increase search speed while minimizing an increase in the total index size (Summary).
Kunihiko Sadakane et al., “Improving the Speed of LZ77 Compression by Hashing and Suffix Sorting”, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E83-A, No. 12, pages 2689-2698, December 2000, describes improving the speed of LZ77 compression by hashing and suffix sorting (Summary).