The present invention relates to methods of lossless database compression and, more particularly, to a method of lossless database compression for groups of database entries having a pre-determined common part.
Data compression techniques are well-known in the art. Lossless data compression is used when the data has to be uncompressed exactly as it was before compression. Text files are stored using lossless techniques, since losing a single character can make the text misleading or incomprehensible. Archival storage of master sources for images, video data, and audio data generally needs to be lossless as well.
In prior-art methods of lossless data compression, the essential figure of merit is the “compression ratio”, that is, the ratio of the size of the original uncompressed file to the size of the compressed file. There are, however, strict limits to the amount of compression that can be obtained with lossless techniques: lossless compression ratios are generally in the range of 2:1 to 8:1.
One of the simplest forms of data compression is known as “run length encoding” (RLE), in which a run of identical symbols is replaced by a single copy of the symbol together with a repeat count. RLE is sometimes loosely referred to as “run length limiting” (RLL), although RLL more properly denotes a family of line codes. A more sophisticated approach to lossless data compression is Huffman coding, in which short codewords are assigned to those input blocks having high probabilities and long codewords are assigned to those input blocks having low probabilities. A Huffman code is designed by merging the two least probable characters into a single composite character whose probability is the sum of the two, and repeating this process until only one character remains. A code tree is thus generated, and the Huffman code is obtained from the labeling of the branches of the code tree.
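The merging procedure described above for Huffman coding can be illustrated by the following minimal sketch. It is an illustration only, not part of any prior-art reference; the function names `huffman_code`, `encode`, and `decode` are hypothetical, and symbol probabilities are assumed to be estimated from character counts in the input text.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a prefix code by repeatedly merging the two least probable symbols."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: codeword so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w0, _, c0 = heapq.heappop(heap)  # least probable subtree
        w1, _, c1 = heapq.heappop(heap)  # second least probable subtree
        # Label the two merged branches "0" and "1".
        merged = {s: "0" + c for s, c in c0.items()}
        merged.update({s: "1" + c for s, c in c1.items()})
        heapq.heappush(heap, (w0 + w1, next_id, merged))
        next_id += 1
    return heap[0][2]


def encode(text, code):
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    inverse = {cw: sym for sym, cw in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # valid because the code is prefix-free
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    sample = "aaaabbc"
    code = huffman_code(sample)
    bits = encode(sample, code)
    print(code, bits, decode(bits, code) == sample)
```

In this sketch the most frequent character receives the shortest codeword, consistent with the assignment of short codewords to high-probability inputs.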
One particularly popular and efficient method utilizes the Lempel-Ziv algorithm, which is a variable-to-fixed length code. In the Lempel-Ziv algorithm, the input sequence is parsed into non-overlapping blocks of different lengths while constructing a dictionary of blocks seen thus far. The Lempel-Ziv algorithm exploits the fact that words and phrases within a text file are likely to be repeated. When they do repeat, they can be encoded as a pointer to an earlier occurrence, with the pointer accompanied by the number of characters to be matched.
Pointers and uncompressed characters are distinguished by a leading flag bit, with a “0” indicating a pointer and a “1” indicating an uncompressed character. This means that uncompressed characters are extended from 8 to 9 bits, which works against compression to a small degree.
One key to the operation of the Lempel-Ziv algorithm is a sliding history buffer, also known as a “sliding window”, which stores the text most recently transmitted. When the buffer fills up, the oldest contents thereof are discarded. The size of the buffer is important: if the buffer size is too small, finding string matches will be less likely; if too large, the pointers will be larger, working against compression.
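The pointer-and-flag scheme and the sliding window described above can be illustrated by the following simplified sketch. It is an illustration under stated assumptions rather than any particular prior-art implementation: the names `lz77_compress`, `lz77_decompress`, and `encoded_bits` are hypothetical, the match search is a brute-force scan, and each pointer is assumed to cost one flag bit, an offset field sized to the window, and a 4-bit length field.

```python
def lz77_compress(data, window=4096, min_match=3, max_match=18):
    """Parse data into literals (flag 1) and pointers (flag 0)."""
    tokens = []
    i = 0
    while i < len(data):
        start = max(0, i - window)  # oldest position still in the history buffer
        best_len, best_off = 0, 0
        for j in range(start, i):
            length = 0
            while (length < max_match and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= min_match:
            tokens.append((0, best_off, best_len))  # pointer: offset and length
            i += best_len
        else:
            tokens.append((1, data[i]))  # uncompressed character
            i += 1
    return tokens


def lz77_decompress(tokens):
    out = bytearray()
    for tok in tokens:
        if tok[0] == 1:
            out.append(tok[1])
        else:
            _, off, length = tok
            for _ in range(length):  # byte-by-byte copy allows overlapping matches
                out.append(out[-off])
    return bytes(out)


def encoded_bits(tokens, window=4096):
    """Assumed cost model: literal = 1 + 8 bits; pointer = 1 + offset + 4 bits."""
    offset_bits = max(1, (window - 1).bit_length())
    return sum(9 if tok[0] == 1 else 1 + offset_bits + 4 for tok in tokens)


if __name__ == "__main__":
    data = b"abcabcabcabc"
    for w in (4096, 2):
        toks = lz77_compress(data, window=w)
        print(w, toks, encoded_bits(toks, w), lz77_decompress(toks) == data)
```

Running the sketch with a window of only two characters yields literals exclusively, so every character grows from 8 to 9 bits, whereas the larger window finds the repeated string at the cost of a wider offset field in each pointer.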
Surveys of the most basic and prevalent lossless data compression techniques include: Pasi Ojala, “Compression Basics” (http://www.cs.tut.fi/~albert/Dev/pucrunch/packing.html); “Introduction/Lossless Data Compression” (http://www.vectorsite.net/ttdcmp1.html); and “Lossless Data Compression” (http://www.data-compression.com/lossless.html). Additional surveys are widely available in the literature.
Given a specific type of file, the contents of the file, particularly the orderliness and redundancy of the data, can strongly influence the compression ratio. In some cases, applying a particular data compression technique to a data file that is poorly matched to that technique can actually result in a larger file. Thus, it is essential to tailor data compression techniques to specific applications and specific data patterns.
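The dependence of the compression ratio on the redundancy of the data can be illustrated by the following short sketch, which uses the DEFLATE implementation in the Python standard library `zlib` module as a stand-in for a general-purpose lossless compressor. The sample record string and the helper name `compression_ratio` are assumptions chosen for illustration only.

```python
import random
import zlib


def compression_ratio(data):
    """Ratio of uncompressed size to compressed size (greater than 1 means smaller output)."""
    return len(data) / len(zlib.compress(data))


# Highly redundant input: the same short record repeated many times.
repetitive = b"name=alice;city=paris;" * 200

# Input with essentially no redundancy: pseudo-random bytes of the same length.
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(len(repetitive)))

if __name__ == "__main__":
    print("repetitive:", round(compression_ratio(repetitive), 2))
    print("noise:     ", round(compression_ratio(noise), 4))
    # Both inputs are still recovered exactly, as lossless compression requires.
    print(zlib.decompress(zlib.compress(noise)) == noise)
```

For the redundant input the ratio is well above 1, whereas for the pseudo-random input the compressed output is slightly larger than the original, because the container overhead is added to data that cannot be shortened.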
Data compression would appear to be important for search-engine type applications. The essential figures of merit for data compression for such applications are not limited to the compression ratio and lossless transformation. Search-engine type applications have a host of additional requirements for data compression techniques, including the ability to conduct fast searches, preferably in a deterministic fashion, and including easily-manageable updating or maintenance of the database.
There is therefore a recognized need for, and it would be highly advantageous to have, a lossless data compression method that is specifically developed and adapted for search-engine type applications, a method that achieves satisfactory compression ratios while enabling deterministic searching and facile maintenance of the database.