Data compression methods can be divided into two broad categories: lossy and lossless. Lossy data compression methods discard some information during compression. Lossless data compression, on the other hand, transforms a data set without losing information, so that the data set occupies less storage after compression than before. Lossy compression methods are often employed in situations where a loss of information is tolerable (e.g. compression of audio and video data). Lossless compression methods are preferably employed in situations where a loss of information is undesirable and fidelity is a priority (e.g. compression of text files in a database).
Lossless data compression methods, particularly as applied in database systems storing text information, help to reduce capital and operating costs. A typical database system has a finite amount of storage (e.g. memory, disk space and the like). As the amount of information in a database increases, new allocations of storage may be required. However, adding and maintaining additional storage increases capital and operating costs. In the context of large database systems, such as those employed in the financial services sector, these cost increases can make database management very expensive. Accordingly, compressing data is a useful way of making better use of available storage and limiting the need for new allocations of storage.
A particular subset of lossless data compression methods, referred to hereinafter as binary-string/symbol substitution methods, has been developed to exploit the redundancy of byte-strings repeated within a text file. Compression is accomplished by replacing frequently occurring byte-strings with shorter identifiers/placeholders, referred to hereinafter as symbols. The Lempel-Ziv 1978 (LZ78) method of data compression is at the root of this class of binary-string/symbol substitution methods. In static-dictionary methods of this class: a static dictionary is created that contains frequently occurring byte-strings and corresponding symbols; and, compression is accomplished by replacing each frequently occurring byte-string with its respective symbol (i.e. exchanging each byte-string for its paired symbol).
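The substitution step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the LZ78 algorithm itself and not any particular product's implementation: the dictionary entries, the one-byte symbol values, and the function names `compress` and `decompress` are all hypothetical, and the input is assumed to be 7-bit ASCII so that byte values of 0x80 and above are free for use as symbols.

```python
# Hypothetical static dictionary: byte-string -> one-byte symbol.
# Entries and symbol values are illustrative assumptions only.
STATIC_DICT = {
    b"compression": b"\x80",
    b"data": b"\x81",
    b"the ": b"\x82",
}


def compress(data: bytes, dictionary: dict) -> bytes:
    """Replace dictionary byte-strings with symbols, longest match first."""
    max_len = max(len(key) for key in dictionary)
    out = bytearray()
    i = 0
    while i < len(data):
        # Try the longest candidate at this position, then shorter ones.
        for length in range(min(max_len, len(data) - i), 0, -1):
            symbol = dictionary.get(data[i:i + length])
            if symbol is not None:
                out += symbol
                i += length
                break
        else:
            # No dictionary entry matches here: copy one literal byte.
            out.append(data[i])
            i += 1
    return bytes(out)


def decompress(data: bytes, dictionary: dict) -> bytes:
    """Expand each symbol back into its byte-string; literals pass through."""
    reverse = {symbol: key for key, symbol in dictionary.items()}
    out = bytearray()
    for value in data:
        unit = bytes([value])
        out += reverse.get(unit, unit)
    return bytes(out)


text = b"the data compression of the data"
packed = compress(text, STATIC_DICT)
restored = decompress(packed, STATIC_DICT)
```

Because the dictionary is fixed before compression begins, the same dictionary must be available at decompression time, which is why its storage cost counts against the overall saving.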
A number of criteria are considered when evaluating the performance of a compression method, such as computational overhead, efficiency and compression ratio. As a general rule, for compression to be considered effective, the storage allocation for the combination of a static dictionary and a respective compressed data set should be substantially smaller than that for the corresponding uncompressed data set. To that end, a static dictionary is typically defined as having a fixed and limited size, which in turn means that, in accordance with known methods of creating a static dictionary, only the most frequently occurring byte-strings are stored. However, there are a number of problems associated with this approach.
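The effectiveness criterion stated above can be written compactly. The symbols below are introduced here for illustration only and are not taken from the source: $S_{\text{orig}}$ is the storage for the uncompressed data set, $S_{\text{comp}}$ the storage for the compressed data set, and $S_{\text{dict}}$ the storage for the static dictionary. The ratio $R$ is one common convention for compression ratio; other conventions invert it.

```latex
S_{\text{dict}} + S_{\text{comp}} \ll S_{\text{orig}},
\qquad
R = \frac{S_{\text{orig}}}{S_{\text{dict}} + S_{\text{comp}}} \gg 1
```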
The most frequently occurring byte-strings are typically quite short. Because the static dictionary is biased towards retaining shorter, more frequently occurring byte-strings, the longest byte-strings that could be used may not be stored in it. Yet, during the actual compression process, byte-strings in the data set are matched to the longest byte-strings stored in the static dictionary. Consequently, the static dictionary contains a number of short byte-strings that are rarely used, and the resultant compression ratio may be reduced because the longest byte-strings that could have been matched are not available in the static dictionary during the compression process.
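The effect described above can be illustrated with a small Python sketch. It compares two hand-picked dictionaries with the same number of entries, one holding short byte-strings and one holding longer byte-strings, under greedy longest-match compression. Everything here is an assumption made for illustration: the sample text, both dictionaries, the function name `compressed_units`, and the simplification that each symbol and each literal byte costs one byte of output.

```python
def compressed_units(data: bytes, entries: list) -> int:
    """Count output bytes under greedy longest match.

    Assumes each matched entry is replaced by a one-byte symbol and
    each unmatched byte is copied as a one-byte literal.
    """
    entry_set = set(entries)
    max_len = max(len(entry) for entry in entries)
    i = 0
    units = 0
    while i < len(data):
        step = 1  # default: emit one literal byte
        # Only entries of two or more bytes can shorten the output.
        for length in range(min(max_len, len(data) - i), 1, -1):
            if data[i:i + length] in entry_set:
                step = length
                break
        i += step
        units += 1
    return units


# Hypothetical sample text with a long repeated byte-string.
text = b"account balance; " * 3 + b"account number"

# Same number of entries in each dictionary (illustrative choices).
short_entries = [b"ac", b"co", b"un", b"nt"]
long_entries = [b"account balance; ", b"account ", b"number", b"balance"]

short_units = compressed_units(text, short_entries)
long_units = compressed_units(text, long_entries)

# Include the dictionary's own storage when judging effectiveness.
short_total = short_units + sum(len(entry) for entry in short_entries)
long_total = long_units + sum(len(entry) for entry in long_entries)
```

In this contrived case the dictionary of longer byte-strings gives a smaller total even after its larger storage cost is counted; real data sets may behave differently.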