Data compression methods can be divided into two broad categories: lossy and lossless. Lossy data compression methods result in a loss of some information during compression. Lossless data compression, on the other hand, transforms a data set without losing information, so that the data set occupies less storage after compression than before. Lossy compression methods are often employed in situations where a loss of information is tolerable (e.g. compression of audio and video data). By contrast, lossless compression methods are preferably employed in situations where a loss of information is undesirable and fidelity is a priority (e.g. compression of text files in a database).
Lossless data compression methods—as particularly applied in database systems storing text information—help to reduce capital and operating costs. A typical database system has a finite amount of storage (e.g. memory, disk space and the like). As the amount of information in a database increases, new allocations of storage may be required. However, adding and maintaining additional blocks of memory adds capital and operating costs. In the context of large database systems, such as those employed in the financial services sector, such capital and operating cost increases can make database management very expensive. Accordingly, compressing data is a useful way of utilizing available storage and limiting requirements for new allocations of storage.
A particular subset of lossless data compression methods, referred to hereinafter as binary-string/symbol substitution methods, has been developed to exploit the redundancy of byte-strings repeated within a text file. Compression is accomplished by replacing frequently occurring byte-strings with shorter identifiers/placeholders, referred to hereinafter as symbols. The Lempel-Ziv 1978 (LZ78) method of data compression is at the root of this class of binary-string/symbol substitution methods. In accordance with the LZ78 method: a static dictionary is created that contains frequently occurring byte-strings and corresponding symbols; and, compression is accomplished by replacing frequently occurring byte-strings with their respective symbols (i.e. exchanging text-symbol pairs).
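The substitution step described above can be illustrated with a minimal sketch. It assumes a dictionary that has already been built and maps byte-strings to integer symbols; the function names, the greedy longest-match loop, and the token format (integers for symbols, single-byte literals for unmatched input) are illustrative assumptions and are not taken from any referenced method.

```python
def compress(data: bytes, dictionary: dict) -> list:
    """Replace byte-strings found in `dictionary` with their symbols.

    `dictionary` maps byte-strings to integer symbols (hypothetical layout).
    At each position the longest matching entry wins; unmatched bytes are
    emitted as one-byte literals.
    """
    max_len = max((len(key) for key in dictionary), default=0)
    tokens = []
    i = 0
    while i < len(data):
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(data) - i), 0, -1):
            chunk = data[i:i + length]
            if chunk in dictionary:
                tokens.append(dictionary[chunk])
                i += length
                break
        else:
            tokens.append(data[i:i + 1])  # no entry matched: keep a literal
            i += 1
    return tokens


def decompress(tokens: list, dictionary: dict) -> bytes:
    """Invert `compress` by mapping each symbol back to its byte-string."""
    reverse = {symbol: string for string, symbol in dictionary.items()}
    return b"".join(reverse[t] if isinstance(t, int) else t for t in tokens)
```

Because every symbol maps back to exactly one byte-string, decompression restores the original data exactly, which is the lossless property discussed earlier.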
A number of criteria are considered when evaluating the performance of a compression method, such as computational overhead, efficiency and compression ratio. As a general rule, for compression to be considered effective, the storage allocation for the combination of a static dictionary and a respective compressed data set should be substantially smaller than that for the corresponding uncompressed data set. To that end, a static dictionary is typically defined as having a fixed and limited size, which in turn means that only the most frequently occurring byte-strings are stored in accordance with known methods of creating a static dictionary. However, there are a number of problems associated with this approach.
The most frequently occurring byte-strings are typically quite short, which means that the longest byte-strings that could be used may not be stored in the static dictionary, since the dictionary is biased towards retaining shorter, more frequently occurring byte-strings. Yet, during the actual compression process, byte-strings in the data set are matched to the longest byte-strings stored in the static dictionary. Consequently, the static dictionary contains a number of short byte-strings that are rarely used, and the resultant compression ratio may be reduced because the longest byte-strings that could be matched are not available in the static dictionary during the compression process.
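The bias towards short byte-strings follows from a simple counting fact: every occurrence of a byte-string is also an occurrence of each of its prefixes, so a prefix can never be rarer than the string it begins. The sketch below illustrates this under the assumption that occurrences are counted at every position (overlapping); the function names and the `max_len` bound are hypothetical.

```python
from collections import Counter


def count_substrings(data: bytes, max_len: int) -> Counter:
    """Count every substring of length 1..max_len at every start position."""
    counts = Counter()
    for start in range(len(data)):
        for length in range(1, max_len + 1):
            if start + length > len(data):
                break
            counts[data[start:start + length]] += 1
    return counts


def prefix_never_rarer(counts: Counter) -> bool:
    """Check that each string's one-byte-shorter prefix occurs at least as often."""
    return all(counts[s[:-1]] >= n for s, n in counts.items() if len(s) > 1)
```

A dictionary of fixed size filled purely by raw frequency will therefore tend to admit short strings before the longer strings that contain them.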
U.S. patent application Ser. No. 11/278,118 (filed Mar. 30, 2006) discloses a method for creating a static dictionary, the method comprising: providing a plurality of data trees, each of the plurality of data trees comprising a root node, at least one of the plurality of data trees comprising at least one child node, wherein each root node and each child node stores an associated binary pattern, wherein each child node is adapted to store a symbol associated with the child node and an occurrence count value associated with the child node; defining a binary pattern string, the binary pattern string comprising a concatenation of the binary patterns in a direct path from the root node to a particular child node, and wherein an occurrence count value for the binary pattern string is the occurrence count value of the particular child node; and, incrementing the occurrence count value of the binary pattern string when the particular child node is visited. This approach counts the number of times the end-node of a particular byte-string is visited, whereas the counts of nodes storing characters in the middle of the byte-string are not incremented every time those nodes are traversed. The result is an occurrence count metric that favors longer byte-strings.
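One possible reading of the end-node counting idea is sketched below. It is not the procedure of the referenced application: the one-byte-at-a-time growth of the trees, the `Node` layout, and the function names are all assumptions made for illustration. As a simplification, root nodes also carry a count here, and the per-node symbol field is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    pattern: int                 # byte value stored at this node
    count: int = 0               # incremented only when a match ENDS here
    children: dict = field(default_factory=dict)


def build_forest(data: bytes) -> dict:
    """Build one tree per leading byte, counting end-node visits only."""
    roots = {}
    i = 0
    while i < len(data):
        node = roots.setdefault(data[i], Node(data[i]))
        i += 1
        # Follow the longest path already present in the tree.
        while i < len(data) and data[i] in node.children:
            node = node.children[data[i]]
            i += 1
        node.count += 1          # intermediate nodes on the path are not counted
        if i < len(data):
            # Assumed growth rule: extend the matched path by one byte.
            node.children[data[i]] = Node(data[i])
    return roots


def pattern_counts(roots: dict) -> dict:
    """Map each root-to-node byte-string to the count stored at its end node."""
    result = {}
    stack = [(bytes([n.pattern]), n) for n in roots.values()]
    while stack:
        path, node = stack.pop()
        result[path] = node.count
        for child in node.children.values():
            stack.append((path + bytes([child.pattern]), child))
    return result
```

For the input `b"ababab"`, the node for `a` is traversed three times but its count is incremented only once, while the string `ab` ends a match twice, so under this counting rule the longer string outranks its own prefix.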
Regardless of the manner in which the logical tree for compression is constructed, a binary representation of the logical tree (a physical compression dictionary) must be used during the compression operation. This binary representation must be laid out so as to minimize the use of CPU time and other resources; otherwise, compression will take a long time.