1. Technical Field
The invention is related to a computer-implemented data compression system and process, and more particularly, to a lossless data compression system and process that employs a unique approximate string matching scheme.
2. Background Art
Lossless compression of signals is often of critical importance. Because of storage limitationsxe2x80x94e.g. hard drive sizexe2x80x94or bandwidth limitationsxe2x80x94e.g. modem speedxe2x80x94the need for compression is often necessary. For many applications, the signal of interest must be compressed losslessly. For example, text or computer programs would be worthless if the retrieved data is not identical to the transmitted (or stored) data. Existing methods do well on this type of data. However, there are other types of data which current methods do very poorly on, but which need to be compressed losslessly. For example, certain types of image signals are both very large (and therefore in need of compression) and contain sensitive details. Examples of such signals include medical (e.g. MRI, X-ray, CT) or Synthetic Aperture Radar (SAR) images, where distortion due to lossy compression could result in missing critical features (e.g. tumors or enemy tanks), and images whose usage requires the maximum possible quality (e.g. images for magazine covers, frames for movies). Other such data signals include sounds sources (e.g. music or speech) or echo recordings (e.g. from geological surveys such as earthquake or oil-finding).
Simple lossless compression is typically performed by entropy coding (examples of such techniques are Huffman or arithmetic coding). In this case a special code is designed which maps symbols in the source to symbols which are transmitted (or stored). An entropy code uses a small number of bits for common symbols in the source, at the expense of using long codes for infrequent symbols.
Current lossless compression schemes take advantage of repeating strings of symbols within a source. For example, the combination of symbols xe2x80x9ctxe2x80x9d, xe2x80x9chxe2x80x9d and xe2x80x9cexe2x80x9d, occurs very frequently in English text. These schemes take advantage of this redundancy by decreasing the cost of transmitting or storing such common sequences. Examples of such schemes are the Lempel-Ziv family of source codes, including LZ77, LZ78, LZW and derivatives [LZ76, ZL77, Wel84], as well as the Prediction by Partial Matching (PPM) algorithm [CW84, Mof 90]. LZW is used as the foundation for the PkZip, gzip and WinZip products. LZ77 is used in the CompuServe Graphics Interchange Format (GIF) image format. The PPM algorithm (specifically the PPM* algorithm) are generally considered to be at or near the state-of-the-art in the compression of text sources. Both LZ and PPM are so-called universal compression schemes, which means that in the limit they will converge to the maximum amount of compress possible. However, this is only true in the limitxe2x80x94in the case of an infinite amount of data. In practice we are always dealing with finite amounts of data, so the rate of convergence is of critical importance.
However, these existing lossless data compression schemes do not work well when the source alphabet is large, as in the case of sound, images, and other natural signals. They fail because they cannot exploit the redundancy in these signals. This is because in such signals, strings will rarely repeat exactly. As a result, the convergence of these algorithms to optimal compression is very slow. For realistic sources, even large sources on the order of terabytes, these algorithms will fail to reach compression levels near the theoretical optimum.
In natural signals, while data strings will rarely repeat exactly, they will often repeat approximately. For example, in images, a row is often very similar to the row above it. However, it is rarely exactly the same, due to lighting and structural variations. Even in the case of a uniform object with uniform lighting, long repeating strings of pixel values are extremely uncommon due to the noise in the image capturing process. Even a slight variation in the sequence of pixels will prevent the LZ and PPM algorithms from finding and exploiting the underlying redundancy. For example, consider two rows of pixels from a black and white image shown in the following table:
Such an array of pixels might arise from an image of a white stripe on a dark background. Using the LZ algorithm or the PPM algorithm on these pixels will result in no compression at all.
Thus, while current lossless data compression methods work well for data characterized by strings that repeat exactly, efforts continue to find a method that can effectively compress data that lacks these identically matching data strings.
It is noted that in the preceding paragraphs the description refers to various publications identified by alphanumeric designators contained within a pair of brackets, e.g. [LZ76, ZL77, WelB4]. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.
The present invention is directed at a lossless data compression system and process that employs a unique approximate string matching scheme to overcome the problems associated with current compression methods. This system and process includes both an encoder and a decoder.
Generally, the encoding portion of this system and process entails identifying mutually exclusive blocks of data values within the source data being encoded that have lengths, which when represented with the pointer-residual approach described below, exhibit the lowest overall cost for transmitting or storing the source data. These data blocks are referred to as optimal blocks. A block of earlier occurring source data is also identified for each optimal block. These earlier occurring blocks are chosen so that when compared to the optimal block, their difference exhibits the lowest possible entropy. The optimal blocks and their associated blocks of earlier occurring data values are used to generate residual data. This residual data is made up of blocks of data values formed by taking the difference between the values in each corresponding data location of each optimal block and its associated block of earlier occurring source data, starting with the first data location of both blocks. The choice of a block of earlier occurring source data for use in forming a residual data block is based on a cost analysis which is designed to minimize the entropy of differences between the previous block and the new block of source dataxe2x80x94at least to a desired degree. The purpose of this difference minimization process is to create blocks of residual data having a highly kurtotic distribution so that most of the data is centered around zero. This distribution makes the block of residual data amenable to commonly used entropy-based compression techniques, such as Huffman, Shannon-Fano, or even higher order techniques such as LZ77, LZW and their derivatives, or PPM. In addition to the residual data, a series of pointers are generated. Each pointer is initially affiliated with a separate one of the optimal blocks and identifies the location and length of the block of earlier occurring source data associated with this optimal block. Preferably, prior to compressing the encoded data, each pointer is assigned to the residual data location that corresponds to the location of the first data value of the optimal block to which the pointer was affiliated. The residual data is then compressed using any appropriate entropy-based compression technique, such as those mentioned above. If desired, the pointers can also be compressed using similar techniques at this point in the process. Finally, the compressed residual data and pointers are combined to form a compressed code capable of being transmitted or stored.
The decoding portion of the system and process generally entails, first decompressing the residual data and the pointer (if applicable) using an appropriate decoder for the particular compression technique used to compress the residual data, and then recovering the source data from the residual data and pointers. The process of recovering the source data begins by decoding the first data value from the residual data by adding the first value of the residual data to a predetermined constant. Preferably, this predetermined constant is a mean of the original source data values and was used to encode the first data valued during the encoding process (as will be described more fully in the detailed description section of this specification). The next step in the process is to select the first occurring pointer. In a preferred embodiment of the present invention, each pointer takes the form of an offset-count pair having an offset value defining the offset to an earlier occurring data location and a count value defining a number of data values in a block of data. A block of data values is identified in the previously decoded data by, first counting back through the decoded data from the last decoded data location by the number of locations indicated by the offset value of the currently selected offset-count pair (which in this first instant is the first occurring pair). The data location so identified represents the first data value of the block. The rest of the block is made up of the previously decoded data values immediately following the first value, up to a limit indicated by the count value of the currently selected offset-count pair. The decoding process continues with the identification of a block of residual values. This block is made up of those values associated with residual data locations starting at and following the residual data location to which the currently selected offset-count pair is assigned, up to a limit indicated by the count value of the currently selected pair. Thus, two blocks of data are identified. Namely, a block of residual values and an equal-sized block of earlier occurring, previously decoded data. The data values of the identified block of previously decoded data values is added in sequence to the correspondingly located values of the identified block of residual values, beginning with the first value in both blocks. The resulting sums are designated the next block of recovered source data. The currently selected offset-count pair is then deselected and replaced with the next occurring pair. The foregoing process is repeated for the newly selected offset-count pair until there are no more offset-count pairs to select. At that point all the source data would have been recovered.