1. Technical Field
The present invention relates generally to data compression and decompression and, more particularly to systems and methods for providing lossless data compression and decompression using a combination of dictionary and run length encoding.
2. Description of Related Art
Information may be represented in a variety of manners. Discrete information such as text and numbers are easily represented in digital data. This type of data representation is known as symbolic digital data. Symbolic digital data is thus an absolute representation of data such as a letter, figure, character, mark, machine code, or drawing.
Continuous information such as speech, music, audio, images and video frequently exists in the natural world as analog information. As is well-known to those skilled in the art, recent advances in very large scale integration (VLSI) digital computer technology have enabled both discrete and analog information to be represented with digital data. Continuous information represented as digital data is often referred to as diffuse data. Diffuse digital data is thus a representation of data that is of low information density and is typically not easily recognizable to humans in its native form.
There are many advantages associated with digital data representation. For instance, digital data is more readily processed, stored, and transmitted due to its inherently high noise immunity. In addition, the inclusion of redundancy in digital data representation enables error detection and/or correction. Error detection and/or correction capabilities are dependent upon the amount and type of data redundancy, available error detection and correction processing, and extent of data corruption.
One outcome of digital data representation is the continuing need for increased capacity in data processing, storage, retrieval and transmittal. This is especially true for diffuse data where continuing increases in fidelity and resolution create exponentially greater quantities of data. Within the current art, data compression is widely used to reduce the amount of data required to process, transmit, store and/or retrieve a given quantity of information. In general, there are two types of data compression techniques that may be utilized either separately or jointly to encode and decode data: lossy and lossless data compression.
Lossy data compression techniques provide for an inexact representation of the original uncompressed data such that the decoded (or reconstructed) data differs from the original unencoded/uncompressed data. Lossy data compression is also known as irreversible or noisy compression. Negentropy is defined as the quantity of information in a given set of data. Thus, one obvious advantage of lossy data compression is that the compression ratios can be larger than that dictated by the negentropy limit, all at the expense of information content. Many lossy data compression techniques seek to exploit various traits within the human senses to eliminate otherwise imperceptible data. For example, lossy data compression of visual imagery might seek to delete information content in excess of the display resolution or contrast ratio of the target display device.
On the other hand, lossless data compression techniques provide an exact representation of the original uncompressed data. Simply stated, the decoded (or reconstructed) data is identical to the original unencoded/uncompressed data. Lossless data compression is also known as reversible or noiseless compression. Thus, lossless data compression has, as its current limit, a minimum representation defined by the negentropy of a given data set.
It is well known within the current art that data compression provides several unique benefits. First, data compression can reduce the time to transmit data by more efficiently utilizing low bandwidth data links. Second, data compression economizes on data storage and allows more information to be stored for a fixed memory size by representing information more efficiently.
A rich and highly diverse set of lossless data compression and decompression algorithms exist within the current art. These range from the simplest xe2x80x9cadhocxe2x80x9d approaches to highly sophisticated formalized techniques that span the sciences of information theory, statistics, and artificial intelligence. One fundamental problem with almost all modern approaches is the compression ratio verses the encoding and decoding speed achieved. As previously stated, the current theoretical limit for data compression is the entropy limit of the data set to be encoded. However, in practice, many factors actually limit the compression ratio achieved. Most modern compression algorithms are highly content dependent. Content dependency exceeds the actual statistics of individual elements and often includes a variety of other factors including their spatial location within the data set.
Within the current art there also presently exists a strong inverse relationship between achieving the maximum (current) theoretical compression ratio, referred to as xe2x80x9calgorithmic effectivenessxe2x80x9d, and requisite processing time. For a given single algorithm the xe2x80x9ceffectivenessxe2x80x9d over a broad class of data sets including text, graphics, databases, and executable object code is highly dependent upon the processing effort applied. Given a baseline data set, processor operating speed and target architecture, along with its associated supporting memory and peripheral set, xe2x80x9calgorithmic efficiencyxe2x80x9d is defined herein as the time required to achieve a given compression ratio. Algorithmic efficiency assumes that a given algorithm is implemented in an optimum object code representation executing from the optimum places in memory. This is virtually never achieved in practice due to limitations within modern optimizing software compilers. In addition, an optimum algorithmic implementation for a given input data set may not be optimum for a different data set. Much work remains in developing a comprehensive set of metrics for measuring data compression algorithmic performance, however for present purposes the previously defined terms of algorithmic effectiveness and efficiency should suffice.
Of the most widely utilized compression techniques, arithmetic coding possesses the highest degree of algorithmic effectiveness but, as expected, is the slowest to execute. This is followed in turn by dictionary compression, Huffman coding, and run-length coding techniques with respectively decreasing execution times. What is not apparent from these algorithms, that is also one major deficiency within the current art, is knowledge of their algorithmic efficiency. More specifically, given a compression ratio that is within the effectiveness of multiple algorithms, the question arises as to their corresponding efficiency on various data sets.
The present invention is directed to systems and methods for providing lossless data compression and decompression. The present invention exploits various characteristics of run-length encoding, parametric dictionary encoding, and bit packing to comprise an encoding/decoding process having an efficiency that is suitable for use in real-time lossless data compression and decompression applications.
In one aspect of the present invention, a method for compressing input data comprising a plurality of data blocks comprises the steps of:
detecting if the input data comprises a run-length sequence of data blocks;
outputting an encoded run-length sequence, if a run-length sequence of data blocks is detected;
maintaining a dictionary comprising a plurality of code words, wherein each code word in the dictionary is associated with a unique data block string;
building a data block string from at least one data block in the input data that is not part of a run-length sequence;
searching for a code word in the dictionary having a unique data block string associated therewith that matches the built data block string; and
outputting the code word representing the built data block string.
In another aspect of the present invention, the dictionary is dynamically maintained and updated during the encoding process by generating a new code word corresponding to a built data block string, if the built data block string does not match a unique data block string in the dictionary, and then adding the new code word in the dictionary.
In yet another aspect of the present invention, the dictionary is initialized during the encoding process if the number of code words (e.g., dictionary indices) in the dictionary exceeds a predetermined threshold. When the dictionary is initialized, a code word is output in the encoded data stream to indicate that the dictionary has been initialized at that point in the encoding process. An initialization process further comprises resetting the dictionary to only include each possible code word corresponding to a unique data block string comprising a single data block. By way of example, if each data block comprises a byte of data, there will be 256 possible code words for a data block string comprising a single byte. In this instance, the dictionary reset to its initial state will comprise 256 entries.
In another aspect of the present invention, the dictionary further comprises a plurality of control code words, wherein a control code word is designated to represent a dictionary initialization, a run-length encoded sequence, and the end of the input data (or completion of the encoding process). These control words are used in the decoding process to re-create the input data.
In yet another aspect of the present invention, a bit-packing process is employed to pack the bits of successive output code words representing encoded run-length sequences and data block strings.
In another aspect of the present invention, a method for decompressing an encoded data stream comprising a plurality of code words, which is generated using the encoding method, comprises the steps of:
maintaining a dictionary comprising a plurality of code words utilized to generate the encoded data stream, wherein the code words in the dictionary comprise control code words and code words that are each associated with a unique data block string;
decoding and outputting a run-length sequence of data blocks associated with an input code word of the encoded data stream, if the input code word is a control code word in the dictionary that indicates an encoded run-length sequence;
outputting a unique data block string in the dictionary that is associated with an input code word of the encoded data stream, if the input code word is found in the dictionary; and
if the input code word is not found in the dictionary, building a new data block string comprising (1) the unique data block string associated with a previous control word found in the dictionary and (2) the first data block of the unique data block string, adding the new string to the dictionary, and outputting the new string.
These and other aspects, features and advantages of the present invention will become apparent from the following detailed description of preferred embodiments, which is to be read in connection with the accompanying drawings.