1. Technical Field
The present invention relates generally to data compression and decompression and, more particularly, to systems and methods for data compression using content independent and content dependent data compression and decompression.
2. Description of Related Art
Information may be represented in a variety of manners. Discrete information such as text and numbers is easily represented in digital data. This type of data representation is known as symbolic digital data. Symbolic digital data is thus an absolute representation of data such as a letter, figure, character, mark, machine code, or drawing.
Continuous information, such as speech, music, audio, images, and video, frequently exists in the natural world as analog information. As is well known to those skilled in the art, recent advances in very large scale integration (VLSI) digital computer technology have enabled both discrete and analog information to be represented with digital data. Continuous information represented as digital data is often referred to as diffuse data. Diffuse digital data is thus a representation of data that is of low information density and is typically not easily recognizable to humans in its native form.
There are many advantages associated with digital data representation. For instance, digital data is more readily processed, stored, and transmitted due to its inherently high noise immunity. In addition, the inclusion of redundancy in digital data representation enables error detection and/or correction. Error detection and/or correction capabilities are dependent upon the amount and type of data redundancy, available error detection and correction processing, and extent of data corruption.
One outcome of digital data representation is the continuing need for increased capacity in data processing, storage, and transmittal. This is especially true for diffuse data where increases in fidelity and resolution create exponentially greater quantities of data. Data compression is widely used to reduce the amount of data required to process, transmit, or store a given quantity of information. In general, there are two types of data compression techniques that may be utilized either separately or jointly to encode/decode data: lossless and lossy data compression.
Lossy data compression techniques provide for an inexact representation of the original uncompressed data such that the decoded (or reconstructed) data differs from the original unencoded/uncompressed data. Lossy data compression is also known as irreversible or noisy compression. Entropy is defined as the quantity of information in a given set of data. Thus, one obvious advantage of lossy data compression is that the compression ratios can be larger than the entropy limit, all at the expense of information content. Many lossy data compression techniques seek to exploit various traits within the human senses to eliminate otherwise imperceptible data. For example, lossy data compression of visual imagery might seek to delete information content in excess of the display resolution or contrast ratio.
On the other hand, lossless data compression techniques provide an exact representation of the original uncompressed data. Simply stated, the decoded (or reconstructed) data is identical to the original unencoded/uncompressed data. Lossless data compression is also known as reversible or noiseless compression. Thus, lossless data compression has, as its current limit, a minimum representation defined by the entropy of a given data set.
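To make the entropy limit concrete, the following is a minimal sketch that assumes a zeroth-order (memoryless, per-byte) model, which is only one of many ways to estimate the entropy referred to above. The function names are illustrative and are not part of the invention.

```python
import math
from collections import Counter


def shannon_entropy_bits_per_byte(data: bytes) -> float:
    """Zeroth-order Shannon entropy of a byte sequence, in bits per byte."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def order0_size_bound_bytes(data: bytes) -> float:
    """Smallest size, in bytes, reachable by an order-0 lossless code.

    This bound holds only under the memoryless assumption; encoders that
    model context (for example, repeated strings) can do better.
    """
    return shannon_entropy_bits_per_byte(data) * len(data) / 8


if __name__ == "__main__":
    for sample in (b"aaaaaaaa", b"abababab", bytes(range(256))):
        bits = shannon_entropy_bits_per_byte(sample)
        print(len(sample), "bytes,", round(bits, 3), "bits per byte")
```

Under this model, a block in which all 256 byte values occur equally often measures 8 bits per byte, while a block consisting of one repeated byte measures 0 bits per byte.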
There are various problems associated with the use of lossless compression techniques. One fundamental problem encountered with most lossless data compression techniques is their content-sensitive behavior. This is often referred to as data dependency. Data dependency implies that the compression ratio achieved is highly contingent upon the content of the data being compressed. For example, database files often have large unused fields and high data redundancies, offering the opportunity to losslessly compress data at ratios of 5 to 1 or more. In contrast, concise software programs have little to no data redundancy and, typically, will not losslessly compress better than 2 to 1.
Another problem with lossless compression is that there are significant variations in the compression ratio obtained when using a single lossless data compression technique for data streams having different data content and data size. This phenomenon is known as natural variation.
A further problem is that negative compression may occur when certain data compression techniques act upon many types of highly compressed data. Highly compressed data appears random, and many data compression techniques will substantially expand, rather than compress, this type of data.
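Data dependency and negative compression can be observed with a short experiment. The sketch below uses DEFLATE through Python's standard `zlib` module purely as an example encoder; the source does not name a specific technique. Seeded pseudo-random bytes are used as an assumed stand-in for highly compressed data.

```python
import random
import zlib


def compression_ratio(original: bytes, encoded: bytes) -> float:
    """Uncompressed size divided by compressed size; below 1.0 means expansion."""
    return len(original) / len(encoded)


# Highly redundant input, loosely analogous to a file with large unused fields.
redundant = b"ABCD" * 4096

# Seeded pseudo-random bytes stand in for already highly compressed data.
rng = random.Random(0)
random_like = bytes(rng.getrandbits(8) for _ in range(16384))

redundant_ratio = compression_ratio(redundant, zlib.compress(redundant))
random_ratio = compression_ratio(random_like, zlib.compress(random_like))

print(f"redundant data:   {redundant_ratio:.1f} to 1")
print(f"random-like data: {random_ratio:.4f} to 1")
```

With this particular encoder the random-like block should come out slightly larger than it went in, because framing overhead is added and no redundancy is found. Other techniques may expand such data by a larger margin.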
For a given application, there are many factors that govern the applicability of various data compression techniques. These factors include compression ratio, encoding and decoding processing requirements, encoding and decoding time delays, compatibility with existing standards, and implementation complexity and cost, along with the adaptability and robustness to variations in input data. A direct relationship exists in the current art between compression ratio and the amount and complexity of processing required. One of the limiting factors in most existing prior art lossless data compression techniques is the rate at which the encoding and decoding processes are performed. Hardware and software implementation tradeoffs are often dictated by encoder and decoder complexity along with cost.
Another problem associated with lossless compression methods is determining the optimal compression technique for a given set of input data and intended application. To combat this problem, there are many conventional content dependent techniques that may be utilized. For instance, file type descriptors are typically appended to file names to describe the application programs that normally act upon the data contained within the file. In this manner data types, data structures, and formats within a given file may be ascertained. Fundamental limitations with this content dependent technique include:
(1) the extremely large number of application programs, some of which do not possess published or documented file formats, data structures, or data type descriptors;
(2) the ability for any data compression supplier or consortium to acquire, store, and access the vast amounts of data required to identify known file descriptors and associated data types, data structures, and formats; and
(3) the rate at which new application programs are developed and the need to update file format data descriptions accordingly.
An alternative technique that approaches the problem of selecting an appropriate lossless data compression technique is disclosed, for example, in U.S. Pat. No. 5,467,087 to Chu entitled "High Speed Lossless Data Compression System" ("Chu"). FIG. 1 illustrates an embodiment of this data compression and decompression technique. Data compression 1 comprises two phases, a data pre-compression phase 2 and a data compression phase 3. Data decompression 4 of a compressed input data stream also comprises two phases, a data type retrieval phase 5 and a data decompression phase 6. During the data compression process 1, the data pre-compressor 2 accepts an uncompressed data stream, identifies the data type of the input stream, and generates a data type identification signal. The data compressor 3 selects a data compression method from a preselected set of methods to compress the input data stream, with the intention of producing the best available compression ratio for that particular data type. There are several limitations associated with the Chu method. One such limitation is the need to unambiguously identify various data types. While these might include such common data types as ASCII, binary, or Unicode, there, in fact, exists a broad universe of data types that fall outside the three most common data types. Examples of these alternate data types include: signed and unsigned integers of various lengths, differing types and precision of floating point numbers, pointers, other forms of character text, and a multitude of user defined data types. Additionally, data types may be interspersed or partially compressed, making data type recognition difficult and/or impractical. Another limitation is that given a known data type, or mix of data types within a specific set or subset of input data, it may be difficult and/or impractical to predict which data encoding technique yields the highest compression ratio.
Accordingly, there is a need for a data compression system and method that would address limitations in conventional data compression techniques as described above.
The present invention is directed to systems and methods for providing fast and efficient data compression using a combination of content independent data compression and content dependent data compression. In one aspect of the invention, a method for compressing data comprises the steps of:
analyzing a data block of an input data stream to identify a data type of the data block, the input data stream comprising a plurality of disparate data types;
performing content dependent data compression on the data block, if the data type of the data block is identified; and
performing content independent data compression on the data block, if the data type of the data block is not identified.
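The three steps above can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the recognizer (a printable-ASCII check), the type name "text", and the choice of `bz2` and `zlib` as the content dependent and content independent encoders are all hypothetical.

```python
import bz2
import zlib
from typing import Callable, Dict, Optional, Tuple


def identify_data_type(block: bytes) -> Optional[str]:
    """Hypothetical recognizer: 'text' for printable ASCII, otherwise None."""
    if block and all(32 <= b < 127 or b in (9, 10, 13) for b in block):
        return "text"
    return None


# Assumed association between an identified data type and its encoder.
CONTENT_DEPENDENT: Dict[str, Callable[[bytes], bytes]] = {"text": bz2.compress}


def compress_block(block: bytes) -> Tuple[str, bytes]:
    """Route one block of a mixed-type stream to the matching compression path."""
    data_type = identify_data_type(block)
    if data_type is not None:
        # Data type identified: content dependent compression.
        return "dependent:" + data_type, CONTENT_DEPENDENT[data_type](block)
    # Data type not identified: content independent compression.
    return "independent", zlib.compress(block)


if __name__ == "__main__":
    stream = [b"plain text block\n" * 50, bytes([0, 255, 1, 254]) * 100]
    for block in stream:
        path, encoded = compress_block(block)
        print(path, len(block), "->", len(encoded))
```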
In another aspect, the step of performing content independent data compression comprises: encoding the data block with a plurality of encoders to provide a plurality of encoded data blocks; determining a compression ratio obtained for each of the encoders; comparing each of the determined compression ratios with a first compression threshold; selecting for output the input data block and appending a null compression descriptor to the input data block, if all of the encoder compression ratios do not meet the first compression threshold; and selecting for output the encoded data block having the highest compression ratio and appending a corresponding compression type descriptor to the selected encoded data block, if at least one of the compression ratios meets the first compression threshold.
In another aspect, the step of performing content dependent compression comprises the steps of: selecting one or more encoders associated with the identified data type and encoding the data block with the selected encoders to provide a plurality of encoded data blocks; determining a compression ratio obtained for each of the selected encoders; comparing each of the determined compression ratios with a second compression threshold; selecting for output the input data block and appending a null compression descriptor to the input data block, if all of the encoder compression ratios do not meet the second compression threshold; and selecting for output the encoded data block having the highest compression ratio and appending a corresponding compression type descriptor to the selected encoded data block, if at least one of the compression ratios meets the second compression threshold.
In yet another aspect, the step of performing content independent data compression on the data block, if the data type of the data block is not identified, comprises the steps of: estimating a desirability of using one or more encoder types based on characteristics of the data block; and compressing the data block using one or more desirable encoders.
In another aspect, the step of performing content dependent data compression on the data block, if the data type of the data block is identified, comprises the steps of: estimating a desirability of using one or more encoder types based on characteristics of the data block; and compressing the data block using one or more desirable encoders.
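The encode, compare, and select procedure for content independent compression might look like the sketch below. Several details are assumptions made for illustration: the three standard-library encoders, the one-byte descriptor values, the default threshold of 1.1, and the reading of "meets the threshold" as a ratio greater than or equal to it.

```python
import bz2
import lzma
import zlib

NULL_DESCRIPTOR = 0  # assumed value meaning "block stored uncompressed"

# Assumed one-byte compression type descriptors and their encoders/decoders.
ENCODERS = {1: zlib.compress, 2: bz2.compress, 3: lzma.compress}
DECODERS = {1: zlib.decompress, 2: bz2.decompress, 3: lzma.decompress}


def content_independent_compress(block: bytes, threshold: float = 1.1) -> bytes:
    """Try every encoder and keep the best result that meets the threshold."""
    best_descriptor, best_output, best_ratio = NULL_DESCRIPTOR, block, 0.0
    for descriptor, encode in ENCODERS.items():
        encoded = encode(block)
        ratio = len(block) / len(encoded)
        if ratio >= threshold and ratio > best_ratio:
            best_descriptor, best_output, best_ratio = descriptor, encoded, ratio
    # Append the descriptor: null if no encoder met the threshold.
    return best_output + bytes([best_descriptor])


def content_independent_decompress(record: bytes) -> bytes:
    """Read the trailing descriptor and reverse the matching encoder."""
    payload, descriptor = record[:-1], record[-1]
    if descriptor == NULL_DESCRIPTOR:
        return payload
    return DECODERS[descriptor](payload)


if __name__ == "__main__":
    record = content_independent_compress(b"the quick brown fox " * 200)
    print("descriptor:", record[-1], "size:", len(record))
```

Passing the input block through unchanged with a null descriptor bounds the worst-case expansion to the size of the descriptor itself.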
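One way to picture the desirability estimate is as a set of inexpensive statistics computed on the block before any encoder runs. The sketch below is hypothetical throughout: the two encoder families ("run_length" and "statistical"), the scoring formulas, and the cut-off of 0.25 are assumptions, and only the selection step is shown, not the compression that would follow.

```python
import math
from collections import Counter
from typing import Dict, List


def estimate_desirability(block: bytes) -> Dict[str, float]:
    """Score hypothetical encoder families from 0.0 (poor fit) to 1.0 (good fit)."""
    if len(block) < 2:
        return {"run_length": 0.0, "statistical": 0.0}
    total = len(block)
    # Order-0 entropy in bits per byte; low entropy favors statistical coding.
    entropy = -sum(
        (c / total) * math.log2(c / total) for c in Counter(block).values()
    )
    # Fraction of adjacent byte pairs that are equal; high values favor run-length coding.
    repeats = sum(1 for a, b in zip(block, block[1:]) if a == b)
    return {
        "run_length": repeats / (total - 1),
        "statistical": 1.0 - entropy / 8.0,
    }


def desirable_encoders(block: bytes, min_score: float = 0.25) -> List[str]:
    """Return encoder families scoring at least min_score, best first."""
    scores = estimate_desirability(block)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [name for name in ranked if scores[name] >= min_score]


if __name__ == "__main__":
    print(desirable_encoders(b"\x00" * 1000))
    print(desirable_encoders(b"abab" * 100))
    print(desirable_encoders(bytes(range(256)) * 4))
```

Running only the encoders that pass such a filter trades a small risk of missing the best encoder for less processing per block.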
In another aspect, the step of analyzing the data block comprises analyzing the data block to recognize one of a data type, data structure, data block format, file substructure, and/or file type. A further step comprises maintaining an association between encoder types and data types, data structures, data block formats, file substructures, and/or file types.
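As an illustration of recognition plus a maintained association, the sketch below matches a few widely documented leading file signatures and looks up encoder types for the recognized type. The signatures for PNG, gzip, and PDF are standard; the encoder type names and the association table itself are hypothetical.

```python
from typing import Dict, List, Optional

# Well-known leading signatures ("magic numbers") for a few file formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\x1f\x8b": "gzip",
    b"%PDF-": "pdf",
}

# Hypothetical association between recognized types and encoder types.
# Formats that are already compressed are assumed to get no encoder.
ENCODERS_FOR_TYPE: Dict[str, List[str]] = {
    "png": [],
    "gzip": [],
    "pdf": ["dictionary"],
    "text": ["dictionary", "statistical"],
}


def recognize(block: bytes) -> Optional[str]:
    """Return a recognized type name, or None if the block is not identified."""
    for signature, name in SIGNATURES.items():
        if block.startswith(signature):
            return name
    if block and all(32 <= b < 127 or b in (9, 10, 13) for b in block):
        return "text"
    return None


def encoders_for(block: bytes) -> Optional[List[str]]:
    """Look up the associated encoder types; None means 'not identified'."""
    data_type = recognize(block)
    if data_type is None:
        return None
    return ENCODERS_FOR_TYPE[data_type]


if __name__ == "__main__":
    print(encoders_for(b"%PDF-1.4 example"))
    print(encoders_for(b"\x00\x01\x02"))
```

Keeping the association in a table means new types or encoders can be added without changing the recognition logic.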
In yet another aspect of the invention, a method for compressing data comprises the steps of:
analyzing a data block of an input data stream to identify a data type of the data block, the input data stream comprising a plurality of disparate data types;
performing content dependent data compression on the data block, if the data type of the data block is identified;
determining a compression ratio of the compressed data block obtained using the content dependent compression and comparing the compression ratio with a first compression threshold; and
performing content independent data compression on the data block, if the data type of the data block is not identified or if the compression ratio of the compressed data block obtained using the content dependent compression does not meet the first compression threshold.
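The fallback behavior of this aspect can be sketched as follows. The recognizer, the encoder table, the default threshold of 1.5, and the treatment of "meets the threshold" as greater than or equal to it are assumptions for illustration; the demonstration uses an identity function as a deliberately weak content dependent encoder to force the fallback.

```python
import zlib
from typing import Callable, Dict, Optional, Tuple

Encoder = Callable[[bytes], bytes]


def compress_with_fallback(
    block: bytes,
    identify: Callable[[bytes], Optional[str]],
    dependent_encoders: Dict[str, Encoder],
    independent_encoder: Encoder,
    threshold: float = 1.5,
) -> Tuple[str, bytes]:
    """Use content dependent compression, falling back when it underperforms."""
    data_type = identify(block)
    if data_type is not None and data_type in dependent_encoders:
        encoded = dependent_encoders[data_type](block)
        # An empty result is treated as a failure to avoid dividing by zero.
        if len(encoded) > 0 and len(block) / len(encoded) >= threshold:
            return "dependent", encoded
    # Type not identified, or the content dependent ratio missed the threshold.
    return "independent", independent_encoder(block)


def identify_demo(block: bytes) -> Optional[str]:
    """Hypothetical recognizer: blocks starting with b'REC' are 'record' data."""
    return "record" if block.startswith(b"REC") else None


if __name__ == "__main__":
    block = b"REC" + b"\x00" * 1000
    weak = {"record": lambda b: b}  # identity "encoder", ratio is exactly 1.0
    strong = {"record": lambda b: zlib.compress(b, 9)}
    print(compress_with_fallback(block, identify_demo, weak, zlib.compress)[0])
    print(compress_with_fallback(block, identify_demo, strong, zlib.compress)[0])
```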
Advantageously, the present invention employs a plurality of encoders applying a plurality of compression techniques on an input data stream so as to achieve maximum compression in accordance with the real-time or pseudo real-time data rate constraint. Thus, the output bit rate is not fixed and the amount, if any, of permissible data quality degradation is user or data specified.
These and other aspects, features and advantages of the present invention will become apparent from the following detailed description of preferred embodiments, which is to be read in connection with the accompanying drawings.