The present invention relates in general to data compression techniques. In particular, the present invention relates to the manipulation of electronic data while the data is encoded for storage in a form that requires less storage space.
Data compression is used in most data storage systems today. Typical compression techniques analyze data at the level of individual bits. Analyzing data at the bit level is known to destroy the information structure that is required to edit and search data fields.
The benefits of using a compression technology arise from the impact of compression on the size of the data. These benefits relate not only to the size of the stored data but also to the speed at which the data can be accessed.
Reduction in the stored data size is important in archival and mass storage systems. Document and record databases are typical of archival systems, while commercial databases dominate the mass storage market. Reduction in the size of data in transmission systems is also important. Examples of on-line data systems in which data compression is used include commercial network transmissions and some internet data links.
A desired feature of such known data compression techniques is that they be lossless, meaning that the original data can be recovered exactly from the compressed data. In these applications, users are particularly sensitive to the error rates and error susceptibility of the data.
Huffman coding is known to be the basis for many of the commercially available compression programs. Huffman coding begins with an analysis of the entire data set and establishes the weight, that is, the frequency of occurrence, of each symbol in the set. Libraries of repeated data are then assembled, with frequent symbols encoded using fewer bits than less frequent symbols. Sequences of binary patterns that represent the data stream are replaced by a coded table of binary terms. The coded table is expanded based on the occurrence of new binary patterns. The original data is restored from this binary data stream and the embedded table.
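By way of illustration only, the weighting and code assignment described above can be sketched as follows. This is a minimal sketch of textbook static Huffman coding, not the implementation of any particular commercial program. The function names (huffman_code, huffman_encode, huffman_decode) are hypothetical, and the bits are kept as a character string for readability rather than packed into bytes.

```python
import heapq
from collections import Counter


def huffman_code(data: bytes) -> dict[int, str]:
    """Build a prefix code in which frequent symbols get shorter bit strings."""
    freq = Counter(data)  # weight of each symbol in the data set
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: code_so_far}).
    # The unique tie_breaker keeps the dictionaries from being compared.
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def huffman_encode(data: bytes, code: dict[int, str]) -> str:
    """Replace each symbol with its bit string."""
    return "".join(code[b] for b in data)


def huffman_decode(bits: str, code: dict[int, str]) -> bytes:
    """Recover the original data from the bit string and the code table."""
    inverse = {c: s for s, c in code.items()}
    out = bytearray()
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix property: the first match is the symbol
            out.append(inverse[current])
            current = ""
    return bytes(out)
```

In this sketch the decoder needs the code table as well as the bit string, which corresponds to the embedded table mentioned above.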
Another known compression technique is the run length encoding technique (“RLE”). RLE compression schemes encode a data stream by replacing a repeating sequence of bytes with a count and the repeated byte.
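For illustration, the count-and-byte replacement can be sketched as follows. This is a minimal sketch that assumes each run is stored as a one-byte count followed by the repeated byte, so runs longer than 255 bytes are split; actual RLE formats vary, and the function names are hypothetical.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data as (count, byte) pairs; the count is capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Extend the run while the next byte repeats and the count fits in one byte.
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Expand each (count, byte) pair back into a run of bytes."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        count, value = encoded[k], encoded[k + 1]
        out += bytes([value]) * count
    return bytes(out)
```

Under this assumed format, data without repeated bytes doubles in size when encoded, since every single byte becomes a pair.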
Another very common compression technique involves the use of the Lempel-Ziv-Welch (“LZW”) algorithm. LZW compression schemes encode a streaming byte sequence using a dynamic table. The dynamic table is implicit in the encoded data stream: it is not stored separately, and the decoder rebuilds it as the codes are read. LZW variants typically achieve better data compression than is available using either the RLE or Huffman encoding techniques.
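The use of the dynamic table can be illustrated with a minimal sketch of the textbook LZW algorithm. The sketch assumes an initial table of all 256 single-byte values and an unbounded table size, and it returns the codes as a list of integers rather than packing them into bits; the function names are hypothetical.

```python
def lzw_encode(data: bytes) -> list[int]:
    """Encode a byte sequence as a list of table codes."""
    table = {bytes([i]): i for i in range(256)}  # initial single-byte entries
    out = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate  # keep extending the match
        else:
            out.append(table[current])
            table[candidate] = len(table)  # the table grows as new patterns occur
            current = bytes([byte])
    if current:
        out.append(table[current])
    return out


def lzw_decode(codes: list[int]) -> bytes:
    """Rebuild the table while decoding; no table is transmitted."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):
            # The code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out += entry
        table[len(table)] = previous + entry[:1]
        previous = entry
    return bytes(out)
```

The decoder in this sketch starts from the same initial table as the encoder and adds the same entries in the same order, which is why the table does not have to be stored alongside the codes.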
Another encoding technique uses arithmetic coding. Arithmetic coding uses a probability line from 0 to 1 and assigns to every symbol a range on this line based on its probability; the higher the probability, the wider the range that is assigned to the symbol. Once the ranges and the probability line have been defined, the encoding of the symbols is initiated. Each successive symbol narrows the current interval to that symbol's range within it, so that the sequence of symbols determines where on the line the output floating point number is located.
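The narrowing of the interval can be illustrated with the following minimal sketch. As a simplifying assumption it uses exact fractions in place of the floating point number referred to above, so that no precision is lost; practical coders use fixed-precision integer arithmetic. The function names are hypothetical, and the decoder is assumed to be told the message length.

```python
from collections import Counter
from fractions import Fraction


def build_ranges(message: str) -> dict[str, tuple[Fraction, Fraction]]:
    """Assign each symbol a sub-range of [0, 1) as wide as its probability."""
    counts = Counter(message)
    total = len(message)
    ranges = {}
    low = Fraction(0)
    for sym in sorted(counts):
        width = Fraction(counts[sym], total)
        ranges[sym] = (low, low + width)
        low += width
    return ranges


def arithmetic_encode(message: str, ranges) -> Fraction:
    """Narrow [low, high) once per symbol and return a number inside the result."""
    low, high = Fraction(0), Fraction(1)
    for sym in message:
        span = high - low
        sym_low, sym_high = ranges[sym]
        low, high = low + span * sym_low, low + span * sym_high
    return (low + high) / 2  # any value in the final interval identifies the message


def arithmetic_decode(value: Fraction, ranges, length: int) -> str:
    """Find the range containing the value, emit its symbol, rescale, and repeat."""
    out = []
    for _ in range(length):
        for sym, (sym_low, sym_high) in ranges.items():
            if sym_low <= value < sym_high:
                out.append(sym)
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)
```

For example, for the message "abracadabra" the symbol "a" occurs 5 times out of 11 and therefore receives a range of width 5/11, the widest in the table.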
In any data storage system, the data can be stored either unencoded or encoded. The stored data typically needs to be updated using operations such as locating particular data items in the storage system, inserting more data, deleting existing data and changing the data. When the stored data is unencoded, such operations are straightforward. However, when the data is stored as encoded data, these operations become more complex. For example, in order to move to a particular offset, the data needs to be decoded first so that the decoded offset of the data can be calculated. In order to insert data, the original data needs to be decoded, the new data inserted, and the resultant data then encoded back into the data storage system. In order to delete data, the data to be deleted usually needs to be extracted from the encoded data and removed, and the modified data then re-encoded. Similarly, in order to change data, the data to be changed usually needs to be extracted from the encoded data and changed, and the modified data then re-encoded. The need to first decode the data, manipulate it and then encode it again adversely impacts the storage requirements and the speed of such data manipulations.
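The decode, manipulate and re-encode cycle described above can be sketched as follows. The sketch uses Python's standard zlib module purely as a stand-in for whichever encoding a storage system employs, and the function names are hypothetical.

```python
import zlib


def insert_into_encoded(encoded: bytes, offset: int, new_data: bytes) -> bytes:
    """Insert new_data at a decoded offset by decoding, editing and re-encoding."""
    decoded = zlib.decompress(encoded)  # the whole item is decoded first
    modified = decoded[:offset] + new_data + decoded[offset:]
    return zlib.compress(modified)  # the whole item is encoded again


def delete_from_encoded(encoded: bytes, offset: int, length: int) -> bytes:
    """Delete length bytes at a decoded offset, again via a full round trip."""
    decoded = zlib.decompress(encoded)
    modified = decoded[:offset] + decoded[offset + length:]
    return zlib.compress(modified)
```

In both functions the cost of the operation depends on the size of the whole stored item rather than on the size of the change, and a decoded copy of the data must be held temporarily while the change is made.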
There is therefore a need for a data compression technology that allows for the manipulation of data in its compressed form without having to first uncompress the data.