1. The Field of the Invention
The present invention relates generally to digital data compression techniques. More specifically, the present invention relates to a method and computer executable instructions for compressing computer data using lossless data compression techniques. Even more specifically, the present invention relates to an efficient compression algorithm capable of efficiently switching between compression and non-compression modes.
2. The Relevant Technology
Data compression is a technique wherein a signal, or computer data which requires a certain number of bits for representation, is represented, or encoded, using a fewer number of bits. The ratio of the number of bits required by the original signal to the number of bits required by the encoded signal is known as the compression ratio. The complementary process, in which the signal or computer data is expanded and reconstructed in its original form, is called decompression, decoding or reconstruction.
Data compression is a technology which has matured during the most recent decade. At first blush, it might be thought that with the advances in technology and the availability of computer resources such as memory and execution bandwidth, such an abundance of resources would render data compression unnecessary. In fact, the contrary is true. While more computer resources such as memory and bandwidth are presently available, the conservation and efficient use of such resources is of preeminent concern due to the explosive appetite for, and availability of, transferable information. Therefore, there continues to be an insatiable appetite for the efficient transfer and storage of data.
Those familiar with the art of data compression appreciate that there are two major types of compression: lossy and lossless. In a lossy compression system, portions of the data that are determined to be less necessary are discarded, making exact reconstruction or decompression of the signal impossible. Lossy compression is employed for physical signals such as speech, audio, images and video, in which exact reconstruction of the original signal is usually not required for perceptual acceptability. Since such signals are generally destined for human perception, such as by the human auditory or visual systems, minor differences between the original and reconstructed signals may either be undetected by human systems or tolerable in their degraded state.
In contrast, lossless compression enables an exact reconstruction of the original signal to be performed upon decompression. That is to say, lossless compression achieves a perfect recreation of the original signal without the degraded or compromised characteristics of lossy compression techniques. One of the penalties of employing lossless compression, with its perfect reconstruction, is that the compression ratio, or the ability to compress a larger number of data bits into a smaller number of data bits, is greatly reduced. For certain types of data, it is imperative that the perfect reconstruction of lossless data compression be employed rather than the compromised reconstruction characteristic of lossy compression techniques. For example, computer data must be precisely reconstructed; otherwise, disastrous effects will occur. Therefore, lossless compression techniques must be employed for the compression of computer data used in computer communications.
With the voluminous amount of data exchanged over computer networks, it is apparent that data compression is a useful feature in computer communications. For example, in the transmission of highly compressible data, V.42bis can yield nearly a fourfold improvement in speed. Such a compression factor yields sizable cost savings when the transmission is over a toll line, such as a long distance telephone connection, or over a wireless telephone connection, such as a cellular telephone connection. The cost and time savings, as well as the benefit of a faster connection, are not negligible even over a local or non-toll telephone connection. Such improvements are highly desirable, resulting in a constant demand for better and improved lossless data compression algorithms.
Several approaches have been proposed and promulgated for lossless data compression. One such approach, commonly known as the "sliding window" approach, is an early modem dictionary-based compression scheme. In such an approach, a block of the most recently received data is stored. The window is continuously updated by dropping the oldest character in the window as each new character is received. Upon receiving a new character, the current window is searched for a matching string. When a matching string is located in the window, a pointer to the string in the window is sent to the decoder. An advantage of this compression technique is that the decoder does not need to perform string matching. The disadvantage of such an approach is that string comparisons against the look-ahead buffer must be performed for every position in the text window. Some improvements to the sliding window algorithm are also known, but all of these algorithms are biased towards exploiting the most recently encountered text. A second major problem with such an approach is the limited size of a phrase that can be matched, due to the limited size of the look-ahead buffer. It is known in the art that increasing the size of the sliding window and the look-ahead buffer does not alone solve such problems, because such an approach leads to enormous computational complexity and the algorithm cannot be performed in real time.
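The sliding-window matching described above can be sketched as follows. This is a minimal LZ77-style illustration, not any particular standardized scheme; the brute-force inner search over every window position makes the per-position comparison cost noted above explicit, and the `window_size` and `lookahead` values are arbitrary assumptions.

```python
def lz77_encode(data: bytes, window_size: int = 4096, lookahead: int = 16):
    """Encode data as (offset, length, literal) tokens, LZ77-style."""
    tokens = []
    pos = 0
    while pos < len(data):
        start = max(0, pos - window_size)
        best_len, best_off = 0, 0
        # The costly step: compare the look-ahead buffer against
        # every candidate position in the sliding window.
        for cand in range(start, pos):
            length = 0
            while (length < lookahead and pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - cand
        if pos + best_len == len(data) and best_len > 0:
            best_len -= 1  # keep one literal so every token carries a character
        tokens.append((best_off, best_len, data[pos + best_len]))
        pos += best_len + 1
    return tokens

def lz77_decode(tokens):
    """Reconstruct the original bytes; no string matching is needed."""
    out = bytearray()
    for offset, length, literal in tokens:
        for _ in range(length):
            out.append(out[-offset])  # copy from the already-decoded window
        out.append(literal)
    return bytes(out)
```

Note that the decoder merely copies from its own output buffer, illustrating why the string-matching burden in this scheme falls entirely on the encoder.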
The International Telecommunications Union (ITU), previously known as the CCITT, promulgated an improved encoding scheme known as the V.42bis standard for computer communications over a telephone network. The V.42bis standard is a textual substitution scheme based on Ziv-Lempel's second algorithm, in which, whenever a code word is transmitted, the dictionary is augmented with a new entry consisting of the string represented by the transmitted code word concatenated with the next character in the input stream. Initially, such a dictionary consists of all one-character strings. It is apparent that strings of characters with repeated substrings are efficiently coded by such an approach. For example, English text is moderately repetitive, and employing V.42bis achieves a compression ratio of about 2.3 on English text.
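The dictionary-augmentation rule can be illustrated with a short sketch in the spirit of Ziv-Lempel's second algorithm: emit a code word for the longest dictionary match, then add that match concatenated with the next input character as a new entry. For simplicity this sketch seeds the dictionary only with the characters actually present in the input, whereas V.42bis begins with every possible one-character string.

```python
def lz78_style_encode(text: str):
    """Greedy dictionary encoder: longest match out, match+next-char in."""
    # Seed with the one-character strings occurring in the input
    # (a simplification; V.42bis seeds with all one-character strings).
    dictionary = {c: i for i, c in enumerate(sorted(set(text)))}
    next_code = len(dictionary)
    codes = []
    i = 0
    while i < len(text):
        # Extend until the string starting at i falls out of the dictionary.
        j = i + 1
        while j <= len(text) and text[i:j] in dictionary:
            j += 1
        match = text[i:j - 1]          # longest dictionary match
        codes.append(dictionary[match])
        if j <= len(text):
            # Augment: transmitted string plus the character that follows it.
            dictionary[text[i:j]] = next_code
            next_code += 1
        i += len(match)
    return codes, dictionary
```

Running this on repetitive input shows how repeated substrings quickly become single code words, which is the source of the compression gain described above.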
Originally, Ziv and Lempel suggested an algorithm that outputs a code word for the longest string match together with the character that broke the match. T. Welch, in U.S. Pat. No. 4,558,302, made some important improvements to the original algorithm. In particular, the encoder in that implementation outputs only code words, which requires that the dictionary be initialized with all possible one-character strings. Such an improvement is very suitable for real-time applications such as data communications involving a modulator-demodulator (modem).
Yet another data compression scheme which has been proposed for computer communications over a telephone network is known as the Microcom Networking Protocol (MNP-7). Such a protocol associates pairs of characters with a code word. One implementation assumes 8-bit characters, yielding a total of 256 possible characters; however, the total number of all possible two-character combinations (65,536) is infeasible to store within an embedded system such as a modem. Actual implementations of such a protocol therefore typically store only the most common pairs in a memory of approximately 1024 bytes.
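The pair-table idea can be sketched as follows: count pair frequencies in a representative sample and retain only the most common pairs within a small fixed memory budget. This is an illustrative sketch, not the actual MNP-7 table format; the four-bytes-per-entry figure is an assumption made purely so the 1024-byte budget translates into an entry count.

```python
from collections import Counter

def build_pair_table(sample: bytes, table_bytes: int = 1024):
    """Map the most frequent two-character pairs in a sample to short
    code words, within a fixed memory budget (MNP-7-style sketch)."""
    entry_size = 4                       # assumed: 2 pair bytes + 2-byte code
    max_entries = table_bytes // entry_size
    # Count all adjacent character pairs in the sample.
    pairs = Counter(zip(sample, sample[1:]))
    # Assign the shortest codes to the most common pairs.
    return {pair: code for code, (pair, _count)
            in enumerate(pairs.most_common(max_entries))}
```

Under these assumptions the 1024-byte budget holds only 256 of the 65,536 possible pairs, which is why pair selection matters so much in such a scheme.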
The present implementation of V.42bis incorporates additional functionality as proposed in Miller and Wegman's idea of gradually increasing the size of the dictionary. In such an implementation, upon initialization or reinitialization there exist 512 code words in the dictionary, with a code word size of nine bits. Such an implementation also provides two modes of operation: compressed mode and transparent mode. Compression and dictionary updating are performed in both modes; however, in transparent mode, characters are sent as they are received. In both modes, there are commands which can be transmitted to the decoder. In transparent mode, a two-character sequence, beginning with a special escape character and followed by the command ECM (enter compression mode), signals the decoder to switch into compressed mode. In compressed mode, there are three separate control code words. The control code word ETM (enter transparent mode) signals the decoder to switch to transparent mode.
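Gradually growing the dictionary implies that the code word width must grow with it. A small sketch of that bit-width rule, assuming (per the description above) a nine-bit starting size for the initial 512 code words:

```python
def codeword_bits(dict_size: int, minimum: int = 9) -> int:
    """Bits needed to address dict_size code words, starting at a
    9-bit minimum as in the gradual-growth scheme described above."""
    bits = minimum
    while (1 << bits) < dict_size:
        bits += 1  # widen the code word each time the dictionary doubles
    return bits
```

Starting small keeps early code words short, so the scheme pays for a larger dictionary only once enough entries have actually been learned.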
As described above, the compression ratio achieved by an algorithm in compressed mode depends upon the compressibility of the data. The same holds for the compression algorithms in the V.42bis standard. For highly compressible data, the compression ratio can approach approximately 4. However, if the data in question is not compressible, the algorithm employed by V.42bis in compressed mode not only provides no compression, but can lead to data expansion. Such a situation would arise if, for example, the input to the compression process is data that has already undergone compression. When highly incompressible data is encountered, it is advantageous to transmit such data (e.g., previously compressed data) in transparent (i.e., not additionally compressed) mode. It should be apparent that transparent mode does not provide any data compression and therefore will not result in any data expansion.
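The V.42bis standard leaves the switching decision to the encoder. One hypothetical heuristic, not mandated by the standard, is to trial-compress each block and fall back to transparent mode whenever compression would expand the data; here `zlib` stands in for the modem's dictionary coder purely for illustration.

```python
import zlib

def choose_mode(block: bytes) -> str:
    """Hypothetical encoder-side heuristic (not part of V.42bis):
    use compressed mode only if the block actually shrinks."""
    # zlib is a stand-in for the modem's dictionary coder.
    if len(zlib.compress(block)) < len(block):
        return "compressed"
    return "transparent"
```

Feeding already-compressed data through such a check illustrates the expansion problem: the trial compression comes out larger than the input, so the heuristic selects transparent mode.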
While the V.42bis standard provides several requirements for its implementation, many aspects of the standard are under-specified. For example, the V.42bis standard does not specify when and how the encoder may decide to switch from transparent mode to compressed mode and vice versa. The standard also does not address when and how an encoder decides upon discarding an entire dictionary. As a result of such ambiguities and under-specification, different implementations or flavors of V.42bis vary in performance between manufacturers, causing varying levels of compression efficiency. That is to say, one manufacturer's implementation of the V.42bis standard is not necessarily as efficient as another manufacturer's implementation of the standard.
The use of data communications has become much more popular since the incorporation of the V.42bis compression standard into modern modems. Additionally, since the advent of the Internet, the use of modems for data communications has exploded. Users employing their modems to access the Internet must run a point-to-point protocol (PPP) or serial line Internet protocol (SLIP) stack on the computer hosting the modem. Those familiar with Internet usage and the associated protocols appreciate the multimedia, and hence multi-data-type, nature of the computer data transferred between the Internet and the user's modem. The effect of such use of these protocols is that several types of data are multiplexed. Thus, consecutive packets can have very different compression characteristics. Such conditions result in reduced compression performance for V.42bis modems. Therefore, there is a need to improve compression performance when several types of data are multiplexed over one modem line.
One suggested approach has been to develop a compression algorithm different from the widely accepted V.42bis modem standard. While such an approach may prove meritorious in that a standard could be developed specifically for handling such conditions, the fact remains that the large majority of modem installations presently employ the V.42bis standard, and the development of an incompatible standard would result in a large installed modem base becoming obsolete or unable to communicate with newer-protocol modems. Therefore, there is a need to retain compatibility with the V.42bis protocol standard while attempting to improve the compressibility of multi-data-type data streams exchanged between modems.