The present invention relates to a system and method for compressing data and more particularly to a system and method for compressing data that is stored or that is transmitted between remote points using dynamically generated compressed data code.
In recent years, significant effort has been expended in the development of systems and techniques for compressing data, either for data storage or for transmission from one unit of data terminal equipment (DTE) to another. Most codes used for data transmission, such as ASCII and EBCDIC, do not lend themselves to efficient packing of the most information into the least number of bits principally because they were designed for simplicity and not for efficiency.
Most codes use the same number of bits to send each character (generally seven or eight bits each). As a result, common characters like the space take just as many bits to send, and therefore just as much time, as infrequently sent or stored characters like the ampersand (&). Many data compression schemes reduce this transmission time by coding commonly sent characters with fewer bits than rarely sent characters. For example, in a text file the letter "E", which is generally sent or stored more frequently than most of the other characters, might be sent in just two or three bits, while a character such as the exclamation mark (!) might be sent in ten or twelve bits. The net result is that, on average, such systems take less time to send or store complete files than if the same number of bits were used for each character.
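The saving described above can be illustrated with a short calculation. The following is a minimal sketch only; the four-symbol alphabet, its probabilities, and the code lengths are hypothetical values chosen for illustration and are not taken from any measured text.

```python
# Hypothetical symbol probabilities and code lengths (illustrative only).
probabilities = {" ": 0.5, "E": 0.25, "T": 0.125, "&": 0.125}
code_lengths = {" ": 1, "E": 2, "T": 3, "&": 3}  # bits per symbol

FIXED_BITS = 8  # a fixed-length code such as 8-bit ASCII


def average_bits(probabilities, code_lengths):
    """Expected number of bits per symbol for a variable-length code."""
    return sum(p * code_lengths[s] for s, p in probabilities.items())


variable_bits = average_bits(probabilities, code_lengths)
print(f"fixed: {FIXED_BITS} bits/symbol, variable: {variable_bits} bits/symbol")
```

Under these assumed figures the variable-length code averages fewer bits per symbol than the fixed-length code, even though the rare symbols individually cost more than the common ones.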
One well known compression technique is Huffman Coding, which provides optimal coding for discrete symbols from a finite set given two constraints. These constraints are that (1) the probability of each symbol occurring is independent of the preceding symbols, and (2) the probability of each symbol occurring is known. (The term "symbols", when used in connection with data compression, refers to the items to be coded, either for transmission or storage.) Huffman Coding utilizes the probability of occurrence of each symbol in order to assign a unique bit sequence to code each symbol, with the more probable symbols getting shorter bit sequences.
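The assignment of bit sequences in Huffman Coding can be sketched as follows. This is a minimal illustration of the textbook procedure (repeatedly merging the two least probable entries), not the method of the present invention; the function name `huffman_code` and the sample probabilities are assumptions introduced for the example.

```python
import heapq
from itertools import count


def huffman_code(probabilities):
    """Return {symbol: bit string} for a dict of known symbol probabilities."""
    tiebreak = count()  # unique counter so the heap never compares the dicts
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probabilities.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single symbol still needs one bit
        return {s: "0" for s in heap[0][2]}
    while len(heap) > 1:
        # Merge the two least probable subtrees, prefixing their codes.
        p0, _, codes0 = heapq.heappop(heap)
        p1, _, codes1 = heapq.heappop(heap)
        merged = {s: "0" + bits for s, bits in codes0.items()}
        merged.update({s: "1" + bits for s, bits in codes1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


# Hypothetical probabilities, chosen only for illustration.
codes = huffman_code({" ": 0.5, "E": 0.25, "T": 0.125, "&": 0.125})
for symbol, bits in codes.items():
    print(repr(symbol), bits)
```

In this sketch the most probable symbol receives a one-bit code and the two least probable symbols receive three-bit codes, and no code is a prefix of another, so the bit stream can be decoded without separators.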
Huffman Coding requires that the probability of each symbol occurring not be influenced by the sequence of preceding symbols. In English text, however, certain characters depend strongly on what precedes them; for example, a "U" almost always follows a "Q". The likelihood of occurrence of each symbol is therefore not represented solely by the overall probability of the occurrence of that symbol, and coding schemes that do not take such factors into account are not optimal. A greater optimization of the coding may be obtained by choosing larger symbols, such as "QU". This optimization, however, is achieved at the expense of a much more complex and time consuming system, which would not be practical in a data communications device that is intended to operate in real time.
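The effect of symbol dependence can be estimated by comparing the information content of a character taken alone with its information content once the preceding character is known. The following is a rough sketch under stated assumptions: the sample string and the function names are hypothetical, and a text this short gives only an illustrative estimate, not a statistically meaningful one.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy, in bits, of the distribution given by a Counter."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def order0_and_order1_bits(text):
    """Bits per character ignoring context, and given the previous character."""
    pairs = list(zip(text, text[1:]))
    order0 = entropy(Counter(b for _, b in pairs))
    by_context = {}
    for previous, current in pairs:
        by_context.setdefault(previous, Counter())[current] += 1
    # Weight each context's entropy by how often that context occurs.
    order1 = sum(
        sum(ctx.values()) / len(pairs) * entropy(ctx)
        for ctx in by_context.values()
    )
    return order0, order1


sample = "QUICK QUIET QUEEN QUOTES QUAINT QUERIES"  # hypothetical sample text
order0, order1 = order0_and_order1_bits(sample)
print(f"without context: {order0:.2f} bits, with context: {order1:.2f} bits")
```

The context-aware figure can never exceed the context-free figure for the same data, which is why coding larger symbols such as "QU" can improve compression, at the cost of a larger and slower coder.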
An even greater drawback associated with systems utilizing Huffman Coding is the requirement that the probability of each symbol occurring be known. This constraint requires that a Huffman Coding system be essentially a two pass process. The first pass examines the data to determine the probability of occurrence of each symbol, from which the Huffman code is constructed, and the second pass actually encodes and transfers the data. To achieve a worthwhile system, data must be held back from transmission until enough probability data has been gathered to be statistically significant. The holding back of data, however, introduces transmission delays that are unacceptable in full duplex, interactive communications. Further, the coding tables must also be sent to the receiver so that it can decompress the data, and any bits required to transfer the coding tables must be subtracted from the gains made by the compression process itself.
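The cost of transmitting the coding tables can be made concrete with a simple calculation. This is a sketch only; the function name, the average coded length of 5 bits, and the table size of 256 entries of 8 bits each are hypothetical figures assumed for illustration.

```python
def net_saving_bits(n_symbols, fixed_bits, avg_coded_bits, table_bits):
    """Bits saved by compression after subtracting the coding table overhead."""
    gross_saving = n_symbols * (fixed_bits - avg_coded_bits)
    return gross_saving - table_bits


# Assumed table format: one 8-bit entry for each of 256 possible symbols.
TABLE_BITS = 256 * 8

short_message = net_saving_bits(100, 8, 5, TABLE_BITS)
long_message = net_saving_bits(10_000, 8, 5, TABLE_BITS)
print(short_message, long_message)
```

Under these assumptions a short message is made larger by compression, because the table costs more than the coding saves, while a long message shows a net gain. Interactive traffic tends to resemble the short case.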
It is therefore a principal object of the present invention to provide a system and method for compressing data in one pass that is fully adaptive so that the transmitter determines which characters are most common and to what degree.
Another object of the present invention is to provide a system and method for compressing data that operates without static probability tables and without the transmission of coding data.
It is a further object of the present invention to provide a system and method for compressing data that will efficiently operate with half or full duplex modems used for batch or interactive communication.