Various schemes and devices for data compression are known in the art. Many known techniques rely upon statistics which relate to the frequency of occurrence of elements or "characters" in a data set. Typically, a set or file of data to be compressed must be prescanned or processed in order to accumulate the statistics of occurrence of the characters in the data set, preparatory to assigning codes for representing the characters. Then, shorter codes are assigned to the more frequently occurring characters and longer codes are assigned to the less frequently occurring characters.
In the well known Shannon-Fano code, an input file to be compressed is analyzed, and the probabilities of occurrence of particular characters are arranged in descending order. The set of characters is then divided into two subsets of equal or almost equal total probability, and a zero is assigned as the first code digit in one subset and a one is assigned as the first code digit in the other subset. These steps are repeated within each subset, appending a further code digit each time, until each subset contains only one character. Codes produced by this method are instantaneously decodable since no code word is a prefix of any other code word.
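The Shannon-Fano steps described above can be illustrated with a minimal Python sketch. The function name `shannon_fano`, the use of raw character counts in place of probabilities, and the tie-breaking rule are assumptions made for illustration and are not taken from any cited reference.

```python
from collections import Counter

def shannon_fano(text):
    """Return {char: code} built by recursively splitting the sorted symbol list."""
    counts = Counter(text)  # prescan: accumulate the statistics of occurrence
    # Descending order of frequency; ties broken by symbol (an assumption).
    symbols = sorted(counts, key=lambda s: (-counts[s], s))
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) < 2:
            return  # a subset with one character needs no further digits
        total = sum(counts[s] for s in group)
        # Choose the cut that makes the two subsets as nearly equal in weight as possible.
        running, best_cut, best_diff = 0, 1, None
        for i in range(1, len(group)):
            running += counts[group[i - 1]]
            diff = abs(total - 2 * running)
            if best_diff is None or diff < best_diff:
                best_cut, best_diff = i, diff
        upper, lower = group[:best_cut], group[best_cut:]
        for s in upper:
            codes[s] += "0"  # zero for one subset
        for s in lower:
            codes[s] += "1"  # one for the other subset
        split(upper)
        split(lower)

    split(symbols)
    return codes

print(shannon_fano("aaaabbcd"))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```

In this example the most frequent character receives a one-digit code and the two rarest characters receive three-digit codes, and no code word is a prefix of another. A file containing only one distinct character would receive an empty code in this sketch, a corner case that a practical coder would have to handle separately.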
The popular Huffman code also has the property of instantaneous decodability, but employs a coding tree. In this method, which yields a provably minimum average word length, again the source or input data file must be prescanned, and the probabilities of the characters arranged in descending order. The two lowest probabilities are combined, the combined probability takes their place in the ordering, and the step is repeated until a single probability of unity remains; a probability tree is thereby constructed, placing the higher probability branch on top. A zero is assigned to the upper member and a one to the lower member of each pair on the tree. The path from each item's probability is then traced to the unity point, recording the ones and zeros along the path. The resultant code is the one-zero sequence obtained. Needless to say, analysis of the entire data set is required before the codes can be assigned.
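The Huffman construction just described can be sketched in Python as follows. This is an illustrative sketch only: the name `huffman_codes`, the use of character counts in place of probabilities, and the tie-breaking counter are assumptions. The code digits are prepended at each merge so that each finished code reads from the unity point down to the character.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Return {char: code} by repeatedly combining the two lowest weights."""
    counts = Counter(text)  # prescan of the entire data set
    # Heap entries are (weight, tie_breaker, {char: partial_code}); the unique
    # tie_breaker keeps Python from ever comparing the dictionaries.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w_low, _, low = heapq.heappop(heap)    # lowest weight
        w_high, _, high = heapq.heappop(heap)  # second lowest weight
        # Zero for the higher-probability (upper) member, one for the lower.
        merged = {s: "0" + c for s, c in high.items()}
        merged.update({s: "1" + c for s, c in low.items()})
        heapq.heappush(heap, (w_low + w_high, next_id, merged))
        next_id += 1
    return heap[0][2] if heap else {}

codes = huffman_codes("aaaabbcd")
print(sorted((len(code), ch) for ch, code in codes.items()))
# [(1, 'a'), (2, 'b'), (3, 'c'), (3, 'd')]
```

For this input the code lengths are 1, 2, 3 and 3 digits, so the eight characters are represented in 14 digits rather than the 16 that a fixed two-digit code would need.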
While these data compression techniques are suitable when there is ample processing time available for prescanning the data, in real time applications such as data communications involving a modulator-demodulator ("modem") it is not always possible or desirable to prescan data and establish the most efficient code. In such applications, the nature of the data may change, rendering a fixed compression scheme inefficient. For example, modems may transmit text files, graphics files, mixed text and graphics, software object code, spreadsheet files, interactive communications with other systems, or other types of data. In order for compression to be effective, prescanning and/or a separate encoding scheme would be required for each different type of data expected. These types of data can change unpredictably, even within a given transmission. In many instances, a less-than-optimum code may be quite acceptable to provide reduced overall transmission or data storage time as well as compression.
Accordingly, there is a need for a data compression technique which is suitable for use in a modem or other real time data compression applications, wherein the types of data can change frequently and even during a transmission.
One prior art adaptive real time data compression method relies on dynamically created tables in both an encoding device and a decoding device. In U.S. Pat. No. 4,612,532 of Frances L. Bacon et al., a series of characters in a data stream is encoded in accordance with dynamically created tables in the encoding device, and the decoding device is constructed in a manner to create corresponding tables for decoding the encoded data, relying on the structure of the encoded data to create the decoding tables dynamically. Primarily, the encoding device relies on the assumption that a given character in the data stream has a given probability of being followed by one of a set of probable candidates for the next successive character. Accordingly, the encoding device creates a table in which, for a given character, there is presented a list of candidates, in approximate order of frequency of occurrence, for the next successive character that would occur in the data stream. When the given character occurs in the data stream, followed by a character in the table, the encoding device sends a binary code to represent the latter character based on the character's ordinal position in the table. This code is shortest for the most frequently occurring candidates and longer for candidates that occur less frequently. The table is created based on the local frequency of occurrence of a character after a given character, thereby allowing the table to be changed dynamically as the local frequency of occurrence changes.
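The general idea of follower tables that are built identically at both ends of a link can be illustrated with the Python sketch below. It is not the method of the Bacon patent: the swap-toward-the-front update rule, the `("literal", ch)` escape for characters not yet in a table, and the use of ordinal positions instead of actual binary codes are all assumptions chosen to keep the example short.

```python
from collections import defaultdict

def _update(table, prev, ch):
    """Move ch one place toward the front of prev's follower list (assumed rule)."""
    followers = table[prev]
    if ch in followers:
        i = followers.index(ch)
        if i > 0:
            followers[i - 1], followers[i] = followers[i], followers[i - 1]
    else:
        followers.append(ch)  # first time ch has followed prev

def encode(text):
    """Emit ("rank", position) for known followers, ("literal", ch) otherwise."""
    table = defaultdict(list)
    prev, out = None, []
    for ch in text:
        followers = table[prev]
        if ch in followers:
            out.append(("rank", followers.index(ch)))  # small rank = frequent follower
        else:
            out.append(("literal", ch))
        _update(table, prev, ch)
        prev = ch
    return out

def decode(tokens):
    """Rebuild the same tables from the token stream alone and recover the text."""
    table = defaultdict(list)
    prev, out = None, []
    for kind, value in tokens:
        ch = table[prev][value] if kind == "rank" else value
        _update(table, prev, ch)
        out.append(ch)
        prev = ch
    return "".join(out)

tokens = encode("abababab")
print(tokens[3:])      # [('rank', 0), ('rank', 0), ('rank', 0), ('rank', 0), ('rank', 0)]
print(decode(tokens))  # abababab
```

Because the decoder applies the same update after every character, its tables track the encoder's without any table being transmitted. A practical coder would then map small ordinal positions to short binary codes and larger positions to longer ones.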
In the Bacon patent, it is assumed that the data being compressed is not random, and that a given character has more than a random chance of being followed by one of a plurality of probable candidates for the next character. This assumption limits the usefulness of the method to applications such as English language text, wherein it is highly likely that the technique can be successfully used. However, in applications wherein a mixture of text and numeric data is involved (such as a spreadsheet file), when the data type shifts to numeric from text, compression efficiency will be lost since numeric files are more likely to be random than text files. Nonetheless, when the data type shifts to numeric, the probability of occurrence of the numeric digits increases dramatically compared to the probability of occurrence of text characters, so that increased efficiencies could be obtained if the compression algorithm could adapt to the changing nature of the data type.
Accordingly, the Bacon method is most suitable for applications wherein the type of data is predictable, that is, it is predetermined that the file being transmitted is text or numbers (but not both), or wherein the data is not random. However, in applications wherein the type of data is not predictable, such as where text and numbers are mixed or the data is of unknown type, this method will tend to employ a less-than-optimum code. Accordingly, there is a need to provide an adaptive data compression method which is dynamically adaptable to any type of data or distribution of types of data.