String-based data compression algorithms operate by identifying a recurring string of characters in a data stream and substituting a short codeword for the string. Many schemes and devices for data compression are known in the art. The most common data compression algorithms attempt to maximize the number of characters in the data stream for which short codewords are substituted. Broadly speaking, strings of characters are stored in an "encoding table" or library. Each string in an encoding table is associated with a codeword. The stream of data characters is compared to the strings in the encoding table. If a portion of the data characters in the stream matches a string of characters in the encoding table, a codeword is substituted in the stream of data for those data characters. The greater the number of characters from the data stream that are matched to a string of characters in the encoding table, the higher the data compression. Thus, the more closely the strings of characters in the encoding table match the incoming data, the greater the data compression.
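By way of illustration only, the substitution described above can be sketched as a greedy longest-match lookup against a fixed table. The function name `encode`, the token tuples, and the three-entry `TABLE` are assumptions made for this sketch and do not come from any of the schemes discussed here.

```python
def encode(data, table):
    """Greedy longest-match substitution against a fixed string table."""
    max_len = max(len(s) for s in table)
    out = []
    i = 0
    while i < len(data):
        # Try the longest candidate first so that more characters
        # are replaced by a single codeword.
        for n in range(min(max_len, len(data) - i), 0, -1):
            chunk = data[i:i + n]
            if chunk in table:
                out.append(("code", table[chunk]))
                i += n
                break
        else:
            # No table string starts here: pass the character through.
            out.append(("lit", data[i]))
            i += 1
    return out


# Hypothetical encoding table: string -> codeword.
TABLE = {"BAG": 0, "BIN": 1, "DOG": 2}
tokens = encode("BAGDOGS", TABLE)
```

In this sketch, six of the seven input characters are covered by two codewords, and only the trailing "S" is passed through as a literal.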
Perhaps the easiest manner of making the encoding table most closely match the data stream is to pre-scan the data to be transmitted and build the encoding table from that data. Theoretically, this would yield the optimum data compression algorithm for the data in question.
However, in real-time applications such as data communications involving a modulator-demodulator ("modem"), it is not always possible or desirable to pre-scan data and establish the most efficient data encoding scheme. In such applications, the nature of the data may change, rendering a fixed compression scheme inefficient or even detrimental. Thus, a compression scheme that adapts in real-time to the nature of the data is desirable.
Most prior art adaptive real-time data compression methods rely on dynamically created tables in both an encoding device and a decoding device. In U.S. Pat. No. 4,612,532, issued to Bacon et al., a series of characters in a data stream is encoded in accordance with dynamically created tables in the encoding device, and the decoding device is constructed to create corresponding tables for decoding the encoded data, relying on the structure of the encoded data to create the decoding tables dynamically. Primarily, the encoding device relies on the assumption that a given character in the data stream has a given probability of being followed by one of a set of probable candidates for the next successive character. Accordingly, the encoding device creates a table in which, for a given character, there is presented a list of candidates, in approximate order of frequency of occurrence, for the next successive character that would occur in the data stream. When the given character occurs in the data stream, followed by a character in the table, the encoding device sends a binary code to represent the latter character based on that character's ordinal position in the table. This code is shortest for the most frequently occurring candidates and longer for the candidates that occur less frequently. The table is created and updated based on the local frequency of occurrence of a character after a given character, thereby allowing the table to be changed dynamically as the nature of the local data changes.
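The following is a minimal sketch of the general idea of a per-character successor table whose ordinal positions are transmitted, with the decoder rebuilding the same table from the decoded data. It is not the method of the Bacon et al. patent: the class `SuccessorModel`, the literal escape token, and the tie-breaking rule are assumptions, and the variable-length binary coding of the ordinal position is omitted.

```python
from collections import defaultdict


class SuccessorModel:
    """Counts which characters follow each character seen so far."""

    def __init__(self):
        self.counts = defaultdict(dict)  # previous char -> {next char: count}

    def ranked(self, prev):
        # Most frequent successor first; ties broken by character value
        # so that encoder and decoder always agree on the order.
        c = self.counts[prev]
        return sorted(c, key=lambda ch: (-c[ch], ch))

    def update(self, prev, ch):
        self.counts[prev][ch] = self.counts[prev].get(ch, 0) + 1


def encode(text):
    model, out, prev = SuccessorModel(), [], None
    for ch in text:
        candidates = model.ranked(prev)
        if ch in candidates:
            out.append(("rank", candidates.index(ch)))  # ordinal position
        else:
            out.append(("lit", ch))                     # not yet in the table
        model.update(prev, ch)
        prev = ch
    return out


def decode(tokens):
    model, out, prev = SuccessorModel(), [], None
    for kind, value in tokens:
        ch = model.ranked(prev)[value] if kind == "rank" else value
        out.append(ch)
        model.update(prev, ch)  # mirror the encoder's table update
        prev = ch
    return "".join(out)
```

Because both sides update the table only from characters that have already been decoded, no table ever needs to be transmitted. A real coder would then assign the shortest binary codes to the lowest ordinal positions.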
Another encoding scheme was promulgated by the International Telegraph and Telephone Consultative Committee ("CCITT") as the V.42bis standard, which is incorporated by reference herein. The V.42bis standard operates similarly to the above, except that several characters are encoded using a single codeword. The V.42bis standard labels each character in the alphabet a "root character." The characters that might follow a root character are called strings or branches. For example, refer to the prior art encoding scheme set forth in FIG. 1, which shows the root characters 4 "A", "B", "C" and "D". The characters depending from root character "B" are "A" and "I." Furthermore, there are characters depending from the characters "A" and "I". Thus, the words "BAG" and "BIN" are spelled out in strings beneath the root character "B." Each group of characters depending from a root character, such as "DE", "DO" and "DOG", is referred to as a string 5.
Each of the strings in the encoding table is associated with a codeword. If the "BAG" sequence of data characters is detected in a data stream and matched with the "BAG" string in the encoding table, a binary codeword representing the word "BAG" is transmitted in place of the word "BAG". Together, all the branches depending from a single root character are called a tree 6. A tree can take many shapes, such as short and wide or long and narrow.
In the V.42bis standard, the nature of the strings 5 in each tree is dynamically altered depending on the data in the data stream. Thus, characters in the encoding table for more recently used words are kept and grown based upon the data in the data stream, while less recently used leaf characters in trees may be pruned. For example, the string in FIG. 1 with the word "BAT" might be extended to encode the word "BATTLE". Therefore, a dictionary of the most recently used words is created and stored in the encoding and decoding tables in real-time.
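The growth of strings beneath root characters can be illustrated with a simplified dictionary coder in the Lempel-Ziv-Welch style. This is a sketch under stated assumptions, not the procedure specified by V.42bis: the function `compress`, the flat dictionary standing in for the trees, and the three-character alphabet are hypothetical, every input character is assumed to belong to the alphabet, and the pruning of less recently used leaves is omitted.

```python
def compress(data, alphabet):
    """Emit codewords while growing matched strings one character at a time."""
    # Start with one entry per root character.
    table = {ch: i for i, ch in enumerate(alphabet)}
    out, current = [], ""
    for ch in data:
        if current + ch in table:
            current += ch                      # keep walking down the tree
        else:
            out.append(table[current])         # codeword for the longest match
            table[current + ch] = len(table)   # extend that string by one character
            current = ch
    if current:
        out.append(table[current])
    return out, table


codes, table = compress("BAGBAGBAG", "ABG")
```

After the nine input characters are processed, the table has grown from three root characters to eight entries, including the string "BAG", and only six codewords have been emitted.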
Another data compression scheme familiar to those skilled in the art is the Microcom Networking Protocol-7 (MNP-7). This protocol associates pairs of characters with a codeword. Theoretically, one could associate pairs and, over time, build a library containing all possible combinations. If each of the 256 possible characters were combined into all possible pairs, it would take about one-half of a megabyte of memory to store the combinations. Therefore, the most common pairs are typically kept in a memory of 1024 bytes in size. The pairs are rotated out of memory as new pairs are added to memory. Because of practical limitations on memory size, the MNP-7 table is kept to a limited size and the entries in that table are selected based on the theory that recently used pairs will be repeated.
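The idea of a small pair table from which older pairs are rotated out can be sketched as follows. This is not the MNP-7 procedure: the class `PairTable`, the least-recently-used eviction rule, and the capacity expressed in entries rather than bytes are all assumptions chosen for illustration.

```python
from collections import OrderedDict


class PairTable:
    """Bounded table of character pairs with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pairs = OrderedDict()

    def touch(self, pair):
        """Record a use of `pair`; return True if it was already stored."""
        hit = pair in self.pairs
        if hit:
            self.pairs.move_to_end(pair)        # mark as most recently used
        else:
            if len(self.pairs) >= self.capacity:
                self.pairs.popitem(last=False)  # rotate out the oldest pair
            self.pairs[pair] = True
        return hit


table = PairTable(capacity=2)
hits = [table.touch(p) for p in ["th", "he", "th", "an", "he"]]

# Number of distinct ordered pairs over a 256-character alphabet.
total_pairs = 256 * 256
```

With a capacity of two entries, only the third lookup is a hit; the final "he" misses because it was rotated out when "an" arrived. The full set of 65,536 possible pairs is what a bounded table of this kind avoids storing.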
The "sliding window" approach to data compression is also familiar to those skilled in the art. A block of the most recently received data is stored (the window). As characters are received, the "oldest" character in the window is dropped out. Thus, the window is continually updated with new data. After each character is received, the current window is reviewed for a matching string. If a string is located in the window, a pointer to the string in the window is sent to the decoding device. The decoding device uses the pointer to access its own duplicate window and decode the characters.
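A minimal sketch of the sliding window approach is given below, assuming a pointer of the form (offset, length) and a brute-force search of the window. The token format, the `window_size` and `min_match` parameters, and the restriction that a match may not overlap the current position are assumptions made to keep the example short.

```python
def encode(data, window_size=16, min_match=2):
    """Replace repeats found in the recent window with (offset, length) pointers."""
    out, i = [], 0
    while i < len(data):
        start = max(0, i - window_size)        # oldest character still in the window
        best_len, best_off = 0, 0
        for j in range(start, i):
            n = 0
            # Extend the match, but never past the current position.
            while j + n < i and i + n < len(data) and data[j + n] == data[i + n]:
                n += 1
            if n > best_len:
                best_len, best_off = n, i - j
        if best_len >= min_match:
            out.append(("ptr", best_off, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out


def decode(tokens):
    """Rebuild the data by copying from the decoder's own duplicate window."""
    buf = []
    for tok in tokens:
        if tok[0] == "lit":
            buf.append(tok[1])
        else:
            _, off, length = tok
            start = len(buf) - off
            buf.extend(buf[start:start + length])
    return "".join(buf)


tokens = encode("abcabcabc")
```

Here the second and third occurrences of "abc" are each sent as a single pointer, and the decoder recovers the original text from its own copy of the window.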
Those skilled in the art will understand that the above data compression systems operate upon the assumption that data characters are non-random. The schemes assume that if a word occurred once, it will likely occur again. Further, the V.42bis standard relies on the inherent correlation, such as that defined by rules of grammar, between characters in human language.
In all of the above data compression schemes, a larger encoding table allows a greater number of strings and/or longer strings to be stored, and the greater the number and length of strings stored in the encoding table, the more likely that a stream of data characters will be matched with a string in the encoding table. Therefore, it is frequently true that the larger the encoding table the greater the data compression.
The total width and length of the tree branches in any scheme are limited by the total memory space allocated to the encoding table. In modems using the V.42bis standard, the memory space typically allocated to the encoding table is initially set to hold 2048 entries or strings, in a memory of 32,768 bytes.
Apart from the need to limit the memory space consumed in storing the encoding table, the size of the encoding table is also limited by two trade-offs. First, the modem must review each string under a root character to find a string matching the characters in the data stream. This process of reviewing each character in the character stream and trying to match the characters in the stream with a string in the encoding table takes significant processing time. Thus, limiting the size of the encoding table recognizes the inherent trade-off between having a large encoding table, thereby matching more strings to the data stream to achieve greater data compression, and the slow-down in execution time, and thus in data throughput, caused by using a larger encoding table.
The second trade-off is that increasing the size of the encoding table requires an increase in codeword size. For example, an eight-bit codeword can represent, at most, 256 (2^8) strings. Thus, if an encoding table is increased in size to contain more than 256 strings, a larger codeword must be utilized. Thus, although more and longer strings in an encoding table may increase compression by matching more strings, the demand for an increased number of codewords leads to an increased codeword size that reduces throughput and the compression ratio. On the other hand, a smaller encoding table allows for fewer strings, but codeword length is reduced.
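The arithmetic behind this trade-off can be sketched as follows, assuming fixed-width codewords and eight-bit characters. The helper names `codeword_bits` and `ratio`, and the average match length of three characters used in the example, are hypothetical.

```python
def codeword_bits(entries):
    """Smallest fixed codeword width, in bits, that can index `entries` strings."""
    return max(1, (entries - 1).bit_length())


def ratio(avg_match_chars, entries, char_bits=8):
    """Input bits consumed per output bit for a given average match length."""
    return (avg_match_chars * char_bits) / codeword_bits(entries)


small = ratio(3, 256)    # 24 input bits per 8-bit codeword
large = ratio(3, 2048)   # 24 input bits per 11-bit codeword
```

A 256-entry table needs 8-bit codewords, 257 entries already need 9 bits, and 2048 entries need 11 bits. At the same average match length, the larger table therefore yields a lower ratio; it pays off only if its extra strings lengthen the average match enough to offset the wider codeword.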
As will be familiar to those skilled in the art, the above data compression/encoding schemes assume, and are most suitable for, applications wherein the data is highly correlated, that is, it is predetermined that the file being transmitted is text or numbers (but not both), and wherein the data is not random. However, in applications wherein the type of data is not predictable and changes over time, such as a real-time communications system, the above methods will employ a less than optimum code. The above compression schemes only adapt to changes in the correlation between data by reallocating codewords in real-time within a fixed-size encoding table.
Accordingly, there is a need to provide an adaptive data compression method which achieves greater dynamic adaptability relative to any type of data or distribution of types of data.
Furthermore, there is a need for a real-time data encoding scheme that balances the need to limit the processing time devoted to attempting to compress random or pseudo-random data, and yet achieves a high rate of data compression for non-random (correlated) data.
Furthermore, there is a need for an encoding scheme that eliminates passing long strings of identical characters into the encoding table. Specifically, when a string of identical characters is received by a modem, the modem will alter the encoding table to contain a very long string of the identical characters that will be compressed before transmission. This long string of characters distorts the encoding and decoding tables, and frequently causes inefficiency for users of the V.42bis standard. Specifically, valuable space in the encoding table is used to store the string of identical characters, which string is usually useless during later data transmission.
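The effect of a run of identical characters on a growing string table can be observed with a simplified Lempel-Ziv-Welch style dictionary builder. This is an illustrative sketch only and not the V.42bis procedure; the function `build_table` and the two-character alphabet are assumptions, and every input character is assumed to belong to the alphabet.

```python
def build_table(data, alphabet):
    """Grow a string table over `data` and return it (codewords are not kept)."""
    table = {ch: i for i, ch in enumerate(alphabet)}
    current = ""
    for ch in data:
        if current + ch in table:
            current += ch
        else:
            table[current + ch] = len(table)  # each miss adds a longer string
            current = ch
    return table


table = build_table("A" * 10, "AB")

# Entries longer than one character that consist of a single repeated character.
run_entries = [s for s in table if len(s) > 1 and len(set(s)) == 1]
```

Ten identical characters add the entries "AA", "AAA" and "AAAA" to the table. A longer run keeps adding progressively longer entries of the same character, each of which occupies a table slot that is useful only if a similar run recurs.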