There are a wide variety of data compression techniques currently utilized in the field of telecommunications. Each of these different approaches has its own unique characteristics and advantages.
In accordance with one such approach, known as the MNP5 coding algorithm, an encoder takes a stream of data bits and converts this data to a code for transmission. MNP5 coding utilizes two distinct compression algorithms. The first of these is simple run length encoding. In run length encoding the system identifies runs of identical characters and replaces each such string with exactly three of those characters followed by a count of the additional like characters that have been removed. For example, “n” adjacent blank spaces in a file (where n is at least three) would be transmitted to the receiving side as three blank spaces followed by the count n − 3.
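The run length step above can be sketched as follows (a simplified illustration; the function name and output format are hypothetical, and details of the actual MNP5 procedure, such as the maximum run count, are omitted):

```python
def mnp5_rle_encode(data: str) -> list:
    """Sketch of MNP5-style run-length encoding: a run of n >= 3 identical
    characters is emitted as three copies of the character followed by a
    count of the n - 3 characters that were removed."""
    out = []
    i = 0
    while i < len(data):
        ch = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == ch:
            run += 1
        if run >= 3:
            out.extend([ch, ch, ch, run - 3])  # three literals plus removed count
        else:
            out.extend([ch] * run)             # short runs pass through unchanged
        i += run
    return out
```

A run of five identical characters is thus reduced to four output symbols, while runs shorter than three are transmitted unchanged.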
Subsequent to the initial run length compression, first order Huffman encoding is applied. Huffman encoding monitors the frequency of occurrence of each character in the data stream and updates an ASCII table accordingly. Each character position in the table is assigned a unique bit pattern, from 4 bits to 12 bits, to represent it. Initially, the tables in the transmitting modem and the receiving modem are identical. As a character is used more frequently, it moves up in the table, thus reducing the number of bits required to represent it. The transmitting side encodes the character based on the current table and then updates the count for that character; if the updated count causes a reordering of the table, that reordering is performed after the character is transmitted. On the receiving end, the character is decoded prior to updating the count of that specific character in the table. Once the count is incremented the table is re-sorted accordingly and thus remains in sync with the transmitting side. This eliminates the need to separately transmit data regarding changes in the table. Compression is thus achieved by using fewer data bits to transmit the more frequently occurring characters.
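The synchronized table update described above can be illustrated with the following sketch (a deliberate simplification: it emits the character's table index rather than the actual 4- to 12-bit codes, and the class name is hypothetical):

```python
class AdaptiveTable:
    """Sketch of a synchronized adaptive frequency table. Both ends start
    with an identical table; encode updates counts after coding, decode
    updates counts after decoding, so the tables stay in sync."""

    def __init__(self):
        self.chars = [chr(c) for c in range(256)]   # identical initial order
        self.counts = {ch: 0 for ch in self.chars}

    def encode(self, ch: str) -> int:
        index = self.chars.index(ch)  # code from the *current* table
        self._bump(ch)                # then update counts and re-sort
        return index

    def decode(self, index: int) -> str:
        ch = self.chars[index]        # decode before updating, mirroring encoder
        self._bump(ch)
        return ch

    def _bump(self, ch: str) -> None:
        self.counts[ch] += 1
        # stable sort keeps tied characters in their previous relative order,
        # so both ends re-sort identically without exchanging table data
        self.chars.sort(key=lambda c: -self.counts[c])
```

Because the re-sort is deterministic and occurs at the same point on both sides, no table changes ever need to be transmitted, just as described above.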
Another type of data compression performed by conventional modems pursuant to the V.42 BIS modem communication standard is the Lempel-Ziv compression technique, originally developed in the late 1970s in a series of technical publications. In Lempel-Ziv compression, variable length strings of data are converted into fixed-size codewords and stored in a dictionary. Initially the dictionary is populated with 256 codewords, one for each of the characters in the ASCII character set. Each codeword is transmitted using a fixed number of bits that depends on the dictionary size; for example, 11 bits are needed for a dictionary size of 2048 entries.
Thereafter, as the stream of data characters is transmitted by the system, additional table values are determined. These additional values are determined based on the specific sequence of the characters within the data stream to be transmitted. For example, in accordance with the V.42 BIS standard, transmission of portions of the data stream “Hello Henry” would take place as follows.
The first occurrence of the “H” would simply be transmitted as the 11 bit codeword set forth in the original dictionary that corresponds to the ASCII representation of that capital “H”. Transmission of this single character would not alter the dictionary. The next character, “e”, is then transmitted as the 11 bit codeword that corresponds to the ASCII representation of “e” in the dictionary, and the string “He” is added to the dictionary. The next step is to transmit “l”; the string “el” is then added to the dictionary. The letter “l” is then transmitted again and “ll” is added to the dictionary. Transmission continues in this manner until the string “He” is encountered as part of the word “Henry.”
At this point, the 11 bit codeword for the string “He” is transmitted, thus eliminating the need for transmitting a character. Because the table at the decoder side is built in parallel with the encoder table, and based on the same protocol, the 11 bit codeword is readily translated into the proper string. The table continues to build in this manner, with new dictionary entries always created from the longest string that can be matched plus the next character that was transmitted. A string is not added if a string longer than the maximum negotiated string length would be created. Once the dictionary is full, old strings are deleted as new strings are added.
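The dictionary build-up in this walkthrough can be sketched with a plain Lempel-Ziv-Welch style encoder (an illustrative approximation; the actual V.42 BIS procedure differs in details such as escape handling and exactly when entries are added):

```python
def lzw_encode(text: str, codeword_bits: int = 11):
    """Sketch of LZW-style encoding: emit the codeword for the longest
    matching string, then add (match + next character) to the dictionary."""
    dictionary = {chr(c): c for c in range(256)}  # 256 one-character entries
    next_code = 256
    max_entries = 1 << codeword_bits              # 2048 entries for 11-bit codes
    out = []
    current = ""
    for ch in text:
        if current + ch in dictionary:
            current += ch                          # extend the longest match
        else:
            out.append(dictionary[current])        # emit codeword for the match
            if next_code < max_entries:
                dictionary[current + ch] = next_code  # add match + next char
                next_code += 1
            current = ch
    if current:
        out.append(dictionary[current])
    return out, dictionary
```

Running this on “Hello Henry” adds “He”, “el”, “ll”, and so on, exactly in the order described above, and the second occurrence of “He” goes out as a single codeword.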
This technique essentially comprises two primary data structures. The first is a dictionary that records strings, and the second is a history buffer that records the most recent characters passed through the data compressor. Dictionary entries reference an index in the history buffer. Because characters are continually being added to the buffer and the buffer is of finite size, eventually an index will age and become invalid. Furthermore, strings are constantly being added to the dictionary. Eventually either the history buffer or the dictionary will become full, and both situations must be handled by the system. One conventional approach to this problem is simply to clear all of the data structures, including the history buffer as well as the dictionary. However, this approach degrades the overall compression ratio.
In this approach, as the dictionary fills up and then empties in a repetitive saw-tooth cycle, over time the dictionary averages approximately half-full, or about 50 percent utilization. Encoding a dictionary entry with this approach requires log2(dictionary size) bits. When only half the dictionary is in use, an equivalent dictionary of half the size at 100 percent utilization could be used and would require one less bit per codeword. The use of one less bit would improve compression performance by reducing the overall size of the compressed output.
For example, a 1024 entry dictionary requires 10 bits to encode an entry, while a 512 entry dictionary requires only nine bits. When only 512 entries of a 1024 entry dictionary are in use, a 512 entry dictionary could in theory have been utilized instead. Had the smaller dictionary been used, compression performance would have improved by saving one bit per codeword (9 bits versus 10).
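The arithmetic above can be captured in a one-line helper (a sketch; the function name is hypothetical):

```python
import math

def codeword_bits(dictionary_size: int) -> int:
    """Bits needed to encode one entry of a dictionary of the given size."""
    return math.ceil(math.log2(dictionary_size))
```

For the sizes discussed here, a 2048 entry dictionary needs 11 bits, a 1024 entry dictionary 10 bits, and a 512 entry dictionary 9 bits per codeword.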
In order to maintain a full dictionary, it is necessary to adopt one of two alternative strategies. The first approach is to stop adding new strings when the dictionary is full. This provides the advantage of simplicity but stops further adaptation to the data stream, resulting in diminished performance over time.
Another solution is to make space in the dictionary for new strings by removing old strings, preferably the least recently used strings. This approach, however, is greatly complicated by the fact that dictionary entries record indices into the history buffer. This buffer is of finite size, and therefore an index will age and become invalid as characters are continually added to the buffer. Optimal performance can be obtained only by eliminating entries when their indices become invalid or when room is needed for a new string.
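Least-recently-used removal can be sketched as follows (a hypothetical structure for illustration; the history-buffer index bookkeeping described above, which is the hard part in practice, is omitted):

```python
from collections import OrderedDict

class LRUDictionary:
    """Sketch of a string dictionary with least-recently-used eviction."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # string -> codeword, oldest first

    def lookup(self, s: str):
        if s in self.entries:
            self.entries.move_to_end(s)  # mark as most recently used
            return self.entries[s]
        return None

    def add(self, s: str, codeword: int) -> None:
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[s] = codeword
```

Every successful lookup refreshes an entry, so frequently matched strings survive while stale strings are the first to be evicted when room is needed.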
For example, the existing V.44 algorithm involves attempting to find a string of previously processed characters that matches the next characters to be processed. If such a string exists, the codeword representing the string is transmitted. Input characters are placed in the next available locations in the history file, and encoding proceeds according to the following criteria. Using the input character and the next character, the encoder attempts to find the longest matching string in the dictionary and transfers the codeword on a match. If a string match is not found, the encoder transfers the ordinal corresponding to the input character and repeats the above. The encoder then attempts to extend the longest string match up to the maximum string length; if the longest string match is extended by one or more characters, the encoder transfers a string extension length indicating the number of characters by which the string has been extended. After finding the longest string, the encoder returns to try to match the next character. Several modifications to the existing V.44 algorithm simplify this task, but even with these modifications, obtaining optimal performance using all the contents of the dictionary remains an issue.
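A greatly simplified version of the matching loop described above might look like the following (the function name and the two-element output tuples are assumptions for illustration; string extension, history-file placement, and other V.44 details are omitted):

```python
def longest_match_encode(text: str, dictionary: dict) -> list:
    """Sketch of a longest-match loop: transfer a codeword when a dictionary
    string matches, otherwise transfer the ordinal of the input character."""
    out = []
    i = 0
    while i < len(text):
        # find the longest dictionary string matching at position i
        best = ""
        for s in dictionary:
            if len(s) > len(best) and text.startswith(s, i):
                best = s
        if len(best) >= 2:
            out.append(("codeword", dictionary[best]))  # string match found
            i += len(best)
        else:
            out.append(("ordinal", ord(text[i])))       # no match: send character
            i += 1
    return out
```

Even this toy version makes the central difficulty visible: the encoder's performance depends entirely on which strings remain valid and findable in the dictionary.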
Thus there is a need for a method and system of data compression which provides optimal performance of a compression dictionary by eliminating unused entries to create room for new entries. There is also a need for a compression technique which allows the elimination of dictionary entries that are no longer used. Finally, there is a need for a compression technique which makes data compression more efficient by retaining often used dictionary entries.