When transmitting data over a communications channel, or when storing data, it is often useful to compress the data, in order to reduce the resources required to transmit or store it. Data compression techniques rely on the fact that most useful data contains patterns, which can be exploited to reduce the amount of data required to represent the source data. For example, when the source data contains one or more repeated sequences of characters, each of these can be represented more efficiently by a particular code. Provided that the code contains fewer bits than the sequence of characters, this representation reduces the amount of data required to represent the source data.
One well known data compression technique is the LZW (Lempel-Ziv-Welch) data compression algorithm, which is described in U.S. Pat. No. 4,558,302 to Welch. In use of the LZW algorithm, a code dictionary is built up, based on the received data string. The received data can for example represent text made up of characters. Then, each individual character can be represented by one of the available codes, and the other available codes can be assigned to respective character strings, as they appear in the received text. For example, if the text is represented in the well known ASCII format with 256 available characters, one of the available codes is assigned to each of these characters. The total number of available codes is determined by the number of bits in each of the codes. For example, if 12 bit codes are used, there are 2^12 (=4096) available codes, whereas if 13 bit codes are used, there are 2^13 (=8192) available codes, and, if 14 bit codes are used, there are 2^14 (=16384) available codes. Using longer codes therefore allows more codes to be assigned, but also increases the amount of data required to represent each code. The optimum number of available codes will depend on the properties of the data being compressed.
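The trade-off between code length and the number of available codes can be illustrated with a short calculation. This is only a sketch: the function name code_budget is hypothetical, and it assumes that exactly 256 codes are reserved for single characters, as in the example above.

```python
SINGLE_CHARACTER_CODES = 256  # assumption: one code per 8-bit character


def code_budget(code_bits):
    """Return (total codes, codes left over for character strings)."""
    total = 2 ** code_bits
    return total, total - SINGLE_CHARACTER_CODES


for bits in (12, 13, 14):
    total, for_strings = code_budget(bits)
    print(f"{bits} bit codes: {total} available, {for_strings} left for strings")
```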
Having initially assigned 256 of the available codes to the 256 ASCII characters, a new code is allocated every time that a character string appears in the received data for the first time. For each new code, the code dictionary stores the new code itself, a prefix code (which is a previously assigned code), and an append character (which is added to the character or character string represented by the prefix code to form the character string represented by the new code).
Thus, if the code 65 is assigned to the letter “a”, and the letter “a” is the first character in the received character string, then a new code is allocated when the second character is received. A count is kept of the number of codes that have been allocated, so that a new code can be allocated a code number based on the number of previously allocated codes. For example, if the letter “b” is the second character in the received character string, at a time when 256 codes have previously been assigned to the 256 ASCII characters, then the new code is code 257, and it has the prefix code 65, and the append character is “b”.
When the character string “ab” next appears in the received text, then a further new code is assigned. For example, if this occurs after 100 further new codes have been assigned, and the letter “s” immediately follows that second occurrence of the character string “ab”, then the new code is code 357, the prefix code is code 257, and the append character is “s”.
Thus, the code dictionary is built up as the text is received, until such time as all of the available codes have been assigned.
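The way the dictionary grows can be sketched as follows. This is an illustrative sketch rather than the algorithm of the cited patent: the names build_dictionary, first_new_code and code_limit are hypothetical, the first new code is numbered 257 to follow the example above, and single characters are coded with Python's ord, so that "a" has the code 97 rather than the code 65 used in the example.

```python
def build_dictionary(text, first_new_code=257, code_limit=4096):
    """Run an LZW-style pass over text.

    Returns (output_codes, dictionary), where dictionary maps
    (prefix_code, append_character) to the newly allocated code.
    """
    if not text:
        return [], {}
    dictionary = {}
    next_code = first_new_code
    output = []
    prefix = ord(text[0])  # assumption: a character's code is its ord value
    for ch in text[1:]:
        key = (prefix, ch)
        if key in dictionary:
            # The string is already known: extend the current match.
            prefix = dictionary[key]
        else:
            # First appearance: emit the match and allocate a new code.
            output.append(prefix)
            if next_code < code_limit:
                dictionary[key] = next_code
                next_code += 1
            prefix = ord(ch)
    output.append(prefix)
    return output, dictionary


codes, entries = build_dictionary("abxabs")
for (prefix_code, append_char), code in entries.items():
    print(code, prefix_code, append_char)
```

In this sketch the second occurrence of "ab" is matched against the entry created at its first occurrence, and the following "s" causes a further entry to be created whose prefix code is the code for "ab".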
It is therefore apparent that, when each new character is received, it is necessary to search the code dictionary, in order to determine whether there exists an assigned code which should be used at that point. That is, in the example given above, when the character string “ab” appears in the received text for the second time, and the letter “s” immediately follows, it is necessary to search through the code dictionary, in order to determine whether there already exists a code with the prefix code 257 and the append character “s”.
In order to allow the search of the code dictionary to be performed more quickly, one known technique uses a hash algorithm, although there are many other ways of performing the search. Using this known technique, a table is defined, having a size that is a prime number, and that is larger than the number of available codes. When a new code is assigned, it is stored in the table at an index defined by a hash function, derived from the prefix code and the append character. However, if it is determined that a code is already stored at this index (one of the properties of a hash algorithm being that each different output can be derived from many different inputs), the index is incremented according to some further hash function, and the new code is stored at that index, if possible. It will be appreciated that, as the table fills up, this process may need to be repeated multiple times in order to find a location at which the code can be stored.
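The storage step described above can be sketched as follows. The class name CodeTable, the two hash formulas, and the table size of 5021 (a prime larger than the 4096 codes available with 12 bit codes) are all assumptions chosen for illustration; the text does not specify them. The append character is passed as its integer value.

```python
class CodeTable:
    """Open-addressing table of (prefix_code, append_char, code) entries."""

    def __init__(self, size=5021):  # assumed size: prime, larger than 4096
        self.size = size
        self.slots = [None] * size

    def _start(self, prefix_code, append_char):
        # Hash function (assumed formula) giving the first index to try.
        return ((append_char << 4) ^ prefix_code) % self.size

    def _step(self, prefix_code, append_char):
        # Further hash function (assumed formula) giving a nonzero increment.
        return 1 + ((prefix_code * 31 + append_char) % (self.size - 1))

    def insert(self, prefix_code, append_char, code):
        """Store the entry and return the number of slots examined."""
        index = self._start(prefix_code, append_char)
        step = self._step(prefix_code, append_char)
        for probes in range(1, self.size + 1):
            if self.slots[index] is None:
                self.slots[index] = (prefix_code, append_char, code)
                return probes
            # Slot already occupied: move on by the increment and retry.
            index = (index + step) % self.size
        raise RuntimeError("table is full")


table = CodeTable()
probes = table.insert(65, ord("b"), 257)
print("slots examined:", probes)
```

Because the table size is prime and the increment is never zero, the sequence of indexes visits every slot before repeating, so a free slot is always found if one exists. The returned count shows how the number of slots examined grows as the table fills.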
When searching the code dictionary, therefore, to determine whether a code with a particular prefix code and append character has been assigned, the prefix code and the append character are combined using the hash function to form a hash value, and this is used as an index to search the table.
If, at this index, there is stored a code having the intended prefix code and append character, the search is complete. If no code is stored at this index, then a new code is created.
However, if there is a code stored at this index, having a different combination of prefix code and append character, it may still be the case that a code having the intended prefix code and append character is stored elsewhere in the table. In order to find this code, the further hash function mentioned above is used to increment the index, and determine an alternative location at which the code may be stored.
Again, as the table fills up, this process may need to be repeated multiple times in order either to find the code having the intended prefix code and append character, or to determine that no such code has been created and stored.
The need for repeated probing places a limit on the speed at which data can be compressed, and moreover means that this speed will vary, depending on the nature of the data in the source data stream.
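The search procedure described above, including the repeated probing that makes the search time variable, can be sketched as follows. The function name find_or_add and both hash formulas are hypothetical, and the table is represented as a plain Python list whose length is assumed to be a prime number. The append character is passed as its integer value.

```python
def find_or_add(table, prefix_code, append_char, new_code):
    """Search for (prefix_code, append_char); create the entry if absent.

    Returns (code, found, probes), where probes is the number of
    table slots that had to be examined.
    """
    size = len(table)
    # Hash function (assumed formula) giving the first index to search.
    index = ((append_char << 4) ^ prefix_code) % size
    # Further hash function (assumed formula) giving a nonzero increment.
    step = 1 + ((prefix_code * 31 + append_char) % (size - 1))
    for probes in range(1, size + 1):
        entry = table[index]
        if entry is None:
            # Empty slot: no such code has been assigned, so create it.
            table[index] = (prefix_code, append_char, new_code)
            return new_code, False, probes
        if entry[0] == prefix_code and entry[1] == append_char:
            # The intended code is stored here: the search is complete.
            return entry[2], True, probes
        # A different code is stored here: try the alternative location.
        index = (index + step) % size
    raise RuntimeError("table is full and the entry was not found")


table = [None] * 5021  # assumed size: prime, larger than 4096
print(find_or_add(table, 257, ord("s"), 357))  # first call creates the entry
print(find_or_add(table, 257, ord("s"), 358))  # second call finds code 357
```

The probes value returned by each call depends on which slots are already occupied, which is why the time per character varies with the source data.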