Many digital communications systems send data in packets. These packets contain headers at the start of the data. The header comprises information relating, e.g., to the destination address of the packet, the length of the packet and the type of payload data contained inside. The header can be considered a long character comprising a string of bits.
Mobile telecom networks and the Internet are converging in terms of their functionality. It is desirable for third generation mobile handsets to understand Internet (IP or ATM) data packets directly, to allow for seamless email, web browsing and multimedia services to the mobile user. Protocols such as IP are designed to run on fixed networks where bandwidth is plentiful, and so they are costly in the mobile phone environment. When used to carry speech, the overhead resulting from using IP can be up to 75% of the total network capacity, which is unacceptable for mobile networks.
One solution to this problem is to compress the IP header just before it crosses the air interface. A number of compression schemes exist for this purpose (Van Jacobson, CRTP etc.), which variously make trade-offs between efficiency, flexibility and simplicity.
Known data compression systems include the Huffman Algorithm. This publicly available standard is widely used in many compression schemes including “WinZip”. Huffman encoding compresses a data stream one character at a time, where a character is usually one byte. The basic compression is not very efficient, but it is possible to obtain better results by applying the method recursively or by increasing the size of one character. However, this increases the processing and/or memory requirements of the algorithm.
In order to understand the invention, the prior art will now be explained.
Ordinary Huffman
Huffman encoding is a publicly available compression standard used in many popular compression schemes such as “WinZip”. All Huffman compressors work on a stream of characters (for example ASCII characters). The basic idea is to create a new set of compressed characters or codes, where each normal character maps onto a compressed character and vice versa. Frequently occurring, i.e. common characters, are given shorter compressed codes than rarely used characters, reducing the average size of the data stream. The compression ratio can be improved by increasing the size of one character, but at the expense of higher memory requirements. In fact the memory used when running a Huffman compressor grows exponentially with the character size, so 16-bit characters need 256 times as much memory as 8-bit characters.
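The 256-fold figure follows from counting one table entry per possible character value. A minimal sketch of that arithmetic, in which the function name is hypothetical and the one-entry-per-value assumption is a simplification:

```python
def table_entries(bits_per_character):
    # Assumption: the compressor keeps one entry (one starting node)
    # for every possible character value.
    return 2 ** bits_per_character

# Ratio of entries needed for 16-bit characters versus 8-bit characters.
ratio = table_entries(16) // table_entries(8)
print(table_entries(8), table_entries(16), ratio)
```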
FIG. 1 illustrates how ordinary Huffman works. The example uses 10 different possible characters (a set of 10 ASCII characters), shown in single inverted commas; in general a character can be anything, e.g. a byte, a header, an ASCII character etc. A prerequisite is to know, for each character, the approximate probability of that character turning up in the data sequence. The skilled person would understand that this can be determined in any appropriate way (e.g. a large stream of characters is taken and one determines how often each character appears).
In the worked example the ordinary Huffman tree needs 10 starting nodes, one for each possible character. These nodes are plotted at the top of the Huffman tree, together with the percentage chance that the character turns up in an uncompressed data stream. The characters are ordered roughly by increasing probability; the space character, being very common, is placed last. As shown in the figure, the box underneath each character shows its probability of occurrence. To build the tree, the two nodes with the smallest probabilities are joined to form a new node. The left-hand branch is labelled with a “1” and the right-hand branch with a “0”. The new node is assigned the combined probability of the two nodes that were joined (in the first case this is 6%). This process continues until there is only one node left, at which point the tree is finished. In general, the branch with the smallest probability is labelled with a “1” and the branch with the second smallest with a “0”, and the sum of these two probabilities is placed in the new node. The completed Huffman tree for the worked example is shown in FIG. 1.
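The tree-building procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name and the six example probabilities are hypothetical (they are not the values of FIG. 1), and the codes are returned in root-to-leaf order, i.e. the reverse of the order obtained by following the tree from a starting node down to the root.

```python
import heapq
from itertools import count

def build_huffman_codes(probabilities):
    """Return {character: code}, each code read from the root to the leaf."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    # Each heap entry holds a node's probability and the partial codes
    # of all characters that sit beneath that node.
    heap = [(p, next(tiebreak), {ch: ""}) for ch, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p_small, _, small = heapq.heappop(heap)   # smallest probability
        p_next, _, nxt = heapq.heappop(heap)      # second smallest
        # Smallest branch is labelled "1", second smallest "0".
        merged = {ch: "1" + code for ch, code in small.items()}
        merged.update({ch: "0" + code for ch, code in nxt.items()})
        # The new node carries the sum of the two probabilities.
        heapq.heappush(heap, (p_small + p_next, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical percentages, chosen only for illustration.
example = {"Z": 2, "Q": 4, "A": 10, "T": 18, "E": 26, " ": 40}
codes = build_huffman_codes(example)
for ch, code in sorted(codes.items(), key=lambda item: len(item[1])):
    print(repr(ch), code)
```

With these example values the space character, being the most probable, receives a one-bit code, while the rarest characters receive the longest codes.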
To compress a character, one starts at the correct node and follows the tree down, reading off the “1”s and “0”s as they occur. The string of bits that this generates is the compressed character. For example, start at “E” and follow the tree down to its root; this gives 0001. Thus E is represented by 0001.
The compressed character is sometimes written backwards, so E is represented by 1000. This makes it easier to decompress (because we can follow the tree up by reading the compressed character from left to right).
Similarly, to decompress a character just follow the tree up using the compressed string of bits to decide whether to branch left or right at each node. Eventually one of the original ten nodes is reached and the correct decompressed character is discovered.
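Compression and decompression with such a code can be sketched as follows. The table is an assumption made for illustration: the entries for the space character and for “E” (in its reversed, root-to-leaf form 1000) follow the worked example in the text, while the other three entries and the function names are hypothetical and were chosen only so that no code is a prefix of another. Looking up the accumulated bits in a table is equivalent to branching left or right through the tree one bit at a time.

```python
# Root-to-leaf codes. " " and "E" follow the text's worked example;
# "Q", "A" and "T" are hypothetical entries that keep the table prefix-free.
CODES = {" ": "0", "E": "1000", "Q": "1001", "A": "101", "T": "11"}

def compress(text, codes=CODES):
    # Replace every character by its compressed bit string.
    return "".join(codes[ch] for ch in text)

def decompress(bits, codes=CODES):
    decode = {code: ch for ch, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit              # take one more branch in the tree
        if current in decode:       # an original starting node is reached
            out.append(decode[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)

bits = compress("EAT A")
print(bits, decompress(bits))
```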
As can be seen, common characters are represented by fewer bits; a “space” character is represented here by a 0.
Improved Huffman
In a well-known enhanced method of compressing a stream of characters based on Huffman, each character is assigned to a group, and the tree is built from these “character groups” rather than from the characters themselves; the groups effectively become the characters of the ordinary Huffman algorithm. The method has significantly lower memory requirements than ordinary Huffman, allowing the size of one character to be increased and hence giving a better compression ratio.
The improved Huffman tree is constructed in two stages. In the first stage the characters are divided up into groups according to a common behavior pattern. A behavior pattern may e.g. be the same probability, so characters are grouped according to their relative frequency.
The problem, however, is that in a compressed character the Huffman code for the group must be followed by a bit pattern identifying which character within the group has been compressed. If the number of characters in the group is not a power of two, then bit patterns are wasted, giving poorer compression efficiency. The inventors have determined a method which overcomes these problems.
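The waste described for groups whose size is not a power of two can be illustrated with a short sketch. It assumes that the character within a group is identified by a fixed-width binary index; the function name is hypothetical.

```python
def index_bits_and_waste(group_size):
    """Return (bits needed to index the group, unused bit patterns)."""
    # Smallest number of bits able to distinguish group_size characters.
    bits = (group_size - 1).bit_length()
    # Patterns available with that many bits, minus those actually used.
    wasted = (1 << bits) - group_size
    return bits, wasted

for size in (4, 5, 8):
    print(size, index_bits_and_waste(size))
```

Under this assumption a group of 5 characters needs 3 index bits, of which 3 of the 8 possible patterns are never used, whereas groups of 4 or 8 characters waste none.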