Many digital communications systems send data in packets. These packets contain headers at the start of the data. The header comprises information relating, e.g., to the destination address of the packet, the length of the packet and the type of payload data contained inside. The header can be considered a long character comprising a string of bits.
Mobile telecom networks and the Internet are converging in terms of their functionality. It is desirable for third generation mobile handsets to understand Internet (IP or ATM) data packets directly, to allow for seamless email, web browsing and multimedia services to the mobile user. Protocols such as IP are designed to run on fixed networks where bandwidth is plentiful, and so they are costly in the mobile phone environment. When used to carry speech, the overhead resulting from using IP can be up to 75% of the total network capacity, which is unacceptable for mobile networks.
One solution to this problem is to compress the IP header just before it crosses the air interface. A number of compression schemes exist for this purpose (Van Jacobson, CRTP etc.), which variously make trade-offs between efficiency, flexibility and simplicity.
Known data compression systems include the Huffman Algorithm, which is discussed in detail in a co-pending application. This publicly available standard is widely used in many compression schemes including “WinZip”. Huffman encoding compresses a data stream one character at a time, where a character is usually one byte. The basic compression is not very efficient, but it is possible to obtain better results by applying the method recursively or by increasing the size of one character. However, this increases the processing and/or memory requirements of the algorithm.
In order to understand the invention, the prior art will now be explained.
Ordinary Huffman
Huffman encoding is a publicly available compression standard used in many popular compression schemes such as “WinZip”. All Huffman compressors work on a stream of characters (for example ASCII characters). The basic idea is to create a new set of compressed characters or codes, where each normal character maps onto a compressed character and vice versa. Frequently occurring, i.e. common characters, are given shorter compressed codes than rarely used characters, reducing the average size of the data stream. The compression ratio can be improved by increasing the size of one character, but at the expense of higher memory requirements. In fact the memory used when running a Huffman compressor grows exponentially with the character size, so 16-bit characters need 256 times as much memory as 8-bit characters.
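The memory figure quoted above can be checked with a short calculation. The sketch below assumes that memory is proportional to the number of starting nodes, one per possible character value; the function name is illustrative only and does not come from any standard.

```python
def starting_nodes(character_bits):
    """Number of possible characters, and so starting nodes, for a given character size."""
    return 2 ** character_bits


# Ratio of starting nodes for 16-bit characters versus 8-bit characters.
ratio = starting_nodes(16) // starting_nodes(8)
print(ratio)  # 256
```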
FIG. 1 illustrates how ordinary Huffman works. The example uses 10 different possible characters (a set of 10 ASCII characters), shown in single inverted commas (in general a character can be anything, e.g. a byte, a header, an ASCII character etc.). A prerequisite is to know, for each character, the approximate probability of that character turning up in the data sequence. The skilled person would understand that this can be done in any appropriate way (e.g. a large stream of characters is taken and one determines how often each character appears).
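One appropriate way of obtaining the probabilities, counting occurrences in a sample stream, might be sketched as follows. The function name and the sample string are illustrative assumptions, not part of the worked example.

```python
from collections import Counter


def estimate_probabilities(sample):
    """Return {character: relative frequency} measured over a sample stream."""
    counts = Counter(sample)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


# Hypothetical sample stream; a real estimate would use a much larger one.
probs = estimate_probabilities("a sample stream of characters")
```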
In the worked example the ordinary Huffman tree needs 10 starting nodes, one for each possible character. These nodes are plotted at the top of the Huffman tree, together with the percentage chance that the character turns up in an uncompressed data-stream. The characters are ordered generally in terms of increasing probability. The space character is a very common character and is put last. As shown in the figure, the box underneath each character shows the probability of occurrence. To build the tree, the two nodes with the smallest probabilities are joined up to form a new node. The left-hand branch is labelled with a “1” and the right-hand branch with a “0”. The new node is assigned the combined probability of the two nodes that were joined (in the first case this is 6%). This process continues until there is only one node left, at which point the tree is finished. In general, the branch with the smallest probability is labelled with a ‘1’, and the branch with the second smallest with a ‘0’. The sum of these two probabilities is placed in the new node. The completed Huffman tree for the worked example is shown in FIG. 1.
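The tree-building procedure described above can be sketched as follows. This is a minimal illustration only: the four characters and their percentages are hypothetical and are not the values of FIG. 1, and the function name is an assumption. Bits are recorded in the order in which they are met when moving from a starting node towards the root.

```python
import heapq
from itertools import count


def build_huffman_codes(probs):
    """Join the two least probable nodes repeatedly; return codes read leaf-to-root."""
    tie = count()  # tie-breaker so nodes of equal probability never compare their lists
    heap = [(p, next(tie), [sym]) for sym, p in probs.items()]
    heapq.heapify(heap)
    codes = {sym: "" for sym in probs}
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)  # smallest probability: branch labelled '1'
        p0, _, syms0 = heapq.heappop(heap)  # second smallest: branch labelled '0'
        for s in syms1:
            codes[s] += "1"
        for s in syms0:
            codes[s] += "0"
        # The new node carries the sum of the two joined probabilities.
        heapq.heappush(heap, (p1 + p0, next(tie), syms1 + syms0))
    return codes


# Hypothetical percentages for four characters.
codes = build_huffman_codes({" ": 55, "E": 25, "A": 14, "Z": 6})
print(codes)  # {' ': '0', 'E': '01', 'A': '011', 'Z': '111'}
```

With these assumed percentages the most common character receives a single bit and the rarest characters receive three bits.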
To compress a character, one starts at the correct node and follows the tree down, reading off the ‘1’s and ‘0’s as they occur. The string of bits that this generates is the compressed character. For example, starting at “E” and following the tree down to its root gives 0001. Thus E is represented by 0001.
The compressed character is sometimes stored in reverse order, so that E is represented by 1000. This makes it easier to decompress (because the tree can be followed up by reading the compressed character from left to right).
Similarly, to decompress a character just follow the tree up using the compressed string of bits to decide whether to branch left or right at each node. Eventually one of the original ten nodes is reached and the correct decompressed character is discovered.
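Compression and decompression with the reversed (root-first) bit order might be sketched as follows. The code table is a hypothetical four-character example and not the table of FIG. 1, and the function names are assumptions. The decode tree is held as nested dictionaries, with each bit selecting a branch until a character is reached.

```python
def build_decode_tree(reversed_codes):
    """Build a nested-dict tree from codes written root-first."""
    root = {}
    for sym, bits in reversed_codes.items():
        node = root
        for b in bits[:-1]:
            node = node.setdefault(b, {})
        node[bits[-1]] = sym  # the last bit lands on the original character
    return root


def compress(text, reversed_codes):
    """Replace each character by its compressed bit string."""
    return "".join(reversed_codes[ch] for ch in text)


def decompress(bitstring, root):
    """Read bits left to right, branching at each node until a character is reached."""
    out = []
    node = root
    for b in bitstring:
        node = node[b]
        if isinstance(node, str):  # reached one of the starting nodes
            out.append(node)
            node = root
    return "".join(out)


# Hypothetical code table in reversed (root-first) order.
table = {" ": "0", "E": "10", "A": "110", "Z": "111"}
tree = build_decode_tree(table)
bits = compress("E AZ", table)
print(bits)                    # 100110111
print(decompress(bits, tree))  # E AZ
```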
As can be seen, common characters are represented by fewer bits; a “space” character is represented here by a 0.
Improved Huffman
In a well-known enhanced method of compressing a stream of characters based on Huffman, each character is assigned to a group, and it is the groups which are treated as the characters of the conventional Huffman algorithm: the tree is built from “character groups” rather than from the characters themselves. The method has significantly lower memory requirements than ordinary Huffman, allowing the size of one character to be increased and hence giving a better compression ratio.
The improved Huffman tree is constructed in two stages. In the first stage the characters are divided up into groups according to a common behaviour pattern. A behaviour pattern may e.g. be the same probability, so characters are grouped according to their relative frequency.
The problem, however, is that in a compressed character the Huffman code for the group must be followed by a bit pattern identifying which character within the group has been compressed. If the number of characters in the group is not a power of two, then bit patterns are wasted, giving poorer compression efficiency. The inventor has determined a method which overcomes these problems.
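The waste can be illustrated with a short calculation. The sketch assumes that the identifying bit pattern has a fixed width within a group, namely the smallest number of bits able to distinguish every member; the function names are illustrative.

```python
def identifier_bits(group_size):
    """Smallest k such that 2**k >= group_size (assumed fixed-width identifier)."""
    return (group_size - 1).bit_length()


def wasted_patterns(group_size):
    """Bit patterns of that width which identify no character in the group."""
    return 2 ** identifier_bits(group_size) - group_size


for size in (4, 5, 8):
    print(size, identifier_bits(size), wasted_patterns(size))
# 4 2 0
# 5 3 3
# 8 3 0
```

Under this assumption a group of five characters needs three identifier bits, and three of the eight possible patterns go unused.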
It is an object of the invention to provide an improved method of compression and subsequent decompression of headers and characters of binary (or other) data units.
The inventor has determined an improved method of compression of digital data which makes use of detecting behaviour patterns in successive data blocks, which allows for efficient data compression. Behaviour patterns are defined as any form of non-randomness and may take any appropriate form e.g. repeats, counters where the counter is incremented by 1, or where data blocks alternate between a small number of values.
The inventor has developed an improved version of the Huffman method which has significantly lower memory requirements than ordinary Huffman, allowing the size of one character to be increased and hence giving a better compression ratio.
The invention comprises a method of compressing data, wherein said data is in the form of discrete units, by determining an overall code specific to the units, comprising the steps of:
a) grouping the units in terms of a common behaviour pattern;
b) for each said group of units, assigning a group specific code, the sizes of said group code being ordered according to the probability of the unit occurring;
c) assigning a unit identifier code which is specific to identify an individual character within the group,
the overall code comprising the concatenated group specific and identifier codes, characterised in that, in
step c), the identifier code is of the minimum size to allow for each unit which could occur in that group to be assigned specifically.
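One possible reading of steps a) to c) is sketched below for illustration only. The groups, their members and the group codes are hypothetical, and the identifier is assumed to be a fixed-width binary index of the smallest width that distinguishes every member of the group; none of these details is taken from the examples that follow.

```python
def identifier_width(group_size):
    """Smallest number of bits able to distinguish every unit in the group."""
    return (group_size - 1).bit_length()


# Hypothetical groups: group specific code -> units sharing a behaviour pattern.
GROUPS = {
    "0": [" "],
    "10": ["E", "T", "A", "O"],
    "11": ["Q", "Z"],
}


def build_table(groups):
    """Overall code = group specific code followed by the unit identifier code."""
    table = {}
    for group_code, members in groups.items():
        width = identifier_width(len(members))
        for index, unit in enumerate(members):
            ident = format(index, "b").zfill(width) if width else ""
            table[unit] = group_code + ident
    return table


table = build_table(GROUPS)
print(table[" "], table["A"], table["Z"])  # 0 1010 111
```

A group with a single member needs no identifier bits at all, so its overall code is the group code alone.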
The invention will now be described in more detail with reference to examples.
Other objects, advantages and novel features of the present invention will become apparent from the following detailed description of the invention when considered in conjunction with the accompanying drawings.