1. Field of the Invention
The present invention relates to a data processing system and method for generating a structured listing of symbols from which encoded data values for those symbols can be determined.
2. Description of the Prior Art
It is often desirable to compress a stream of data at some point during the processing of that data stream. Generally the data stream can be considered as comprising a sequence of symbols, where a symbol is a predefined data element that can be individually encoded during the compression of the data stream. For example, if the data stream represents a text file, then the data representing each individual ASCII character may be considered to be a symbol. Similarly, if the data stream represents an image, then the data representing individual pixels may be considered to be a symbol.
When compressing such a data stream, it is first necessary to generate a structured listing of the symbols from which encoded data values for those symbols can be determined. This structured listing may take a variety of forms, for example a look-up table identifying the encoded data value for each symbol, or a tree structure with the set of symbols forming the leaves. One example of a compression technique which may be applied is the Huffmann compression technique. When performing Huffmann compression, it is necessary to generate a structured listing of symbols in the form of a Huffmann tree, the Huffmann tree enabling encoded data values to be obtained for each symbol.
Considering the example of Huffmann compression, this basically consists of three steps, namely:
1 Counting symbol frequencies in a data stream;
2 Generating the Huffmann tree based on the frequency information for each symbol; and
3 Encoding symbols using the Huffmann tree.
The first and last steps are well understood, and optimal implementations have existed for some time. However, the second step has a significant impact on the efficiency of the Huffmann compression technique, since the algorithms used to build the Huffmann tree are typically relatively slow. For example, often an algorithm having a complexity of order N2 (referred to as a O(N2) algorithm) is used to build the Huffmann tree, where N is the number of symbols. Hence, as N becomes larger, the time taken to build the Huffmann tree increases proportional to N2. More advanced algorithms do exist which use a priority queue in order to achieve lower build time of O(N log N).
However, it is still desirable to further decrease the time taken to generate the structured listing of symbols, for example the Huffmann tree in the case of Huffmann compression. It will be appreciated that better compression will be achieved if the structured listing of symbols is generated dynamically based on the particular data stream to be compressed. However, the benefits of doing this have to be weighed up against the overheads involved in generating the structured listing of symbols dynamically. Hence, in certain implementations where the overhead in generating a dynamic structured listing of symbols is considered too great, a structured listing of symbols is instead generated on the basis of a sample data stream, and compression of subsequent data streams is then obtained using that structured listing of symbols. Clearly if the structured listing of symbols could be generated more efficiently, then this would increase the number of implementations in which it would be acceptable to generate the structured listing dynamically based on the actual data to be compressed, thereby enabling significantly improved compression to be obtained.
Accordingly, it is an object of the present invention to provide an improved technique for generating a structured listing of symbols from which encoded data values for those symbols can be determined.
Viewed from a first aspect, the present invention provides a method of generating a structured listing of symbols from which encoded data values for those symbols can be determined, the method comprising the steps of: (a) for an input stream of symbols, generating a first list having a plurality of entries, each entry identifying a symbol in the input stream and the frequency with which that symbol appears; (b) ordering the entries in the first list by frequency; (c) selecting the two symbols having the lowest frequency; (d) generating a new symbol to represent the two selected symbols, and allocating the new symbol a frequency based on the two selected symbols; (e) storing the new symbol as an entry in a second list along with an indication of the frequency allocated to the new symbol; (f) arranging for the entries for the two symbols selected at said step (c) to be unavailable for subsequent steps in the generation of the structured listing; and (g) repeating the steps (c) to (f) until only one available entry remains, in each iteration the two symbols selected at said step (c) being chosen from all available entries in the first and the second list.
In accordance with the present invention, a first list is generated having entries identifying symbols and the frequency with which those symbols appear, the entries in the list being ordered by frequency. Then the two symbols having the lowest frequency are selected, and a new symbol is generated to represent those symbols, the new symbol being allocated a frequency based on the two selected symbols. Generally the frequency assigned to the new symbol will be equal to the sum of the frequencies of the two selected symbols represented by the new symbol.
Then, in accordance with the invention, the new symbol is stored as an entry in a second list along with an indication of the frequency allocated to that symbol, and the entries for the two symbols that were selected to be represented by that new symbol are then deemed unavailable for subsequent steps in the generation of the structured listing. This process of selecting the two symbols with the lowest frequency, and replacing them with a new symbol stored in the second list is repeated until only one available entry remains. In each iteration, the two symbols with the lowest frequency are chosen from all available entries in the first and the second list.
It will be appreciated by those skilled in the art that whilst the first and second lists may be provided separately, they could be provided by a single list logically partitioned to provide the first and second lists.
In accordance with the above approach, it will be appreciated that the original set of symbols in the input stream need only be ordered once, since the entries in the first list then remain frequency ordered throughout the generation of the structured listing. There are a number of known O(N) sorting algorithms that can be used to perform this initial ordering. Further, in accordance with the present invention, all of the new symbols generated are placed in a separate list, and since each new symbol generated will have a frequency greater than any of the new symbols previously generated (since in each iteration the two symbols having the lowest frequency will have larger frequencies than the two symbols selected during the previous iteration), then it can be seen that the entries in the second list can be readily maintained in frequency order.
Given that the frequency ordering of both lists is readily maintained, the number of entries that need to be reviewed in order to find the symbol having the lowest frequency is independent of the total number of symbols N, and in fact the smallest symbol can be selected in O(1) time, i.e. in a single time step.
If there are N symbols initially, then since each iteration removes two symbols, and inserts one new symbol, the method of the invention takes N iterations to complete. Since selecting the two smallest symbols takes O(1) time, and inserting the new symbol takes O(1) time, it can be seen that in accordance with the present invention, the structured listing of symbols can be generated in O(N) time, and hence as N becomes larger, the time taken to generate the structured listing only increases proportional to N.
In preferred embodiments, a first end of the first list contains the entry for the symbol in the input stream with the lowest frequency, a first end of the second list contains the entry for the new symbol in the second list with the lowest frequency, and at said step (c) the two symbols with the lowest frequency are selected from the available entries nearest the first ends of the first and second lists.
Preferably, the second list is frequency ordered by arranging that each new symbol added to the second list is stored in an entry further from the first end of the second list than any other entries in the second list. As mentioned earlier, this enables the second list to trivially remain sorted, since the sum of the frequencies of the two smaller symbols monotonically increases during each iteration, and hence each new symbol generated will have a frequency higher than any new symbol generated in previous iterations.
In preferred embodiments, a series of pointers are used to assist in the selection of the symbols having the lowest frequency. In particular, a first pointer is preferably arranged to point to the available entry nearest the first end of the first list, and a second pointer is arranged to point to the available entry nearest the first end of the second list, a symbol being selected at said step (c) by comparing the frequencies of the symbols identified by the first and second pointers.
Further, in preferred embodiments, a third pointer is arranged to point to a location in the second list at which an entry for the next new symbol is to be stored, and if the second and third pointers point to the same location in the second list, then a symbol is selected at said step (c) by selecting the symbol identified by the first pointer. During the first iteration of the method, it will be clear that there are no entries in the second list, and accordingly the second and third pointers will point to the same location during this first iteration. Using this preferred approach, it is clear that the symbols selected at said step (c) during the first iteration can merely be selected by selecting the symbols identified by the first pointer.
Assuming the first list contains N symbols, then in preferred embodiments a fourth pointer is arranged to point to a location displaced by N entries from the entry initially pointed to by the first pointer, wherein when the first pointer and the fourth pointer point to the same location, there are no more available entries in the first list, and any further symbols selected at said step (c) are selected from the second list. This approach ensures that when there are no more available entries in the first list, then the process merely continues by selecting available symbols from the second list until such time as there is only a single available entry left.
It will be appreciated that there are a number of ways in which it can be determined when there is only a single available entry left in the first and second lists. However, in preferred embodiments, where the first list contains N symbols the method further comprises the steps of: initialising a count value to N; during each iteration of said steps (c) to (f), decrementing the count value by 1; and arranging for the final iteration of said steps (c) to (f) to be the iteration during which the count value is decremented to be equal to 1.
The structured listing of symbols may take a variety of forms. However, in preferred embodiments the structured listing represents a Huffmann tree, and the encoded data values for each symbol are determined from the Huffmann tree. Huffmann compression is a commonly used compression technique, and the technique of the present invention provides significant efficiency improvements in the generation of the Huffmann tree when compared with algorithms typically used in the prior art.
It will be apparent that when using the structured listing of symbols to determine encoded data values, it will typically be necessary to have a record of the two symbols represented by any particular new symbol. Hence, in preferred embodiments, each entry in the second list includes two child fields identifying the two symbols represented by the new symbol stored in that entry.
Viewed from a second aspect, the present invention provides a data processing system for generating a structured listing of symbols from which encoded data values for those symbols can be determined, the data processing system comprising: a list generator for generating from an input stream of symbols a first list having a plurality of entries, each entry identifying a symbol in the input stream and the frequency with which that symbol appears; a sorter for ordering the entries in the first list by frequency; a selector for selecting the two symbols having the lowest frequency; a new symbol generator, responsive to the selector, to generate a new symbol to represent the two selected symbols, and to allocate the new symbol a frequency based on the two selected symbols; the list generator being arranged to generate a second list for storage of new symbols generated by the new symbol generator, the list generator being arranged to store the new symbol as an entry in the second list along with an indication of the frequency allocated to the new symbol, the list generator further being arranged to make unavailable for subsequent steps in the generation of the structured listing the entries for the two symbols selected by the selector; and the selector and new symbol generator being arranged to repetitively represent the two symbols having the lowest frequency with a new symbol until only one available entry remains, each new symbol being stored in the second list, and in each iteration the selector being arranged to select the two symbols from all available entries in the first and the second list.
Viewed from a third aspect, the present invention provides an encoder for generating a structured listing of symbols from which encoded data values for those symbols can be determined, the encoder comprising: a list generator configured in operation to generate from an input stream of symbols a first list having a plurality of entries, each entry identifying a symbol in the input stream and the frequency with which that symbol appears; a sorter configured in operation to order the entries in the first list by frequency; a selector configured in operation to select the two symbols having the lowest frequency; a new symbol generator configured to be responsive to the selector to generate a new symbol to represent the two selected symbols, and to allocate the new symbol a frequency based on the two selected symbols; the list generator further being configured in operation to generate a second list for storage of new symbols generated by the new symbol generator, the list generator being arranged to store the new symbol as an entry in the second list along with an indication of the frequency allocated to the new symbol, the list generator further being arranged to make unavailable for subsequent steps in the generation of the structured listing the entries for the two symbols selected by the selector; and the selector and new symbol generator being arranged to repetitively represent the two symbols having the lowest frequency with a new symbol until only one available entry remains, each new symbol being stored in the second list, and in each iteration the selector being arranged to select the two symbols from all available entries in the first and the second list.
Viewed from a fourth aspect, the present invention provides a computer program product on a computer readable medium for operating a data processing system to generate a structured listing of symbols from which encoded data values for those symbols can be determined, the computer program product comprising: a list generator configured in operation to generate from an input stream of symbols a first list having a plurality of entries, each entry identifying a symbol in the input stream and the frequency with which that symbol appears; a sorter configured in operation to order the entries in the first list by frequency; a selector configured in operation to select the two symbols having the lowest frequency; a new symbol generator configured in operation to be responsive to the selector to generate a new symbol to represent the two selected symbols, and to allocate the new symbol a frequency based on the two selected symbols; the list generator further being configured in operation to generate a second list for storage of new symbols generated by the new symbol generator, the list generator being arranged to store the new symbol as an entry in the second list along with an indication of the frequency allocated to the new symbol, the list generator further being arranged to make unavailable for subsequent steps in the generation of the structured listing the entries for the two symbols selected by the selector; and the selector and new symbol generator being arranged to repetitively represent the two symbols having the lowest frequency with a new symbol until only one available entry remains, each new symbol being stored in the second list, and in each iteration the selector being arranged to select the two symbols from all available entries in the first and the second list.