This invention relates generally to methods and apparatus for counting the number of set bits in data words in data processing systems.
Many data processing applications require the number of set bits, i.e. bits of value xe2x80x9c1xe2x80x9d, in a data word to be counted. Often this operation must be repeated many times, as in graphics or cryptography applications for example, so the process can be a time consuming one.
Various methods for counting the number of set bits in data words are known. One method is to check each bit of the data word in turn and keep a running total of the number of set bits found by incrementing a counter when the checked bit is a xe2x80x9c1xe2x80x9d. Checking the bits in turn requires a time varying linearly with the length of the word, and, assuming the process branches to increment the counter only if the checked bit is set, the total time for the operation is dependent on the content of the word and thus varies from word to word. Another method is to clear the least significant set bit of the word using the technique x=x and (xxe2x88x921), where xe2x80x9candxe2x80x9d signifies the logical AND operation, and keep a count of the number of times that a set bit has been cleared. The process continues until the word is all zeros at which point the count is returned. The time taken by this process is again content dependent, varying according to the number of set bits in the word. The branches required can be slow, and unrolling the loop to avoid branches would substantially increase the code length. A further method is to use a lookup table to count bits in part of the word and then sum the results from the lookup operations for the individual parts. In particular, a lookup table may be provided which indicates the bit count for each possible value of a nibble, byte or 16-bit string. For 16-bit strings for example, a table of 65536 count values, indexed by string value, is provided. Each count value gives the set-bit count for the associated string value. For words longer than sixteen bits, each 16-bit string is looked up and the resulting count values are summed to obtain the total set-bit count. This type of process involves memory references for accessing the lookup table which can be slow on some system architectures and requires a lot of memory. Large lookup tables may not fit into processor memory caches so storage fetches can be expensive in terms of processor time.
According to one aspect of the present invention there is provided a method of counting the number of set bits in an n-bit data word in a data processing system, the method comprising:
(a) generating at least p1 intermediate n-bit words, where 1 less than p1 less than n, by masking and shifting the data word such that each intermediate word has n/p1p1-bit fields and n of the fields of the intermediate words represent the values of respective bits of the data word;
(b) summing the intermediate words to generate an n-bit derivative word having n/p1p1-bit fields whereby each field of the derivative word represents the sum of a respective group of bits of the data word; and
(c) summing the fields of the derivative word by performing shift, add and mask operations on thee derivative word, whereby the resulting sum value represents the number of set bits in said data word.
In accordance with the present invention, therefore, mask and shift operations are initially performed on the n-bit data word to generate at least p1 intermediate n-bit words where p1 is an integer between 1 and n. Each intermediate word has n/p1p1-bit fields and n of the fields of the intermediate words represent the values of respective bits of the data word. For example, for n=32 and p1=2, two intermediate 32-bit words may be generated each having sixteen 2-bit fields. Each of the thirty-two 2-bit fields of the intermediate words then indicates the value of a respective one of the thirty-two bits of the input data word. Thus, a set bit in the data word is represented by the 2-bit field 01, and an unset bit in the data word is represented by the 2-bit field 00. When the intermediate words are then summed in step b above, a derivative 32-bit word having sixteen 2-bit fields is obtained where each 2-bit field represents the sum of two bits of the data word. These fields are then summed by performing further shift, add and mask operations in step c above. The resulting sum value is therefore the sum of all bits in the original data word which thus gives the number of set bits in the word. In embodiments of the present invention therefore, the number of set bits in a data word can be counted simply by performing a series of shift, add and mask operations which are relatively inexpensive processor operations. Since branching or lookup operations are not required, the process can be performed more quickly than prior methods discussed above. Further, the processing time is independent of the content of a data word, so for a given word length the set-bit count can be obtained in a fixed time which is of substantial benefit to real time applications. Moreover, the process lends itself particularly well to implementation by processors with multiple concurrent execution units, allowing further speed gains to be achieved where such processors are provided.
For the sake of efficiency it is preferred that p1 intermediate words are generated in step a above, though methods can be envisaged where more than p1 intermediate words are generated in which case there will inevitably be more than n p1-bit fields overall, the additional fields in the n-bit words then being zero-value fields.
Where n is exactly divisible by p1, each intermediate word will of course have an integral number of p1-bit fields. However, the value of p1 need not be selected such that n is exactly divisible by p1, and in such cases each intermediate word, and hence the resulting derivative word, will have a fractional p1-bit field, ie. a partial field having less than p1 bits. At least one of these fractional or partial fields of the intermediate words will represent the value of one bit of the data word. The process is unaffected by the presence of such partial fields since in subsequent shift and add operations the xe2x80x9cmissing bitsxe2x80x9d will simply be treated as zeros.
The shift, mask and add operations in step c above can be performed in various ways to sum the fields of the derivative word, but in preferred embodiments step c comprises the steps of:
(c1) masking and shifting the derivative word to generate at least pi intermediate n-bit words, where 1 less than p1 less than n and i=2 in a first pass, each having n/xi xi-bit fields, where xi=pi*xixe2x88x92l and x1=p1, n/xixe2x88x921 of which fields represent the values of respective fields of the derivative word;
(c2) summing the intermediate words from step c1 to generate a new n-bit derivative word having n/xixi-bit fields each of which represents the sum of a respective group of bits of the data word; and
(c3) iteratively performing steps c1 and c2 for the new derivative word with i=i+1 in each successive pass, at least until a new derivative word with xi-bit fields is obtained where 2xi greater than n.
In such embodiments, therefore, the process is generally an iterative one, steps c1 and c2 corresponding generally to steps a and b above although the value of pi may be different in successive steps. As will be apparent from the detailed examples described below, the purpose of generating the intermediate words in step c1 with larger bit-fields than the last-generated derivative word is to allow groups of these fields to be summed in step c2 without overflow, ie. without the resulting sum bit-strings exceeding the field size. In some cases, however, the last-generated derivative word may be such that the maximum possible value of the sum of at least two of its fields will not overflow the existing field size of that derivative word. In such cases steps c1 and c2 can be simplified to reduce the number of processing operations required. In particular, in any pass of step c1 for a derivative word having xixe2x88x921-bit fields, if the maximum possible value V of the sum of qxe2x89xa72 fields of the derivative word satisfies V less than 2xixe2x88x921, steps c1 and c2 may be modified to:
summing fields of the derivative word in groups of q fields by performing shift and add operations on the derivative word and masking the resulting word to eliminate redundant fields, thereby generating a new derivative word having xi-bit fields each of which represents the sum of respective groups of bits of the data word. The shift, add and masking operations in modified steps c1 and c2 may be performed such that the new derivative word is an n-bit word having n/xi xi-bit fields, where xi=q*xixe2x88x921. (The multiplication operator is represented by the symbol xe2x80x9c*xe2x80x9d.) Alternatively, the shift, add and masking operations in modified steps c1 and c2 may be performed such that xi=xixe2x88x921 and the new derivative word has nxe2x80x2=n/q bits. For this latter case, in step c3 and subsequent passes of steps c1 and c2 n is replaced by nxe2x80x2. Examples of these cases will be explained in more detail below.
According to step c3 above, steps c1 and c2 (in their modified or unmodified forms as appropriate) may be iteratively performed just until a new derivative word is obtained with xi-bit fields satisfying 2x1 greater than n. Here, the field size xi is sufficiently large so that, for any possible value of the original n-bit word, all the fields of the derivative word can be summed without overflow. That is, a bit string of xi-bits is sufficient to represent the value of the sum of n set bits. Thus, when such a derivative word is obtained, the xi-bit fields may simply be summed by performing shift and add operations, whereby said resulting sum value indicating the set-bit count is obtained. The shift and add operations performed in this step may be implemented by multiplication of the derivative word as will be illustrated below. Alternatively in step c3, steps c1 and c2 may be iterated until a new derivative word is obtained with one field of xi=n bits, whereby said one field represents said resulting sum value.
It is to be appreciated that, where features are described herein with reference to a method of the invention, corresponding features may be provided in accordance with an apparatus embodying the invention, and vice versa. In particular, the present invention also provides a data processor comprising logic configured to perform a method as described above, and a data processing system including such a data processor.