The present invention relates generally to computer microarchitecture designs, and, more particularly, to a shared parallel adder tree for executing multiple different population count operations on a single datum.
One common task in digital computing is to count the number of binary “1”s in a string or packet of bits (i.e., a “datum”). Such population count operations are important for various digital applications, including communication, encryption, decryption, voice recognition, encoding and many others. It is also important that the population count operations take place at a relatively fast rate, so as not to slow down the entire digital computing system.
Known population counters are implemented with carry-save adder (CSA) devices arranged in a tree configuration. CSAs are used in place of the full adders that are also common in prior art population counters because CSAs are much faster: unlike full adders, CSAs do not propagate carries while the instruction executes. Propagating the carries with full adders adds a relatively large amount of time to the execution of the entire instruction. In contrast, a CSA outputs the carry as a separate part of its binary output value, with the other part being the partial sum. This allows some computer microarchitecture designs to execute a population count instruction in a single CPU cycle.
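The carry-save behavior described above can be illustrated with a minimal software sketch. It is a bit-level model for explanation only, not a description of any particular circuit, and the function names `carry_save_add` and `resolve` are hypothetical names chosen for this illustration. Three operands are reduced to a partial sum and a separate carry word without any carry rippling between bit positions. A single carry-propagating addition is deferred to the very end.

```python
def carry_save_add(a, b, c):
    """Reduce three operands to a (partial_sum, carry) pair.

    Each bit position is handled independently, so no carry
    ripples from one position to the next.
    """
    partial_sum = a ^ b ^ c                  # per-bit sum, carries ignored
    carry = (a & b) | (a & c) | (b & c)      # per-bit carry-out (majority of the three bits)
    return partial_sum, carry


def resolve(partial_sum, carry):
    """Perform the one deferred carry-propagating addition."""
    return partial_sum + (carry << 1)        # each carry has the weight of the next bit position


s, c = carry_save_add(0b1011, 0b0110, 0b1101)
print(resolve(s, c))  # 11 + 6 + 13 = 30
```

In a tree of such stages, only the final stage pays the cost of carry propagation, which is the source of the speed advantage noted above.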
However, as computers trend towards increasingly wider data word widths (e.g., 64 bits versus 32 bits), designing the corresponding computer microarchitecture on the microchip or integrated circuit (IC) to achieve the desired fast speeds of instruction execution is becoming increasingly difficult. Also, with the wider data word widths it is desirable to perform population count operations not only on the entire 64-bit word, but also on portions of the word (i.e., on smaller sub-words of, for example, 8, 16 or 32 bits). Typically, it has been necessary to design a separate or dedicated parallel adder tree into the computer microarchitecture for each desired population count operation. That is, a parallel adder tree is not “shared” by the different population count operations. This leads to inefficient usage of the microchip area.
What is needed is a single parallel adder tree that allows portions of the tree, or “subtrees”, to be shared in order to perform or execute multiple, different population count operations on a single datum, thereby providing for a relatively smaller area on the microchip to be taken up by the population count circuitry, faster operation in carrying out the multiple population count operations, and overall relatively lower power usage by the microchip.
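The idea of sharing subtrees can be sketched in software as follows. This is an illustrative model only, assuming a 64-bit datum and sub-word widths of 8, 16 and 32 bits. The function name `shared_popcounts` and the dictionary layout are hypothetical and are not taken from any particular design. The eight byte counts are computed once, and each wider count is formed by adding pairs of results from the level below, so no level recounts individual bits.

```python
def shared_popcounts(word):
    """Return population counts for every 8-, 16-, 32- and 64-bit field.

    Index 0 of each list is the least significant sub-word.
    """
    word &= (1 << 64) - 1  # treat the input as a 64-bit datum

    # Lowest level: one count per byte (eight small subtrees).
    byte_counts = [bin((word >> (8 * i)) & 0xFF).count("1") for i in range(8)]

    # Each higher level reuses the results of the level below it.
    half_counts = [byte_counts[2 * i] + byte_counts[2 * i + 1] for i in range(4)]
    word_counts = [half_counts[2 * i] + half_counts[2 * i + 1] for i in range(2)]
    total = word_counts[0] + word_counts[1]

    return {8: byte_counts, 16: half_counts, 32: word_counts, 64: [total]}


counts = shared_popcounts(0x00FF00000000F00F)
print(counts[8])   # [4, 4, 0, 0, 0, 0, 8, 0]
print(counts[16])  # [8, 0, 0, 8]
print(counts[32])  # [8, 8]
print(counts[64])  # [16]
```

All four result sets come from one pass over the datum, in contrast to the dedicated-tree approach, in which each operation would count the same bits again.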