1. Field of the Invention
This disclosure relates generally to computer processors, and in particular to a circuit for performing a pipelined divide operation for small operand sizes.
2. Description of the Related Art
Computer processors typically have special units for handling arithmetic operations. The most difficult of the four traditional arithmetic operations tends to be division. Typically, divide operations are long-latency, low-throughput operations. Often a divide unit is built to handle large operands, such as 53-bit mantissa operands from a double-precision floating-point number as defined by the IEEE 754 standard. The result of the divide operation will usually be available only after a large number of cycles, with the cycle count determined by the length of the input operands.
Typically, there is a minimum overhead involved in performing a divide operation, so that if a divide unit handles large operand sizes, divide operations on small operands will still have a long latency, even though that latency could potentially be reduced. Additionally, if there is only one divide unit for a particular processor, multiple threads may share the same divide unit, leading to long delays when one thread must wait for divide operations from another thread to finish.
There are a variety of different ways to implement a divider, and one such way is through the use of a subtractive algorithm. In such an approach, a divider may be configured to iteratively produce a quotient from a dividend (i.e., a numerator) and a divisor (i.e., a denominator) by performing a sequence of shift, subtract, and compare operations, similar to standard long division. Subtractive division algorithms may generally be characterized by the following equation:

$$P_{j+1} = r P_j - q_{j+1} D$$

where $P_j$ denotes the partial remainder, $r$ denotes the radix of the algorithm, $D$ denotes the divisor, and $q_{j+1}$ denotes the quotient digit corresponding to the partial remainder generated by a given iteration of the algorithm. Successive partial remainders may be generated by multiplying a previous partial remainder by the radix and then subtracting the product of the selected quotient digit and the divisor. For example, the divider may be configured to implement a restoring division algorithm in which the quotient digits 'q' are selected from the set {0, 1}. As indicated by the above equation, the quotient digit is an input that determines the next partial remainder.
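The shift, compare, and subtract sequence described above may be illustrated by the following minimal software sketch of radix-2 restoring division on unsigned integers. It is offered only as an aid to understanding and is not a description of any particular divider circuit. The function name `restoring_divide` and the `width` parameter are hypothetical, and the sketch assumes the common integer variant in which one dividend bit is shifted into the partial remainder on each iteration.

```python
def restoring_divide(dividend: int, divisor: int, width: int) -> tuple[int, int]:
    """Radix-2 restoring division of unsigned `width`-bit integers.

    Illustrative sketch only: names and the bit-serial integer
    formulation are assumptions, not taken from the disclosure.
    Returns (quotient, remainder).
    """
    if divisor == 0:
        raise ZeroDivisionError("divisor must be nonzero")
    if not (0 <= dividend < (1 << width)) or not (0 < divisor < (1 << width)):
        raise ValueError("operands must fit in `width` unsigned bits")

    partial = 0   # partial remainder P_j
    quotient = 0
    for i in reversed(range(width)):
        # Shift: multiply the partial remainder by the radix (r = 2)
        # and bring in the next dividend bit, most significant first.
        shifted = (partial << 1) | ((dividend >> i) & 1)
        # Compare: select the quotient digit q_{j+1} from {0, 1}.
        q = 1 if shifted >= divisor else 0
        # Subtract: P_{j+1} = r*P_j - q_{j+1}*D. When q is 0 nothing is
        # subtracted, which is equivalent to restoring the remainder.
        partial = shifted - q * divisor
        quotient = (quotient << 1) | q
    return quotient, partial


if __name__ == "__main__":
    print(restoring_divide(13, 3, 4))  # (4, 1)
```

In this sketch each loop iteration produces one quotient bit, so the iteration count equals the operand width. This reflects the relationship between operand length and latency noted above.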
If a divider were limited to small operand sizes, and the divide operation were completed in only a few cycles, it would allow for a considerable improvement in the latency and efficiency of the divider as compared to a divider that has to handle large operand sizes. Furthermore, if the architecture were pipelined, new dividend and divisor input operands could be applied to the divider on each instruction cycle instead of waiting for each instruction to finish. Therefore, what is needed is a way to perform a divide operation on small operand sizes using a pipelined architecture, to reduce the latency and increase throughput of the divide unit.
In view of the above, improved circuits for performing a divide operation on small operand sizes are desired.