The following discussion of the background of the present invention refers to multiplication. However, because division can be performed as a series of subtractions, just as multiplication can be performed as a series of additions, the discussion can be related to division with minor modification.
A digital processor hardware designer who must include a multiplication function in a design faces two choices: the function can be performed (1) by a hardware multiplier or (2) by software or micro-code using a conventional ALU (arithmetic and logic unit). In most cases, the ALU already exists in the design and is used to support other functions. A hardware multiplier, on the other hand, is very complex and expensive in terms of gate count and board space. Therefore, a hardware multiplier is not a popular choice unless the design requires very high performance and is not cost sensitive. Using a conventional ALU, assisted by micro-code or software, to perform one-bit-at-a-time multiplication is a very common practice. See Chapter 3 of "Digital Computer Arithmetic" by Cavanagh, 1984, McGraw-Hill, Inc., for multiplication algorithms. Chapter 4 of the same book has similar algorithms for division.
The size of the multiplier must be known before performing a one-bit-at-a-time multiplication function. In most designs, the size of the multiplicand must also be known so that the two input operands and the output result word can be placed in two registers instead of three, thereby increasing the speed of the multiplication operation. For example, the following algorithm is used for multiplication:
Step 1: Initialize the result word to zero and set the current bit pointer to point to the MSB of the multiplier.
Step 2: Shift the result word one bit to the left, effectively multiplying it by two.
Step 3: If the current bit is a one, the multiplicand is added to the result word.
Step 4: If the current bit is the LSB of the multiplier, go to step 5. Otherwise, move the current bit pointer one bit position toward the LSB in the multiplier and go to step 2.
Step 5: The multiplication is complete; the result word contains the product.
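The five steps above can be sketched in software as follows. This is only an illustrative model of the one-bit-at-a-time procedure, not micro-code for any particular ALU; the function name `shift_add_multiply` and the parameter `multiplier_bits` (the assumed, known size of the multiplier) are hypothetical names introduced for this example, and the operands are assumed to be unsigned.

```python
def shift_add_multiply(multiplicand: int, multiplier: int, multiplier_bits: int) -> int:
    """Model of MSB-first, one-bit-at-a-time multiplication (unsigned operands)."""
    if multiplicand < 0 or multiplier < 0:
        raise ValueError("this sketch assumes unsigned operands")
    if multiplier >> multiplier_bits:
        raise ValueError("multiplier does not fit in multiplier_bits")

    # Step 1: clear the result; the bit pointer starts at the multiplier's MSB.
    result = 0
    for bit in range(multiplier_bits - 1, -1, -1):
        # Step 2: shift the result one bit to the left (multiply by two).
        result <<= 1
        # Step 3: add the multiplicand if the current multiplier bit is a one.
        if (multiplier >> bit) & 1:
            result += multiplicand
        # Step 4: the loop moves the pointer one position toward the LSB.
    # Step 5: the result word holds the product.
    return result


print(shift_add_multiply(13, 11, 4))  # 143
```

The loop body runs once per multiplier bit regardless of the multiplier's value, which is why the assumed multiplier size directly sets the cost of the operation.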
In this procedure, steps 2 to 4 must be performed for each bit of the multiplier. These three steps typically will take several micro-code instructions to perform. If each of the instructions takes one clock to execute, the cost of multiplication will be several clocks per bit.
The Common Input Output Module (CIOM) Sequencer of Unisys Corporation, Blue Bell, Pa. (the assignee of the present invention) uses an enhanced ALU to allow a special "multiply-assist" instruction to perform steps 2 through 4 in one clock. However, even at this one-clock-per-bit speed, a 20-bit wide multiplier still needs 20 clocks to complete a multiplication function, excluding the initial overhead of setting up the operands and the current bit pointer. Although this is the fastest possible speed with a one-bit-at-a-time multiplication algorithm, it is still very slow compared to a one-clock addition or subtraction function supported by the same CIOM Sequencer.
The dominant limitation on the speed of a one-bit-at-a-time multiplication operation is the size of the multiplier (in bits). The present invention was made in the process of attempting to make the multiplication operation faster by reducing the size of the multiplier. A multiplier is usually acquired from a storage device somewhere in the design or obtained as the result of a previous calculation. A micro-code programmer usually knows the largest possible size of this operand. Its exact size, however, is not known in advance and can vary each time the same code stream is executed. Algorithms for determining the exact size of an operand are not much faster than processing the bits through the multiplication algorithm described above, because it typically takes one clock or more to eliminate each leading zero bit. Therefore, micro-code programmers assume the worst case multiplier size when implementing a multiplication function.
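The trade-off described above can be illustrated with a small model. The sketch below strips leading zero bits one per iteration and counts the iterations as "clocks"; the one-clock-per-bit cost and the names `significant_bits` and `max_bits` are assumptions made for this example only.

```python
def significant_bits(operand: int, max_bits: int) -> tuple[int, int]:
    """Return (exact size in bits, iterations spent stripping leading zeros)."""
    size = max_bits   # start from the worst case size known to the programmer
    clocks = 0        # assumed cost: one clock per leading zero eliminated
    while size > 0 and not ((operand >> (size - 1)) & 1):
        size -= 1
        clocks += 1
    return size, clocks


size, clocks = significant_bits(5, 20)
print(size, clocks)  # 3 17
```

Under these assumptions, a 20-bit field holding the value 5 needs 17 iterations to find its 3 significant bits, so scanning and then multiplying costs 17 + 3 = 20 iterations, the same as simply multiplying with the worst case size.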
U.S. Pat. No. 4,615,016, Sep. 30, 1986, titled "Apparatus For Performing Simplified Decimal Multiplication By Stripping Leading Zeroes," discloses a BCD (binary coded decimal) multiplication unit that uses firmware to compute significant digits. The patent discloses that a counter register is initially loaded with the maximum width of an operand (e.g., 16 bits), and then a microinstruction loop is executed to scan the operand one digit at a time starting from the most significant digit. Each trip through the loop checks a single digit for zero and the counter is decremented by one if a zero is detected. The loop is repeated for each digit until a non-zero digit is encountered or there are no digits left to check. At the end of this process, the counter register contains the number of significant digits. This process is relatively slow, its execution time being proportional to the number of leading zeros in the operand.
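The digit-scanning loop summarized above might be modeled as follows. This is a sketch based only on the description in this section, not the firmware of the cited patent; it assumes a packed BCD operand (four bits per digit), and the names `count_significant_digits` and `max_digits` are hypothetical.

```python
def count_significant_digits(packed_bcd: int, max_digits: int) -> int:
    """Count significant BCD digits by scanning from the most significant digit."""
    counter = max_digits  # counter register loaded with the maximum operand width
    for position in range(max_digits - 1, -1, -1):
        digit = (packed_bcd >> (4 * position)) & 0xF  # extract one BCD digit
        if digit != 0:
            break         # first non-zero digit ends the scan
        counter -= 1      # one leading zero digit stripped per trip
    return counter


print(count_significant_digits(0x00001234, 8))  # 4
```

Each leading zero digit costs one trip through the loop, so the scan time grows with the number of leading zeros in the operand.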