1. Field of the Invention
The invention relates generally to arithmetic logic units. More specifically, the invention relates to arithmetic logic units capable of accumulating floating point values.
2. Related Art
One of the most basic arithmetic operations utilized in information processing devices is the accumulation of a series of values. Accumulation at its most basic is implemented by use of a memory, such as a register, and an adder device. For instance, FIG. 1 shows a multiply-and-accumulate (MAC) device 100 that accumulates products resulting from the multiplying of two input values. MAC device 100 includes a multiply unit 110 that provides the product of two input values as its output. MAC device 100 also includes an adder unit 120 that provides the sum of two values as its output. Each product resulting from multiply unit 110 is summed with an accumulated total stored in an accumulator 130. The output of the adder unit 120 is sent to the accumulator 130, which keeps a running total of the additions. One common problem with such basic configurations as MAC 100 is that a new product results from multiply unit 110 much faster than the previous product can pass through the adder unit 120 and be accumulated. MAC 100 would work better if the adder unit 120 operated in only a single cycle. However, as is often the case, the adder units themselves are multi-cycle in nature and thus, if the inputs are to be fed at the proper time, delay elements must be introduced in the system.
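The data flow of FIG. 1 can be sketched behaviorally as follows. This is only an illustrative software model, not a description of the hardware timing; the function name `mac` and the sample inputs are assumptions introduced for the example.

```python
def mac(pairs):
    """Accumulate the products of (x, y) input pairs, one pair per step."""
    accumulator = 0.0                        # models accumulator 130
    for x, y in pairs:
        product = x * y                      # models multiply unit 110
        accumulator = accumulator + product  # models adder unit 120
    return accumulator


# Hypothetical inputs: 1*2 + 3*4 + 0.5*8
result = mac([(1.0, 2.0), (3.0, 4.0), (0.5, 8.0)])
print(result)
```

Each pass through the loop depends on the accumulator value left by the previous pass, which is the dependency that becomes a bottleneck when the adder takes more than one cycle.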
FIG. 2 exemplifies the conventional use of a multi-cycle adder unit which is controlled by the introduction of intentional delay elements. The MAC 200 of FIG. 2 accommodates multi-cycle adder latency by introducing a delay element 215. The multiply unit 210 provides the product of two input values to the delay element 215 and to a MUX (multiplexor) 219. MUX 219 selects between the output of the previous accumulate stored in memory MEM 218 and the current, non-delayed product result. The output of the delay element 215 forms the first of two inputs to another MUX 217. MUX 217 selects either the output of the delay element 215 or the accumulated value of a previous cycle stored in memory MEM 216. The outputs of MUX 217 and MUX 219 are summed by add unit 220, and the sum is then sent to memories 216 and 218.
FIG. 3 illustrates a conventional accumulator design using a tree adder. The adder-accumulator structure 300 uses an adder tree prior to the accumulate stage to compensate for the delay in the accumulate cycle. The tree adder takes inputs in groups of 4 on a given cycle. Two of the 4 inputs are summed by an adder unit 304 while the other two are summed by a parallel adder unit 306. The outputs of the adder 304 and the adder 306 are summed by an adder 310. The output of adder 310 is passed to the final adder unit 320 and summed with the accumulated value (of previous adds) stored in accumulator 330. The delay through the adders 304, 306 and 310 allows the accumulator to act in phase with the multiply units (not shown) providing the initial adds.
The adder-accumulator structure 300 is a balanced “tree” that works best if the inputs to be accumulated are in fixed groups of 4. If the number of inputs to be accumulated is not a multiple of 4, then zeroes would need to be fed through the adders as appropriate. This creates more adds than are necessary and slows performance. Further, if the inputs break in an uneven manner, where, for instance, inputs belonging to one logical string are added with inputs of another logical string, the structure must stall the accumulator until the proper inputs for each string are completed.
Consider the summing and accumulation of 6 inputs, A, B, C, D, E and F. In a tree adder such as that illustrated in FIG. 3, the inputs A and B would be summed in one cycle, and in parallel the inputs C and D would be added. On the next cycle, the results of A+B and C+D would be summed together and sent to the adder 320. However, since there are actually a total of six inputs to accumulate, the inputs E and F would have to be added on the second cycle (when the results of the first cycle are being summed by adder 310) along with a pair of zeroes. This results in more additions than would ordinarily be required. If only three inputs A, B and C were to be accumulated together, then some logic would have to insert a zero to prevent input D from being summed along with A, B and C. Where the grouping of inputs is arbitrary and not known in advance, it is difficult using conventional designs to avoid slow results.
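The zero-padding overhead described above can be illustrated with a small software model of the FIG. 3 structure. This is a sketch under stated assumptions: the helper names `pad_to_groups` and `tree_accumulate` are hypothetical, and every adder operation, including those on padding zeroes, is counted as one add.

```python
def pad_to_groups(values, group_size=4):
    """Split values into fixed-size groups, padding the last group with zeroes."""
    padded = list(values)
    remainder = len(padded) % group_size
    if remainder:
        padded.extend([0.0] * (group_size - remainder))
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


def tree_accumulate(values):
    """Return (total, number of adds) for a 4-input tree feeding an accumulator."""
    total = 0.0
    adds = 0
    for a, b, c, d in pad_to_groups(values):
        left = a + b               # models adder 304
        right = c + d              # models adder 306
        group_sum = left + right   # models adder 310
        total = total + group_sum  # models adder 320 with accumulator 330
        adds += 4
    return total, adds


# Six inputs A..F (sample values are assumptions) are padded to eight.
total, adds = tree_accumulate([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(total, adds)
```

In this model the six inputs cost eight adds, whereas a simple serial chain over the same six values needs only five.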
Worse still, the results provided by such adder-accumulator structures may be inaccurate. This occurs where the addition is not strictly associative in nature. Floating point add operations are examples of such operations that are not necessarily associative. This is due to the nature of floating point adds, which consider
FIG. 4 shows the add order of a balanced tree adder-accumulator. The adds A+B, C+D and E+F can be performed in any order, and even simultaneously, since their outputs do not depend on each other. Thus at a layer 410 all of these preliminary adds can be performed. At the second layer 420, the result of A+B is summed with the result of C+D computed at layer 410. Since E+F does not yet need to be summed with any other result at layer 420, it passes down to the third layer 430. At the third layer 430, the result of the layer 420 add is summed with the E+F result of layer 410. The result of these adds is in the grouped order ((A+B)+(C+D))+(E+F). The single-cycle adder without delay elements would result in the grouped order of adds ((((A+B)+C)+D)+E)+F. When adding floating point values, the two methods may produce different results. While the single-cycle add method is cost-effective, it is slow, since each add depends upon the result of the previous add; the tree adder method is more efficient in that dependencies are reduced.
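The difference between the two grouped orders can be demonstrated with double-precision values. The following is a minimal sketch, assuming IEEE 754 binary64 arithmetic as used by Python floats; the particular values chosen for A through F are assumptions picked to make the effect visible.

```python
# Hypothetical inputs: one large value, four small ones, and a cancelling value.
A, B, C, D, E, F = 1e16, 1.0, 1.0, 1.0, 1.0, -1e16

# Serial order: ((((A+B)+C)+D)+E)+F
# Each 1.0 is lost because the spacing between doubles near 1e16 is 2.
serial = ((((A + B) + C) + D) + E) + F

# Tree order: ((A+B)+(C+D))+(E+F)
# C+D forms 2.0 first, which is large enough to survive the add to 1e16.
tree = ((A + B) + (C + D)) + (E + F)

print(serial, tree)
```

Under these assumptions the serial order yields 0.0 and the tree order yields 2.0, while the exact mathematical sum is 4. Neither order is exact, and they disagree with each other, which is why a design that silently changes the order of adds can change the result.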
In floating point add operations, the grouping of inputs that need to be accumulated may be constantly and arbitrarily changing. The first three inputs may need to be accumulated, then the next six, then the next two, then the next three and so forth. Since it is impractical to cover all such situations by restructuring a tree adder for each or by building a tree adder for each case, there needs to be a floating point add solution that can make use of a fixed structure, preserve the order of adds, and allow for arbitrary grouping thereof. Therefore, a method and apparatus for accumulating an arbitrary length string of inputs is desired which can minimize the number of adders while maintaining the correct order of operations.
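The desired behavior, as opposed to any particular hardware that achieves it, can be stated as a software reference model: each group of arbitrary length is summed strictly left to right, with no padding and no mixing between groups. The function name `accumulate_groups` and the sample data are assumptions made for this illustration only.

```python
def accumulate_groups(values, group_lengths):
    """Sum consecutive groups of values in strict left-to-right order."""
    results = []
    position = 0
    for length in group_lengths:
        total = 0.0
        for value in values[position:position + length]:
            total = total + value  # order of adds is preserved within the group
        results.append(total)
        position += length
    return results


# Groups of 3, 6, 2 and 3 inputs, matching the sequence described above.
values = [float(n) for n in range(1, 15)]
sums = accumulate_groups(values, [3, 6, 2, 3])
print(sums)
```

A fixed hardware structure meeting the stated goal would be expected to reproduce these per-group results bit for bit, since the order of adds within each group is the same as in this serial model.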