1. Field of the Invention
The present invention relates to arithmetic processing and calculating using an electrical computer.
2. Description of the Prior Art
Field Programmable Gate Arrays (FPGAs) have high computational density (i.e., they offer a large number of bit operations per unit space-time) when they can be run at high throughput. To achieve this high density, designs must be aggressively pipelined to exploit the large number of registers in FPGA architectures. In the extreme, designs are pipelined so that only a single Look Up Table (LUT) delay and local interconnect lie in the latency path between registers. Pipelined at this level, conventional FPGAs should be able to run with clock rates in the hundreds of megahertz.
Pipelining may always be performed for acyclic designs (feed forward dataflow). It may be necessary to pipeline the interconnect, but the transformation can be performed and automated.
However, when a design has a cycle which has a large latency but only a few registers in the path, pipelining to this limit cannot be immediately performed. No legal retiming will allow reduction of the ratio between the total cycle logic delay (e.g., the number of LUTs in the path) and the total registers in the cycle. This often prevents pipelining the design all the way down to the single LUT plus local interconnect level and consequently prevents operation at peak throughput to use the device efficiently.
The device may be used efficiently by interleaving parallel problems in C-slow fashion, but the throughput delivered to a single data stream is limited. In a spatial pipeline of streaming operators, the throughput of the slowest operator will serve as a bottleneck, forcing all operators to run at the slower throughput, preventing achievement of high computational density.
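The limit described above can be put into numbers with a small sketch. It assumes an idealized model in which every LUT contributes one unit of delay and interconnect is ignored; the function names and the example cycle (8 LUTs, 1 register) are hypothetical and chosen only for illustration.

```python
import math

def min_luts_per_clock(cycle_luts, cycle_registers):
    """Lower bound on LUT delays per clock period for a cycle.

    Retiming moves registers around a cycle but cannot change how many
    there are, so the period cannot drop below this ratio (rounded up).
    """
    return math.ceil(cycle_luts / cycle_registers)

def c_slow(cycle_luts, cycle_registers, c):
    """Model C-slowing: every register in the cycle becomes c registers.

    Returns (clock period in LUT delays, LUT delays between successive
    samples of any one of the c interleaved streams).
    """
    period = min_luts_per_clock(cycle_luts, cycle_registers * c)
    per_stream_interval = period * c
    return period, per_stream_interval

print(min_luts_per_clock(8, 1))  # period of the original cycle
print(c_slow(8, 1, 8))           # period and per-stream interval after 8-slowing
```

In this model, 8-slowing brings the clock period down to a single LUT delay, yet each individual stream still receives a new result only once every 8 LUT delays, which is the single-stream limitation noted above.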
The use of associative reduce trees on modulo arithmetic (including modulo addition and modulo accumulation) to introduce parallelism into accumulations, reducing the time required to accumulate numbers, is known.
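As an illustration of the reduce-tree idea, the following minimal sketch sums a list of values modulo 2**width by combining adjacent pairs level by level. The function names and the 8-bit default width are assumptions made for the example; in hardware, all additions within one level would run in parallel.

```python
def mod_add(a, b, width=8):
    """Wraparound (modulo 2**width) addition."""
    return (a + b) & ((1 << width) - 1)

def sequential_sum(values, width=8):
    """Reference: accumulate one value at a time (N dependent steps)."""
    acc = 0
    for v in values:
        acc = mod_add(acc, v, width)
    return acc

def reduce_tree_sum(values, width=8):
    """Combine adjacent pairs level by level (about log2(N) levels)."""
    level = [v & ((1 << width) - 1) for v in values]
    if not level:
        return 0
    while len(level) > 1:
        nxt = [mod_add(level[i], level[i + 1], width)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd element passes through unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

data = [200, 100, 50, 25, 77]
print(sequential_sum(data), reduce_tree_sum(data))
```

Because modulo addition is associative, the regrouping performed by the tree gives the same result as the sequential loop.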
The use of parallel prefix to compute the series of intermediate partial sums in a modulo addition, with only a constant factor more operators than the associative reduce tree that produces the final sum, is known.
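One way to see why only a constant factor more operators is needed is the recursive formulation sketched below: add adjacent pairs, compute the prefix sums of the pair sums recursively, then fill in the remaining positions with one extra addition each. This is a generic work-efficient prefix scheme written for illustration, not a description of any particular prior-art circuit; the names and the 8-bit width are assumptions.

```python
def mod_add(a, b, width=8):
    """Wraparound (modulo 2**width) addition."""
    return (a + b) & ((1 << width) - 1)

def prefix_sums(values, width=8):
    """Return all running sums of values, modulo 2**width.

    Uses roughly 2*N additions in total, versus N - 1 for a reduce tree
    that yields only the final sum.
    """
    mask = (1 << width) - 1
    n = len(values)
    if n <= 1:
        return [v & mask for v in values]
    # Level 1: sums of adjacent pairs (independent, hence parallel).
    pairs = [mod_add(values[i], values[i + 1], width)
             for i in range(0, n - 1, 2)]
    # Recursive step: prefix sums over the half-length list of pair sums.
    scanned = prefix_sums(pairs, width)
    out = []
    for i in range(n):
        if i == 0:
            out.append(values[0] & mask)
        elif i % 2 == 1:
            out.append(scanned[i // 2])            # ends on a pair boundary
        else:
            out.append(mod_add(scanned[i // 2 - 1], values[i], width))
    return out

print(prefix_sums([200, 100, 50, 25, 77]))
```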
The use of delayed addition to perform a modulo accumulation step in constant (O(1)) time, leaving the accumulator ready for the next input, is known.
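A minimal sketch of this idea follows, under the assumption that delayed addition is realized as a carry-save accumulator: the running total is held as two words (a sum word and a carry word), each step uses only bitwise operations of constant depth, and the carry-propagating add is deferred until the result is read. The function names and the 8-bit width are hypothetical.

```python
def carry_save_step(s, c, x, width=8):
    """Absorb input x into the (sum, carry) pair without carry propagation."""
    mask = (1 << width) - 1
    new_s = s ^ c ^ x                              # per-bit sum
    new_c = ((s & c) | (s & x) | (c & x)) << 1     # per-bit carry, shifted up
    return new_s & mask, new_c & mask

def accumulate_delayed(values, width=8):
    """Accumulate modulo 2**width, resolving carries only at the end."""
    mask = (1 << width) - 1
    s, c = 0, 0
    for x in values:
        s, c = carry_save_step(s, c, x & mask, width)
    return (s + c) & mask                          # the single delayed add

print(accumulate_delayed([200, 100, 50, 25, 77]))
```

The invariant is that s + c equals the running total modulo 2**width after every step, so the loop body never waits on a carry chain.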
As a result, modulo accumulation admits area-time tradeoffs that allow spending area (parallelism) to increase the throughput of accumulation (i.e., to handle more input values per unit time). Modulo accumulation can be performed arbitrarily fast compared to the raw speed of the gates.
The use and value of saturated addition to keep accumulator widths low while accumulating data which may overflow (or underflow) the accumulator width is known.
Saturated accumulation, however, is often a slow operation that limits clock rates on designs. Saturated accumulation is a common signal processing operation with a cyclic dependence which prevents aggressive pipelining. As such, it can serve as the rate limiter in streaming applications.
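For concreteness, a saturated accumulator can be sketched as follows. The signed 8-bit bounds (-128 and 127) and the function names are assumptions chosen for the example. Each new accumulator value depends on the previous, already-clamped value, which is the cyclic dependence referred to above.

```python
def sat_add(acc, x, lo=-128, hi=127):
    """Add x to acc, clamping the result to the range [lo, hi]."""
    return max(lo, min(hi, acc + x))

def sat_accumulate(values, lo=-128, hi=127):
    """Return the accumulator value after each input."""
    acc = 0
    trace = []
    for x in values:
        acc = sat_add(acc, x, lo, hi)   # depends on the previous clamped acc
        trace.append(acc)
    return trace

print(sat_accumulate([100, 100, -100, -200, 50]))
```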
Hitherto, it has been believed that saturated accumulation is not an associative operation and hence that the associative transformation techniques for trading increased area for reduced time (increased throughput) which worked for modulo addition (e.g., associative reduce, parallel prefix, delayed addition) do not directly apply to saturated accumulation.
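The source of this belief can be seen in a small numeric example. Treating saturating addition as a binary operation on values, regrouping the same three inputs changes the result. The signed 8-bit bounds and the function name are assumptions made for the illustration.

```python
def sat_add(a, b, lo=-128, hi=127):
    """Saturating addition of two values, clamped to [lo, hi]."""
    return max(lo, min(hi, a + b))

# Left-to-right grouping, as an accumulator would compute it:
left = sat_add(sat_add(100, 100), -100)    # 200 clamps to 127, then 27

# Regrouped, as a naive reduce tree would compute it:
right = sat_add(100, sat_add(100, -100))   # inner sum is 0, then 100

print(left, right)
```

The two groupings disagree (27 versus 100), so simply substituting a saturating adder into a reduce tree or prefix network built for modulo addition would not reproduce the sequential result bit for bit.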
P. I. Balzola, M. J. Schulte, J. Ruan, J. Glossner, and E. Hokenek, in “Design Alternatives for Parallel Saturating Multioperand Adders,” Proceedings of the International Conference on Computer Design, September 2001, pp. 172-177, attacked the problem of saturating accumulation at the bit level. They observed they could reduce the logic in the critical cycle by computing partial sums for the possible saturation cases and using a fast, bit-level multiplexing network to rapidly select and compose the correct final sums. They were able to reduce the cycle so it only contained a single carry-propagate adder and some bit-level multiplexing. For custom designs, this minimal cycle may be sufficiently small to provide the desired throughput although it may not be suitable when the designer has less freedom to implement a fast adder and must pay for programmable interconnect delays for the bit-level control.
Many operations would benefit from being able to use fast saturated accumulation.
Therefore, techniques and methods are needed that provide fast saturated accumulation. There is a need for techniques and methods that allow performing an area-time tradeoff for saturated accumulations, allowing the spending of additional area to increase the throughput of bit-accurate saturated accumulation at fixed gate speeds.
Saturated accumulation is an example of an operation with a loop-carried dependency. A loop is a sequence of statements which is specified once but which may be performed several times in succession (iterations). A loop-carried dependency results when an iteration of a loop computes a value that is required by a subsequent iteration of the loop. In general, there is a need for techniques and methods that permit increased speed of processing loops with loop-carried dependencies, such as but not limited to saturated accumulation.
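The distinction can be illustrated with two short loops. Both functions are hypothetical examples written for this purpose; the first-order recurrence is used only as a generic stand-in for a loop with a loop-carried dependency.

```python
def scale_all(xs, k):
    """No loop-carried dependency: iteration i reads only xs[i].

    The iterations could be executed in any order or all at once.
    """
    return [k * x for x in xs]

def running_filter(xs, a):
    """Loop-carried dependency: iteration i needs y from iteration i - 1."""
    y = 0
    out = []
    for x in xs:
        y = a * y + x       # y is carried from one iteration to the next
        out.append(y)
    return out

print(scale_all([1, 2, 3], 2))
print(running_filter([1, 2, 3], 2))
```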
There is a need for techniques and methods that allow performing an area-time tradeoff for loops with loop-carried dependencies, allowing the spending of additional area to increase the throughput at fixed gate speeds.