Floating-point accumulation is a fundamental floating-point operation that is widely used in fields such as process control and digital signal processing. Floating-point arithmetic systems have traditionally been implemented on general-purpose floating-point processors or digital signal processors (DSPs), which offer relatively mature technology, well-developed implementation tools, and simple programming. However, because of limitations in their internal structure, these processors often suffer from events such as cache misses during calculation, which degrades the calculation performance of the system. Designs based on general-purpose processors and DSPs can sustain only 10%-33% of their peak calculation performance, which makes higher performance difficult to obtain.
In recent years, FPGA technology has developed rapidly, moving from its early role as a replacement for pure logic to complex, computation-intensive applications. In addition to abundant configurable logic blocks (CLBs), recent FPGA devices integrate a large number of DSP units, block RAM (BRAM), and RocketIO GTP transceivers for high-speed serial communication. Meanwhile, to facilitate FPGA debugging, manufacturers have developed on-chip logic analysis tools (such as ChipScope from Xilinx). Together, these developments make high-performance calculation on the FPGA feasible in terms of both hardware and software. In floating-point arithmetic, FPGAs are being increasingly applied owing to their flexible configuration and low power consumption.
A floating-point adder inside an FPGA is usually built from logic resources or configurable DSP modules. To reach a higher operating speed, the adder is usually pipelined with as many as 10 stages, so each addition result is output only after a considerable latency. For this reason, a conventional FPGA-based floating-point accumulator performs the additions level by level, from the lowest level to the highest; the results of each level are stored in an internal buffer before being used in the subsequent additions. With this scheme, when the number of operands at some level is equal to or smaller than the pipeline depth of the floating-point adder, the adder may sit idle, because the pipeline takes longer to complete an operation than the data takes to arrive. The final accumulation result is therefore output with significant latency relative to the input of the original data. Moreover, the input data for the next floating-point accumulation can be accepted only when the previous accumulation is nearly complete, which adds further delay. In applications with strict real-time requirements, such an accumulator cannot meet the demands. Although the problem can be solved by providing more floating-point adders, the complexity of floating-point arithmetic means that the consumption of FPGA logic resources or DSP modules would increase dramatically.
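The idle time described above can be made concrete with a rough cycle-count model. The sketch below is illustrative only and is not taken from any particular design: the function name `level_by_level_cycles` and its parameters are hypothetical, and it assumes that all operands are already buffered, that a single adder accepts one pair of operands per clock cycle, and that a level starts only after every result of the previous level has left the pipeline.

```python
def level_by_level_cycles(n_values, latency):
    """Estimate the cost of a level-by-level accumulation on one pipelined adder.

    Assumptions (illustrative, not from a specific design):
      - all n_values operands are available at cycle 0;
      - the adder accepts one operand pair per cycle;
      - a result appears `latency` cycles after its pair is issued;
      - a level begins only when the previous level has fully completed.

    Returns (total_cycles, additions_issued).
    """
    cycles = 0
    additions = 0
    remaining = n_values
    while remaining > 1:
        pairs = remaining // 2
        # Pairs are issued on consecutive cycles; the last result of this
        # level emerges `latency` cycles after the last pair is issued.
        cycles += (pairs - 1) + latency
        additions += pairs
        # An unpaired operand is carried over to the next level unchanged.
        remaining = pairs + (remaining % 2)
    return cycles, additions


def adder_utilization(n_values, latency):
    """Fraction of cycles in which the adder accepts a new operand pair."""
    cycles, additions = level_by_level_cycles(n_values, latency)
    return additions / cycles if cycles else 0.0
```

Under these assumptions, accumulating 16 values on a 10-stage adder takes 51 cycles for only 15 additions, so the adder accepts new operands in fewer than 30% of the cycles. The upper levels, where the number of operand pairs is far below the pipeline depth, account for most of the idle time.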