1. Field of the Invention
The present invention relates, in general, to improving processing efficiency in reconfigurable hardware. More specifically, the invention relates to optimizing the compilation of algorithms on reconfigurable hardware to reduce the time required to execute a program on the hardware.
2. Relevant Background
As microprocessors continue to increase rapidly in processing power, they are used more often to perform computationally intensive calculations that were once done exclusively by supercomputers. However, there are still computationally intensive tasks, including image processing and hydrodynamic simulations, that can take significant amounts of time to execute on modern microprocessors.
Paralleling the progress of microprocessors, reconfigurable hardware such as field programmable gate arrays (FPGAs) has advanced in circuit density and ease of reprogramming, among other areas. Originally developed as simple logic for interconnecting components in digital systems, reconfigurable hardware has become so easy to reprogram that it can be used today as reconfigurable logic for executing a program.
A number of advantages may be realized when the reconfigurable hardware can be reprogrammed to meet the needs of individual programs. For example, the reconfigurable hardware may be programmed with a logic configuration that has more parallelism and pipelining characteristics than a conventional microprocessor. Also, the reconfigurable hardware may be programmed with a custom logic configuration that is very efficient for executing the tasks assigned by the program. Furthermore, dividing a program's processing requirements between the microprocessor and the reconfigurable hardware may increase the overall processing power of the computer.
Unfortunately, an important stumbling block for users who may wish to take advantage of reconfigurable hardware is the difficulty of programming the hardware. Conventional methods of programming reconfigurable hardware have included the use of hardware description languages (HDLs), low-level languages that require digital circuit expertise as well as explicit handling of timing.
Progress has been made in the development of technology for compiling conventional high-level languages to reconfigurable hardware. However, existing compilers that compile the algorithms written in these high-level languages still benefit from optimization to get the reconfigurable hardware to process data in the most efficient way possible.
One performance limit comes from the time required when reconfigurable hardware reads data elements from a source array in memory located outside the hardware. This limit is observed when, for example, a compute-intensive algorithm consists of loops that operate over a multi-dimensional source array located outside the reconfigurable hardware, where each iteration of a loop computes on a rectangular sub-array or stencil of the source array.
For example, in a conventional windowed loop the elements of the source array are stored in a memory external to the reconfigurable hardware and are accessed by the hardware at a rate of one cell value per clock cycle. Thus, when the windowed array is a 3×3, two-dimensional array, nine clock cycles are needed to read the nine values of the array into the reconfigurable hardware. If the source array is a two-dimensional array of size Si×Sj, and the windowed array is size Wi×Wj, then the number of clock cycles needed to run the loop may be represented as:

(Si − (Wi − 1)) × (Sj − (Wj − 1)) × (Wi × Wj) + L

where L is the pipeline depth of the loop.
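The cycle count above can be evaluated with a short sketch. The function name `windowed_loop_cycles`, the 100×100 source array, and the pipeline depth of 20 are illustrative assumptions, not values taken from this description.

```python
def windowed_loop_cycles(si, sj, wi, wj, pipeline_depth):
    """Clock cycles for a conventional windowed loop that reads one
    cell value per clock cycle (hypothetical helper for illustration)."""
    if wi > si or wj > sj:
        raise ValueError("window must fit inside the source array")
    # Number of window positions: (Si - (Wi - 1)) x (Sj - (Wj - 1))
    window_positions = (si - (wi - 1)) * (sj - (wj - 1))
    # Each position rereads every cell of the window: Wi x Wj
    reads_per_window = wi * wj
    return window_positions * reads_per_window + pipeline_depth


# Assumed example: 100x100 source array, 3x3 window, pipeline depth L = 20
print(windowed_loop_cycles(100, 100, 3, 3, 20))  # prints 86456
```

In this assumed case the loop visits 98 × 98 = 9,604 window positions and spends nine cycles on each, so the read traffic, not the pipeline depth, dominates the total.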
Accordingly, significant efficiencies can be realized by reducing the number of times that a data element from outside the reconfigurable hardware has to be reread into the hardware. Moreover, efficiencies can be realized by eliminating intermediate steps in processing the data that involve writing data to memory outside the reconfigurable processor and later reading the data back into the hardware.
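To illustrate how often a conventional windowed loop rereads the same data element, the following sketch counts the reads of each cell of a small source array. The function name `read_counts` and the 5×5 array with a 3×3 window are assumptions chosen for illustration. The comparison against a single read per element is likewise only a reference point, not a statement of what any particular implementation achieves.

```python
def read_counts(si, sj, wi, wj):
    """Count how many times each source cell is read when every window
    position rereads its full Wi x Wj window (hypothetical helper)."""
    counts = [[0] * sj for _ in range(si)]
    for i in range(si - wi + 1):          # window row positions
        for j in range(sj - wj + 1):      # window column positions
            for di in range(wi):
                for dj in range(wj):
                    counts[i + di][j + dj] += 1
    return counts


counts = read_counts(5, 5, 3, 3)
total_reads = sum(sum(row) for row in counts)
print(counts[0][0])   # corner cell: 1 read
print(counts[2][2])   # center cell: 9 reads
print(total_reads)    # 81 reads in total for 25 distinct cells
```

In this small case the loop performs 81 external reads to obtain 25 distinct values, so most reads fetch an element that was already read for an earlier window position.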