1. Field of the Invention
The invention relates generally to electronic logic circuits, and more particularly to systems and methods for processing data using an array of data processing elements that are coupled together using a variable precision interconnect.
2. Related Art
As computer technologies have advanced, the amount of processing power and the speed of computer systems has increased. The speed with which software programs can be executed by these systems has therefore also increased. Despite these increases, however, there has been a continuing desire to make software programs execute faster.
The need for speed is sometimes addressed by hardware acceleration. Conventional processors re-use the same hardware for each instruction of a sequential program. Frequently, programs contain critical code in which the same or similar sections of software are executed many times relative to most other sections in an application. To accelerate a program, additional hardware is added to provide hardware parallelism for the critical code fragments of the program. This gives the effect of simultaneous execution of all of the instructions in the critical code fragment, depending on the availability of data. In addition, it may be possible to unroll iterative loops so that separate iterations are performed at the same time, further accelerating the software.
While there is a speed advantage to be gained, it is not free. Hardware must be designed specifically for the software application in question. The implementation of a function in hardware generally takes a great deal more effort and resources than implementing it in software. Initially, the hardware architecture to implement the algorithm must be chosen based on criteria such as the operations performed and their complexity, the input and output data format and throughput, storage requirements, power requirements, cost or area restrictions, and other assorted criteria.
A simulation environment is then set up to provide verification of the implementation based on simulations of the hardware and comparisons with the software. A hardware target library is chosen based on the overall system requirements. The ultimate target may be an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or other similar hardware platform. The hardware design then commences using a hardware description language (HDL), the target library, and the simulation environment. Logic synthesis is performed on the HDL design to generate a netlist that represents the hardware based on the target library.
While there are number of complex and expensive design tools employed throughout the process, frequent iterations are typically needed in order to manage tradeoffs, such as between timing, area, power and functionality. The difficulty of the hardware design process is a function of the design objectives and the target library. The continued advances in semiconductor technology continue to raise the significance of device parameters with each new process generation. That, coupled with the greater design densities that are made possible, ensures that the hardware design process will continue to grow in complexity over time.
This invention pertains to the implementation of algorithms in hardware—hardware that performs logic or arithmetic operations on data. Currently available methodologies range from using single processors, arrays of processors, either fixed (gate array) or field-programmable gate arrays (FPGA), or standard cell (ASIC) or full custom design techniques. Some designs may combine elements of more than one methodology. For example, a processor may incorporate a block of field programmable logic.
When comparing different implementations of programmable logic, the notion of granularity is sometimes used. It relates to the smallest programmable design unit for a given methodology. The granularity may range from transistors, through gates and more complex blocks, to entire processors. Another consideration in comparing programmable hardware architectures is the interconnect arrangement of the programmable elements. They may range from simple bit-oriented point-to-point arrangements, to more complex shared buses of various topologies, crossbars, and even more exotic schemes.
Full custom or standard cell designs with gate-level granularity and dense interconnects offer excellent performance, area, and power tradeoff capability. Libraries used are generally gate and register level. Design times can be significant due to the design flow imposed by the diversity of complex tools required. Verification after layout for functionality and timing are frequently large components of the design schedule. In addition to expensive design tools, manufacturing tooling costs are very high and climbing with each new process generation, making this approach only economical for either very high margin or very high volume designs. Algorithms implemented using full custom or standard cell techniques are fixed (to the extent anticipated during the initial design) and may not be altered.
The design methodology for fixed or conventional gate arrays is similar to that of standard cells. The primary advantages of conventional gate arrays are time-to-market and lower unit cost, since individual designs are based on a common platform or base wafer. Flexibility and circuit density may be reduced compared to that of a custom or standard cell design since only uncommitted gates and routing channels are utilized. Like those built with custom or standard cell techniques, algorithms implemented using conventional gate arrays are fixed and may not be altered after fabrication.
FPGAs, like conventional gate arrays, are based on a standard design, but are programmable. In this case, the standard design is a completed chip or device rather than subsystem modules and blocks of uncommitted gates. The programmability increases the area of the device considerably, resulting in an expensive solution for some applications. In addition, the programmable interconnect can limit the throughput and performance due to the added impedance and associated propagation delays. FPGAs have complex macro blocks as design elements rather than simple gates and registers. Due to inefficiencies in the programmable logic blocks, the interconnect network, and associated buffers, power consumption can be a problem. Algorithms implemented using FPGAs may be altered and are therefore considered programmable. Due to the interconnect fabric, they may only be configured when inactive (without the clock running). The time needed to reprogram all of the necessary interconnects and logic blocks can be significant relative to the speed of the device, making real-time dynamic programming unfeasible.
Along the continuum of hardware solutions for implementing algorithms lie various degrees of difficulty or specialization. This continuum is like an inverted triangle, in that the lowest levels require the highest degree of specialization and hence represent a very small base of potential designers, while the higher levels utilize more generally known skills and the pool of potential designers grows significantly (see Table 1.) Also, it should be noted that lower levels of this ordering represent lower levels of design abstraction, with levels of complexity rising in higher levels.
TABLE 1Designer bases of different technologies
There is therefore a need for a technology to provide software acceleration that offers the speed and flexibility of an ASIC, with the ease of use and accessibility of a processor, thus enabling a large design and application base.