Special-purpose architectures have long been used to achieve higher performance at a lower cost than general-purpose processors. But as general-purpose processors have become faster and cheaper, special-purpose architectures have been relegated to a shrinking set of niche applications. By definition, an application-specific architecture speeds up only one application. This inflexibility, combined with high design cost, makes special-purpose architectures unattractive except for very well-defined and widespread applications such as video processing and graphics.
Configurable computing was developed as an attempt to reverse this trend. The goal of configurable computing is to achieve most of the performance of custom architectures while retaining most of the flexibility of general-purpose computing. This is done by dynamically constructing a custom architecture from an underlying structure of configurable circuitry. Although the concept of configurable computing is very attractive, success has been remarkably hard to achieve in practice.
Most current custom computing machines are constructed from field-programmable gate arrays (FPGAs). These FPGAs contain logic blocks that can be configured to compute arbitrary functions, and configurable wiring that can connect the logic blocks, together with registers, into arbitrary circuits. Because FPGAs deal with data at the single-bit level, they are considered fine-grained. The information that configures an FPGA can be changed quickly, so a single FPGA can implement different circuits at different times. FPGAs would thus appear to be ideally suited to configurable computing.
Unfortunately, the fine-grained circuit structure that makes FPGAs so general carries a very high cost in density and performance. Compared to general-purpose processors (including digital signal processors), which use highly optimized functional units that operate in bit-parallel fashion on long data words, FPGAs are somewhat inefficient for performing logical operations and even worse for ordinary arithmetic. FPGA-based computing has an advantage only for complex bit-oriented computations such as count-ones, find-first-one, or complicated bit-level masking and filtering. Depending on the circuit being constructed, this cost/performance penalty can range from a factor of 20 for random logic to well over 100 for structured circuits like arithmetic logic units (ALUs), multipliers, and memory. Further, programming an FPGA-based configurable computer is akin to designing an application-specific integrated circuit (ASIC): the programmer either uses synthesis tools that deliver poor density and performance, or designs the circuit manually, which requires both intimate knowledge of the FPGA architecture and substantial design time. Neither alternative is attractive, particularly for simple computations that can be described in a few lines of a high-level language. Thus, custom computing based on FPGAs is unlikely to be competitive for applications that involve heavy arithmetic computation.
Other known custom computing machines are constructed from systolic arrays that implement systolic algorithms. For example, see U.S. Pat. No. 4,493,048. Programmable systolic arrays are a form of configurable computing in which the individual processing elements are programmed to perform the computation required by the systolic algorithm, allowing a single programmable systolic array to execute a variety of systolic algorithms. Using programmable processing elements to achieve flexibility, however, is expensive and limits the performance that can be achieved. The program for each processing element must be stored in the element and executed, and this execution involves fetching the next instruction and using it to perform the appropriate operation. The computational structure of the element is fixed and must be able to execute all the operations required by the algorithms to be run. This flexibility means that the structure does not perform the operations in the most efficient way. For example, if a complex operation is to be performed in a single systolic cycle, the complex operation must be broken down into several instructions requiring several clock cycles. In a hard-wired array, the operation could be performed in a single clock cycle by circuitry capable of executing the entire operation.
In a variety of pipelined computations, certain dataflow is constant over the entire computation while other dataflow changes during the computation. It would therefore be desirable to factor out the static computation and generate an instruction set for only the remaining dynamic control. However, applying dynamic control to a conventional systolic array would entail a prohibitive amount of computational overhead for generating and applying the dynamic control signals. Therefore, there is an unmet need in the art for a configurable computing architecture that can be reconfigured with a minimum amount of dynamic control.