Computing devices are becoming ubiquitous, and many electronic devices can now be found among the objects people carry in everyday life: mobile phones, personal digital assistants, and portable audio players.
These objects have been enabled by embedded processors, which follow the same computing paradigm, known as the von Neumann architecture. As embedded devices become more complex, they require ever faster clock frequencies and consume ever more power. This is because conventional processors execute instructions sequentially and also fetch data sequentially. For battery-powered devices the von Neumann computing paradigm cannot be sustained, and alternatives must be found.
Recently there has been great interest in more parallel architectures to face the demanding computational needs of multimedia and communications algorithms. Application specific integrated circuits (ASICs) have been used to increase the number of operations done in parallel in critical parts of the algorithms, thus avoiding increasing the clock frequency and therefore keeping the energy consumption within practical limits. However, ASICs have long development times, and once fabricated they cannot be changed. This is incompatible with fast changing market dynamics and the short lifespan of modern electronics.
Programmable solutions are therefore more desirable, and this is how the technology of Reconfigurable Computing came into existence. A reconfigurable computer is a machine whose architecture can be changed after fabrication (post-silicon) by changing the contents of configuration memories. The essential element of a reconfigurable computer is a programmable multiplexer (FIG. 1). The programmable multiplexer has inputs A and B, an output C and a configuration bit S. If S is set to 0, a path is created from A to C; if S is set to 1, a path is created from B to C. Having enough programmable multiplexers enables functional units and memory elements to be interconnected at will, creating different hardware architectures on the fly to better execute different algorithms. The present invention is a template for deriving a class of reconfigurable architectures.
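The behaviour of the programmable multiplexer can be illustrated with a minimal behavioural sketch. The function names mux2 and configurable_unit, and the choice of an adder and a multiplier as the two candidate functional units, are assumptions made for illustration only and are not part of the architecture described here.

```python
def mux2(a, b, s):
    """Programmable 2:1 multiplexer: S=0 routes A to C, S=1 routes B to C."""
    if s not in (0, 1):
        raise ValueError("configuration bit S must be 0 or 1")
    return a if s == 0 else b


def configurable_unit(x, y, s):
    """Hypothetical datapath: one configuration bit selects which
    functional unit drives the output."""
    adder_out = x + y        # candidate functional unit 1
    multiplier_out = x * y   # candidate functional unit 2
    return mux2(adder_out, multiplier_out, s)


# Changing only the configuration bit changes the function computed.
sum_result = configurable_unit(3, 4, 0)
product_result = configurable_unit(3, 4, 1)
```

In this sketch the same hardware computes either a sum or a product depending solely on the stored configuration bit, which is the sense in which the architecture is changed by rewriting a configuration memory.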
Existing reconfigurable architectures can be divided into two main kinds: (1) fine-grain arrays and (2) coarse-grain arrays.
Fine-grain arrays have gained widespread popularity in the form of Field Programmable Gate Arrays (FPGAs). An FPGA is a large array of small programmable functional units, which perform logic functions on narrow bit slices, interconnected by a large network of programmable switches. The functional units are essentially programmable Look-Up-Tables (LUTs) and the network of switches consists of the programmable multiplexers described above. Commercial FPGA devices are available through companies like Xilinx, Altera, Actel, Lattice, etc. Although FPGAs enable creating circuits on demand by electrical programming, the rich array of LUTs and routing switches represents a huge area and power penalty: the same circuits implemented in dedicated hardware would be much smaller and less energy hungry. Therefore, the use of FPGAs in battery-operated devices has been the exception rather than the rule.
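To make the notion of a LUT-based functional unit concrete, the following is a minimal behavioural sketch of a k-input, single-output LUT. The class name LookUpTable and the bit ordering of the address are assumptions for illustration; commercial FPGA LUTs differ in size and detail.

```python
class LookUpTable:
    """A k-input, 1-output LUT: 2**k configuration bits form its truth table."""

    def __init__(self, num_inputs, config_bits):
        if len(config_bits) != 2 ** num_inputs:
            raise ValueError("a k-input LUT needs exactly 2**k configuration bits")
        self.num_inputs = num_inputs
        self.config_bits = list(config_bits)

    def evaluate(self, *inputs):
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        # The input bits form the address into the configuration memory;
        # inputs[0] is taken as the least significant address bit.
        address = sum((bit & 1) << position for position, bit in enumerate(inputs))
        return self.config_bits[address]


# The same structure implements different logic functions
# depending only on the configuration bits loaded into it.
xor_lut = LookUpTable(2, [0, 1, 1, 0])
and_lut = LookUpTable(2, [0, 0, 0, 1])
```

Each additional LUT input doubles the number of configuration bits, which gives a rough intuition for the area penalty of fine-grain fabrics.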
FPGAs have been combined with standard processors and specific blocks such as multipliers and embedded RAMs in order to mitigate the huge circuit areas required and improve performance. In this way, only the more specific and critical parts of the algorithms are run on the reconfigurable fabric, whereas other less critical parts are run on the embedded processors. Examples of such hybrid architectures have been proposed by some researchers [4-16-11] and introduced in the market by FPGA vendors. However, these circuits are still wasteful in terms of silicon area and slow in terms of clock frequencies and configuration times.
Coarse-grain arrays overcome the limitations of fine-grain arrays mentioned above at the cost of reduced flexibility and generality. Coarse-grain arrays have been the object of recent research, with quite a few architectures being proposed by researchers [3-6-12-5-7-8-10-9-13-2-14-17] and startup companies [18-19]. These arrays have functional units of higher granularity and less complex interconnection networks to better target DSP applications such as multimedia and communications. The functional units normally perform arithmetic and logic operations on words of a few bytes rather than on slices of a few bits. The result is a less general but much more compact and faster reconfigurable system, requiring small amounts of configuration data, which can be quickly and partially swapped at run time.
Another important aspect is how reconfigurable units are coupled with embedded microprocessors. Initially, reconfiguration began at the processor functional unit level and was triggered by special instructions [15-12-1]. Later, reconfigurable units became coprocessors tightly coupled with processors, still requiring special instructions in order to work [4-16-3-6-12]. More recently, coprocessors attached to system buses and requiring no extensions of the host processor instruction set have become a major research topic [2-7-17]. Our work fits into this last category.
The work in [2] presents a self-timed, asynchronous, data-driven implementation which, given the difficulties of the timing scheme adopted, needed a full-custom silicon implementation and is therefore somewhat impractical to use in a standard-cell-based technology. The architecture features two address generation processors, which run microcode instructions to create the needed sequence of memory addresses.
The architecture in [7] uses undifferentiated 8-bit functional units, including LUT-based multipliers, which are difficult to scale to 16-bit or 32-bit data words used in most multimedia and communications applications. The hierarchical interconnection scheme is structured enough to facilitate compilation. However, this work represents a single architecture design rather than an architecture template adaptable and scalable for various applications.
The work closest to ours is the one described in [17]: an architecture template consisting of an array of coarse-grain functional units interconnected to a set of embedded memories and address generation modules. The address generation modules are implemented with cascaded counters which feed a series of arithmetic and logic units (ALUs) and multipliers for the generation of complex address sequences. A set of delay lines synchronizes the control of functional units and memory operations.
In our approach the address generation blocks are implemented with programmable accumulators, reducing the complexity of the hardware compared to using ALUs and multipliers. Instead of replicating delay lines for synchronization, we use a single delay line and multiple counters with programmable wrap-around times to generate groups of enable signals with different delays. In this way, the generation of some addresses can be delayed relative to others, enabling the execution of loop bodies expressed by unbalanced pipeline graphs. The enable signals accompany the data signals through each functional unit, so they arrive with the needed delay at the next functional unit.
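As an illustration only, the behaviour of an accumulator-based address generator whose enable is delayed relative to another can be modelled as follows. The function name accumulator_addresses and its parameters (start, increment, count, delay) are hypothetical and do not describe the actual configuration registers of the architecture; the sketch also omits the wrap-around counters and the delay line themselves.

```python
def accumulator_addresses(start, increment, count, delay=0):
    """Return a list of (cycle, address) pairs.

    The address is produced by repeated addition (an accumulator),
    and the first address is emitted `delay` cycles after cycle 0,
    modelling an enable signal that arrives late.
    """
    address = start
    schedule = []
    for step in range(count):
        schedule.append((delay + step, address))
        address += increment  # one addition per enabled cycle, no multiplier
    return schedule


# Hypothetical loop body: operands are read with stride 2 from cycle 0,
# while results are written 3 cycles later to consecutive addresses.
read_schedule = accumulator_addresses(start=0, increment=2, count=4)
write_schedule = accumulator_addresses(start=16, increment=1, count=4, delay=3)
```

Here the write address stream starts three cycles after the read stream, which corresponds to a pipeline in which the result of each iteration becomes available three cycles after its operands are fetched.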
Our approach explicitly structures the interconnection networks (partial crossbars) to facilitate the operation of our programming tool. In fact the architecture template and the programming tool have been co-designed to avoid creating hardware structures whose programming is difficult or intractable to automate.
We also consider data sources and data sinks which are not necessarily the data inputs and outputs of an embedded processor. The origin and destination of the data may be any piece of hardware in the system, not necessarily synchronous to the system clock. For that purpose we provide an interface simpler than processor buses, and we use asynchronous FIFOs to connect the core to other cores running at a different clock speed.