Although processor speeds have progressively increased, the need for increased computing power remains unabated. For example, smart phones now burden their processors with a bewildering variety of tasks. But a single-core processor can only accommodate so many instructions at a given time. Thus, it is now common to provide multi-core or multi-threaded processors that can process sets of instructions in parallel. But such instruction-based architectures must always battle the limits imposed by die space, power consumption, and complexity when attempting to decrease the instruction processing time.
As compared to the use of a programmable processing core, there are many algorithms that can be more efficiently processed in dedicated hardware. For example, image processing involves substantial parallelism and processing of pixels in groups through a pipeline of processing steps. If the algorithm is then mapped to hardware, the implementation takes advantage of this symmetry and parallelism. But designing dedicated hardware is expensive and also cumbersome in that if the algorithm is modified, the dedicated hardware must be redesigned.
To provide an efficient compromise between instruction-based architectures and dedicated hardware approaches, a reconfigurable instruction cell array (RICA) architecture has been developed. FIG. 1A illustrates an example RICA system 50 having a reconfigurable core 1. In RICA 50, a plurality of instruction cells 2 such as adders (ADD), multipliers (MUL), registers (REG), logic operation shifters (SHIFT), dividers (DIV), data comparators (COMP), logic gates (LOGIC), and logic jump cells (JUMP) are interconnected through a programmable switching fabric 4. The configuration of instruction cells 2 with regard to the logical function or instruction they implement can be reprogrammed as necessary to implement a given algorithm or function. Switching fabric 4 is reprogrammed accordingly as well. Instruction cells 2 include memory interface cells 12 that interface data for remaining ones of the instruction cells 2 as retrieved from or loaded into a data memory 8. The resulting processing by instruction cells 2 occurs according to configuration instructions 10 obtained from a configuration RAM 6. A decode module 11 decodes instructions 10 to obtain the configuration data not only for instruction cells 2 but also for switching fabric 4. RICA 50 interfaces with external systems through I/O ports 16 and specialized instruction cell registers 14. Additional features shown in FIG. 1A are described in U.S. Patent Publication No. 2010/0122105, filed Apr. 28, 2006, the contents of which are hereby incorporated by reference in their entirety.
It is conventional to arrange the instruction cells in a reconfigurable array by rows and columns. Each instruction cell, any associated register, and an associated input and output switching fabric for the instruction cell may be considered to reside within a switch box. FIG. 1B shows an example array of switch boxes arranged in rows and columns. A datapath formed between selected switch boxes is carried on selected channels from a plurality of channels. The channels are also arranged in rows and columns matching the rows and columns for the switch boxes. Each channel has a certain width in bits. The row directions may be considered to run east and west whereas the column directions run north and south. A datapath beginning in an instruction cell in an initial switch box 100 routes on an output channel 101 in an east row direction. The routing for the datapath from subsequent switch boxes is in the appropriate east/west row direction or north/south column direction such that a final switch box 105 at some selected row and column position is reached. In this example datapath, two instruction cells are configured as arithmetic logic units (ALUs) 110. The instruction cells for the remaining switch boxes are not shown for illustration clarity. Each switch box includes two switch matrices or fabrics: an input switch fabric to select the channel inputs to its instruction cell and an output switch fabric to select the channel outputs from the switch box.
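The row and column routing described above can be sketched as a walk across a grid of switch boxes. This is only an illustrative model: the function name `trace_datapath`, the (row, column) addressing, the convention that north decreases the row index, and the example hops are all assumptions and do not come from FIG. 1B.

```python
# Hypothetical model: switch boxes addressed by (row, column).
# Assumed convention: north decreases the row index, east increases the column index.
STEP = {"E": (0, 1), "W": (0, -1), "N": (-1, 0), "S": (1, 0)}


def trace_datapath(start, hops, rows, cols):
    """Return the switch-box positions visited by a datapath, hop by hop."""
    row, col = start
    path = [(row, col)]
    for direction in hops:
        d_row, d_col = STEP[direction]
        row, col = row + d_row, col + d_col
        if not (0 <= row < rows and 0 <= col < cols):
            raise ValueError(f"hop {direction!r} leaves the array at {(row, col)}")
        path.append((row, col))
    return path


# A datapath that leaves its initial switch box heading east, then turns
# north and east as needed until it reaches the final switch box.
path = trace_datapath(start=(3, 0), hops=["E", "E", "N", "E", "N"], rows=4, cols=4)
print(path[-1])
```

Each hop corresponds to one switch box selecting an output channel in one of the four directions; the sketch ignores channel selection and channel width.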
In contrast to an instruction cell, the logic block in a field programmable gate array (FPGA) uses lookup tables (LUTs). For example, suppose one needs an AND gate in the logic operations carried out in a configured FPGA. A corresponding LUT would be programmed with the truth table for the AND gate logical function. But an instruction cell is much “coarser-grained” in that it contains dedicated logic gates. For example, an ALU instruction cell would include assorted dedicated logic gates. It is the function of the ALU instruction cell that is configurable; its primitive logic gates are dedicated gates and thus are non-configurable. For example, a conventional CMOS inverter is one type of dedicated logic gate. There is nothing configurable about such an inverter, so it needs no configuration bits. But the instantiation of an inverter function in an FPGA programmable logic block is instead performed by a corresponding programming of a LUT's truth table. Thus, as used herein, the term “instruction cell” refers to a configurable logic element that comprises dedicated logic gates.
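The distinction between a LUT and a dedicated gate can be illustrated with a minimal software sketch. The names `make_lut`, `and_lut`, and `inverter` are hypothetical, and a two-input LUT is assumed purely for brevity.

```python
def make_lut(truth_table):
    """Fine-grained, FPGA-style element: its behavior is entirely its stored bits."""
    def lut(a, b):
        # The two input bits form an index into the stored truth table.
        return truth_table[(a << 1) | b]
    return lut


# Four configuration bits program a 2-input LUT as an AND gate.
and_lut = make_lut([0, 0, 0, 1])

# The same LUT structure becomes an OR gate with different configuration bits.
or_lut = make_lut([0, 1, 1, 1])


def inverter(a):
    """Dedicated gate: a fixed function that needs no configuration bits."""
    return a ^ 1
```

In this sketch every logic function realized in a LUT costs stored configuration bits, whereas the dedicated inverter costs none, which is the sense in which an instruction cell built from dedicated gates is coarser-grained.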
An ALU instruction cell performs its logical functions on one or more operands. An operand in this context is a received channel input. Depending upon its configuration bits, an ALU instruction cell is configured to perform corresponding logical operations. For example, a first switch box may include an ALU instruction cell configured to add two operands corresponding to two channel inputs. But the same ALU instruction cell may later be updated to subtract the two operands. The operands that result from the logical operation within an instruction cell may be required in another instruction cell. Thus, the output switch fabric in the first switch box may be configured to drive the resulting operands out of the first switch box through corresponding channel outputs. In contrast, each LUT in an FPGA produces a bit; it does not generate words. So the switch fabric in an FPGA is fundamentally different from the switch fabrics in a RICA in that an FPGA's switch fabric is configured to route the bits from the FPGA's LUTs. In contrast, the routing between switch boxes in a RICA is configured to route words as both input channels and output channels. For example, a switch box array may be configured to route 20 channels. Switch boxes in such an embodiment may thus receive 20 input channels from all four directions and drive 20 output channels in the four directions.
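The add-then-subtract reconfiguration described above can be sketched as follows. The class `AluCell`, the two-bit configuration encoding in `OPS`, and the 32-bit word width are assumptions made for illustration only; the source does not specify an encoding or a channel width.

```python
WORD_BITS = 32                 # assumed channel width, not given in the source
MASK = (1 << WORD_BITS) - 1

# Hypothetical configuration encoding: the configuration bits select which
# fixed, dedicated operation the cell applies to its word-wide operands.
OPS = {
    0b00: lambda a, b: (a + b) & MASK,   # add
    0b01: lambda a, b: (a - b) & MASK,   # subtract (wraps modulo 2**WORD_BITS)
}


class AluCell:
    """Coarse-grained cell: a few configuration bits select a whole-word function."""

    def __init__(self, config):
        self.configure(config)

    def configure(self, config):
        if config not in OPS:
            raise ValueError(f"unsupported configuration {config:#04b}")
        self.config = config

    def evaluate(self, a, b):
        # Operands are channel inputs; the result is a word driven onto a channel output.
        return OPS[self.config](a, b)


cell = AluCell(0b00)
total = cell.evaluate(7, 5)        # configured to add
cell.configure(0b01)
difference = cell.evaluate(7, 5)   # same cell, later reconfigured to subtract
```

Note that the cell consumes and produces whole words, in contrast to the single-bit LUT output that an FPGA switch fabric routes.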
Note the advantages of a RICA: since the instruction cells comprise dedicated logic gates, the necessary amount of configuration data is substantially less than the configuration data for a comparable FPGA. The switch boxes may thus be readily reconfigured over a relatively brief delay such that the reconfiguration is effectively real-time to a companion processor. In contrast, the massive amount of configuration data for an FPGA requires considerable delay for its loading into the FPGA. A RICA also has processing speed advantages as compared to software-based implementations in a traditional processor. For example, an algorithm such as image processing that involves processing multiple pixels through a pipelined processing scheme can be mapped to instruction cells in a manner that emulates a dedicated hardware approach. But there is no need to design dedicated hardware. Instead one can merely configure the instruction cells and switching fabrics as necessary. Thus, if an algorithm must be redesigned, there is no need for hardware redesign but instead a user may merely change the configuration data. This is quite advantageous over traditional instruction-based computing approaches.
Although a RICA thus offers robust advantages, challenges remain in its implementation. For example, a number of configuration bits are required for configurable elements within each switch box such as for the configuration of the instruction cell and switching fabrics. Each switch box thus requires storage elements for storing its configuration bits. In one example embodiment, an array of twenty rows and twenty columns (resulting in 400 switch boxes) requires 77 kilobits for its configuration. The circuitry for the loading of so many configuration bits consumes valuable die space and power. In addition, a RICA requires that the latency for the loading of the configuration bits be minimal. In that regard, an instruction cell is not statically programmed in a RICA. For example, an instruction cell can be reconfigured several times during normal operation. It may not need such frequent reprogramming but the capability should be provided. Since other systems such as a microprocessor may be interfacing with a RICA, the latency of the reconfiguration must be minimized to prevent stalls.
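As a rough check on the scale of the example embodiment, the average configuration storage per switch box can be computed directly. The sketch assumes that one kilobit is 1000 bits and that the bits are spread evenly across the array; the source states neither, so the per-box figure is only an average.

```python
ROWS, COLS = 20, 20
TOTAL_CONFIG_BITS = 77_000   # "77 kilobits", assuming 1 kilobit = 1000 bits

switch_boxes = ROWS * COLS
avg_bits_per_box = TOTAL_CONFIG_BITS / switch_boxes

print(switch_boxes, avg_bits_per_box)  # 400 192.5
```

Under these assumptions, each switch box needs storage for roughly 190 configuration bits, all of which must be loaded each time the array is reconfigured.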
Accordingly, there is a need in the art for area-efficient and low-latency configuration schemes for reconfigurable instruction cell arrays.