Although processor speeds have been progressively increased, the need for increased computing power remains unabated. For example, smart phones now burden their processors with a bewildering variety of tasks. But a single-core processor can only accommodate so many instructions at a given time. Thus, it is now common to design systems with multi-core or multi-threaded processors that can process sets of instructions in parallel. But the resulting instruction-based architectures must always battle the limits imposed by die space, power consumption, and complexity with regard to increasing the instruction processing speed.
As compared to the use of a programmable processing core, there are many algorithms that can be more efficiently processed in dedicated hardware. For example, image processing involves substantial parallelism and processing of pixels in groups through a pipeline of processing steps. If the algorithm is then mapped to hardware, the implementation takes advantage of this symmetry and parallelism such that the processing speed is increased over processor-based architectures. But designing dedicated hardware is expensive and also cumbersome in that if the algorithm is modified, the dedicated hardware must be redesigned.
To provide an efficient compromise between instruction-based architectures and dedicated hardware approaches, a reconfigurable instruction cell array (RICA) architecture has been developed. FIG. 1A illustrates an example RICA system 50 having a reconfigurable core 1. In reconfigurable core 1, a plurality of instruction cells 2 such as adders (ADD), multipliers (MUL), registers (REG), logic operation shifters (SHIFT), dividers (DIV), data comparators (COMP), logic gates (LOGIC), and logic jump cells (JUMP) are interconnected through a programmable switch fabric 4. The configuration of instruction cells 2 with regard to the logical function or instruction they implement can be reprogrammed as necessary to implement a given algorithm or function. Switch fabric 4 would be reprogrammed accordingly as well. The plurality of instruction cells 2 includes memory interface cells 12 that interface data, as retrieved from or loaded into a data memory 8, for the remaining instruction cells 2. The resulting processing by instruction cells 2 occurs according to configuration instructions 10 obtained from a configuration RAM 6. A decode module 11 decodes instructions 10 to obtain the configuration data not only for instruction cells 2 but also for switch fabric 4. RICA 50 interfaces with external systems through I/O ports 16 and specialized instruction cell registers 14.
The instruction cells in a reconfigurable array may be arranged by rows and columns. An instruction cell, any associated register, and an associated input and output switching fabric for the instruction cell are denoted herein as a switch box. FIG. 1B illustrates an array of switch boxes arranged in rows and columns. A data path formed between selected switch boxes is carried on channels selected from a plurality of channels. The channel routing is also arranged in rows and columns matching the rows and columns for the switch boxes. Each channel has a certain width in bits. The row directions may be considered to run east and west whereas the column directions run north and south. A data path beginning in an instruction cell in an initial switch box 100 routes on an output channel 101 in an east row direction. The routing for the data path from subsequent switch boxes is in the appropriate east/west row direction or north/south column direction such that a final switch box 105 at some selected row and column position is reached. In this example data path, two instruction cells are configured as arithmetic logic units (ALUs) 110. The instruction cells for the remaining switch boxes are not shown for illustration clarity. Each switch box includes two switch matrices or fabrics: an input switch fabric to select the channel inputs to its instruction cell and an output switch fabric to select the channel outputs from the switch box. Referring back to FIG. 1A, switch fabric 4 represents the collection of each switch box's individual input and output switch fabrics.
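The row and column routing described above may be illustrated with a minimal Python sketch. It is not the routing method of any particular RICA. The function name `route_data_path`, the (row, column) coordinate convention (rows increasing to the south, columns increasing to the east), and the choice to route along the row before the column are all assumptions made for illustration only.

```python
def route_data_path(start, end):
    """Return the hops (row, col, direction) taken from start to end.

    Assumptions for illustration: positions are (row, col) tuples, rows
    increase toward the south, columns increase toward the east, and the
    path travels along the row first and then along the column. An actual
    router would also account for which channels are free.
    """
    row, col = start
    end_row, end_col = end
    hops = []
    # Travel in the east/west row direction until the target column is reached.
    while col != end_col:
        direction = "east" if end_col > col else "west"
        hops.append((row, col, direction))
        col += 1 if direction == "east" else -1
    # Travel in the north/south column direction until the target row is reached.
    while row != end_row:
        direction = "south" if end_row > row else "north"
        hops.append((row, col, direction))
        row += 1 if direction == "south" else -1
    return hops


# Each hop names the switch box that drives an output channel and the
# direction in which that channel is driven.
path = route_data_path((0, 0), (2, 3))
```

In this sketch each hop corresponds to one switch box whose output switch fabric is configured to drive a channel in the stated direction.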
The configuration of a switch box's instruction cell and switch fabrics occurs according to a configuration word received from configuration RAM 6. In this fashion, a RICA may be configured as necessary to perform a desired logical function or algorithm. For example, a RICA may be configured to perform an algorithm such as image processing that involves processing multiple pixels through a pipelined processing scheme. The desired algorithm can be mapped to instruction cells in a manner that emulates a dedicated hardware approach. But there is no need to design dedicated hardware; instead, one can merely program the instruction cells and switching fabric as necessary. Thus, if an algorithm must be redesigned, a user may merely change the programming as necessary instead of having to redesign hardware. This is quite advantageous over traditional instruction-based computing approaches.
In contrast to an instruction cell, the logic block in a field programmable gate array (FPGA) uses lookup tables (LUTs). For example, suppose one needs an AND gate in the logic operations carried out in a configured FPGA. A corresponding LUT would be programmed with the truth table for the AND gate logical function. But an instruction cell is much “coarser-grained” in that it contains dedicated logic gates. In that regard, an ALU instruction cell includes assorted dedicated logic gates. It is the function of the ALU instruction cell that is configurable; its primitive logic gates are dedicated gates and thus are non-configurable. For example, a conventional CMOS inverter is one type of dedicated logic gate. There is nothing configurable about such an inverter, and it needs no configuration bits. But the instantiation of an inverter function in an FPGA programmable logic block is instead performed by a corresponding programming of a LUT's truth table. Thus, as used herein, the term “instruction cell” refers to a configurable logic element that comprises dedicated logic gates.
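The distinction between a LUT and a dedicated gate may be sketched in Python as follows. This is a behavioral illustration only. The names `make_lut`, `AND_TRUTH_TABLE`, `INVERTER_TRUTH_TABLE`, and `dedicated_inverter`, as well as the convention of indexing the truth table by the concatenated input bits, are hypothetical and are not taken from any FPGA or RICA implementation.

```python
def make_lut(truth_table):
    """Build a LUT whose behavior is set entirely by its configuration bits.

    Assumption: the inputs are single bits, and the truth table is indexed
    by the inputs concatenated with the first input as the most
    significant bit.
    """
    def lut(*inputs):
        index = 0
        for bit in inputs:
            index = (index << 1) | bit
        return truth_table[index]
    return lut


# FPGA style: an AND gate exists only because the table is programmed so.
AND_TRUTH_TABLE = [0, 0, 0, 1]
and_lut = make_lut(AND_TRUTH_TABLE)

# FPGA style: an inverter is likewise just a programmed truth table.
INVERTER_TRUTH_TABLE = [1, 0]
inverter_lut = make_lut(INVERTER_TRUTH_TABLE)


def dedicated_inverter(bit):
    """Dedicated gate: the function is fixed and takes no configuration bits."""
    return bit ^ 1
```

Reprogramming `truth_table` changes what a LUT computes, whereas `dedicated_inverter` has nothing to reprogram. An instruction cell would be built from gates of the latter kind, with only its overall function being selectable.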
Although a RICA offers robust advantages as compared to FPGA or dedicated processor architectures, challenges remain in its implementation. For example, it is conventional to arrange an array of switch boxes by rows and columns. The switching fabric in each switch box must then accommodate a data path that might begin at some row and column location and then end at some other row and column location. In this data path, an instruction cell such as an ALU performs its logical functions on one or more operands. An operand in this context is a received channel input. Depending upon its configuration bits, an ALU instruction cell is configured to perform corresponding logical operations. For example, a first switch box may include an ALU instruction cell configured to add two operands corresponding to two channel inputs. But the same ALU instruction cell may later be reconfigured to subtract the two operands. The results from the logical operation within the instruction cell may be required in another instruction cell. Thus, the output switch fabric in the first switch box may be configured to drive the resulting data out of the first switch box through corresponding channel outputs. In contrast, an FPGA's LUTs each produce a bit; they do not generate words. So the switch fabric in an FPGA is fundamentally different from the switch fabrics in a RICA in that an FPGA's switch fabric is configured to route the bits from the FPGA's LUTs. In contrast, the routing between switch boxes in a RICA is configured to route words as both input channels and output channels. For example, a switch box array may be configured to route 20 channels. Switch boxes in such an embodiment may thus receive 20 input channels from all four directions (east and west in the row directions, and north and south in the column directions) and drive 20 output channels in these four directions.
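The reconfiguration of an ALU instruction cell from addition to subtraction may be sketched as follows. The class name `AluCell`, the encoding of the configuration bits (0 for add, 1 for subtract), and the 16-bit channel width are assumptions chosen for illustration; the text above specifies only that each channel has some width in bits.

```python
CHANNEL_WIDTH = 16  # assumed width in bits; not specified in the text
WORD_MASK = (1 << CHANNEL_WIDTH) - 1


class AluCell:
    """Behavioral sketch of an ALU instruction cell operating on words."""

    # Hypothetical configuration encoding: 0 selects add, 1 selects subtract.
    OPERATIONS = {
        0: lambda a, b: a + b,
        1: lambda a, b: a - b,
    }

    def __init__(self, config_bits):
        self.configure(config_bits)

    def configure(self, config_bits):
        """Select the cell's function; its underlying gates stay fixed."""
        self.operation = self.OPERATIONS[config_bits]

    def execute(self, operand_a, operand_b):
        """Combine two channel inputs and wrap the result to the channel width."""
        return self.operation(operand_a, operand_b) & WORD_MASK


alu = AluCell(config_bits=0)   # configured to add
total = alu.execute(5, 3)
alu.configure(config_bits=1)   # later reconfigured to subtract
difference = alu.execute(5, 3)
```

The result of `execute` is a whole word that an output switch fabric would drive onto a channel, in contrast to the single bit produced by a LUT.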
As a RICA performs a desired logical function or algorithm, it is often the case that buffers are necessary to store intermediate results. It would be challenging with regard to routing to enable every switch box to have the ability to directly read from and write to one of the buffers. To alleviate the routing demands, a subset of the switch boxes is configured as master switch boxes that have this direct read and write access. This is not a severe limitation on the remaining non-master switch boxes since if these remaining switch boxes need to read or write data, the read input or write output is readily routed through the switch fabrics in a RICA to or from its master switch boxes. For example, suppose that switch box 105 of FIG. 1B is a master switch box and that switch box 100 is a non-master. Switch box 100 may then write to a buffer (part of data RAM 8 in FIG. 1A) through master switch box 105 using the routing through the intervening switch boxes as shown. Referring again to RICA 50 of FIG. 1A, memory interface cells 12 represent such master switch boxes whereas data RAM 8 represents an array of buffers.
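The relationship between master and non-master switch boxes may be sketched as follows. The class names `MasterSwitchBox` and `NonMasterSwitchBox`, the representation of the buffers as lists of words, and the use of a direct object reference to stand in for the routed path through intervening switch boxes are all assumptions made for illustration.

```python
class MasterSwitchBox:
    """Switch box with direct read and write access to the buffers."""

    def __init__(self, buffers):
        self.buffers = buffers  # assumed model: a list of word lists

    def write(self, buffer_index, address, word):
        self.buffers[buffer_index][address] = word

    def read(self, buffer_index, address):
        return self.buffers[buffer_index][address]


class NonMasterSwitchBox:
    """Switch box with no direct buffer access.

    The reference to a master stands in for a data path routed through
    the switch fabrics of the intervening switch boxes.
    """

    def __init__(self, routed_master):
        self.routed_master = routed_master

    def write(self, buffer_index, address, word):
        # The write output is carried to the master, which performs the write.
        self.routed_master.write(buffer_index, address, word)

    def read(self, buffer_index, address):
        # The read input is carried back from the master.
        return self.routed_master.read(buffer_index, address)


buffers = [[0] * 4 for _ in range(2)]    # two small buffers of four words
master = MasterSwitchBox(buffers)        # plays the role of switch box 105
non_master = NonMasterSwitchBox(master)  # plays the role of switch box 100
non_master.write(buffer_index=1, address=2, word=0xBEEF)
```

Only the master touches the buffers directly, so the buffer wiring need reach only the subset of master switch boxes.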
It is desirable for each master switch box 12 to be able to write a data word to any of buffers 8 or read a word from any of buffers 8. To enable these interconnections, one solution would be to use place and route techniques (full synthesis). But the routing becomes very congested in such a case. Another critical issue is the testing of the buffers. A design-for-test (DFT) RICA implemented with full synthesis becomes unworkable. Moreover, even if the routing is enabled, the resulting testing is very slow.
Accordingly, there is a need in the art for improved DFT features for RICA buffer testing.