A conventional processor (such as, for example, the Pentium II produced by Intel Corp.; Pentium is a trademark of Intel Corp.) is a general-purpose device. It is not optimised for any specific task, but can be programmed to perform a very wide range of functions.
The consequence of the general purpose architecture of the conventional processor is that for specific tasks, the performance of the processor will be much worse than for hardware designed to perform the specific tasks. This is because the architecture of the general purpose processor does not follow the structure of the task, but instead relies on a complex ALU (arithmetic logic unit) which is very heavily used during the task and which makes very frequent calls to its necessarily large memory resources. Where such tasks are computationally intensive, this approach is particularly inappropriate.
If there is a task which will need to be performed on a regular basis, then an appropriate approach will be to provide circuitry optimised specifically for that task. A typical approach is to provide such circuitry in the form of a co-processor or ASIC (application specific integrated circuit) together with the general-purpose processor, so that the tasks for which the co-processor or ASIC is optimised can be routed to the co-processor or ASIC by the general-purpose processor.
Although an ASIC may be optimal for a specific task, as it has been built for one specific task it will generally be poor or entirely non-functional for any other computational task. An advantageous possibility exists between the two extremes: on the one hand, a fixed configuration ASIC, and on the other hand, a conventional processor (for which a configuration in silicon can only be considered to exist for a single cycle). This intermediate possibility is a reconfigurable device: these have a determined configuration but allow for reconfiguration to a different determined configuration when required. Reconfigurable devices thus offer the possibility of a computer which can alter its hardware resources to service its current computational needs by appropriate reconfiguration.
A commercially successful form of reconfigurable device is the field-programmable gate array (FPGA). These devices consist of a collection of configurable processing elements embedded in a configurable interconnect network. Configuration memory is provided to describe the interconnect configuration; often SRAM is used. These devices have a very fine-grained structure: typically each processing element of an FPGA is a configurable gate. Rather than being concentrated in a central ALU, processing is thus distributed across the device and the silicon area of the device is used more effectively. An example of a commercially available FPGA series is the Xilinx 4000 series.
Such reconfigurable devices can in principle be used for any computing application for which a processor or an ASIC is used. However, a particularly suitable use for such devices is as a coprocessor to handle tasks which are computationally intensive, but which are not so common as to merit a purpose-built ASIC. A reconfigurable coprocessor could thus be programmed at different times with different configurations, each adapted for execution of a different computationally intensive task, providing greater efficiency than a general-purpose processor alone without a huge increase in overall cost. In recent FPGA devices, scope is provided for dynamic reconfiguration, wherein partial or total reconfiguration can be provided during the execution of code so that time-multiplexing can be used to provide configurations optimised for different subtasks at different stages of execution of a piece of code.
FPGA devices are not especially suitable for certain kinds of computational task. As the individual computational elements are very small, the datapaths are extremely narrow and many of them are required, so a large number of operations are required in the configuration process. Although these structures are relatively efficient for tasks which operate on small data elements and are regular from cycle to cycle, they are less satisfactory for irregular tasks with large data elements. Such tasks are also often not well handled by a general-purpose processor, yet may be of considerable importance (such as in, for example, image processing).
Alternative reconfigurable architectures have been proposed. One example is the PADDI architecture developed by the University of California at Berkeley, described in D. Chen and J. Rabaey, "A Reconfigurable Multiprocessor IC for Rapid Prototyping of Real Time Data Paths", ISSCC, February 1992 and A. Yeung and J. Rabaey, "A Data-Driven Architecture for Rapid Prototyping of High Throughput DSP Algorithms", IEEE VLSI Signal Processing Workshop, October 1992. This architecture was directed to the prototyping of high speed real-time DSP systems, DSP algorithms providing an example of computation not well handled either by conventional processors or FPGAs. The architecture comprises a plurality of relatively simple processing execution units connected by a reconfigurable network. Each execution unit operates at 16 bit width, has register files for the input operands, and has its own instruction memory. A 53 bit instruction word is necessary to specify the operation of an instruction unit.
In PADDI, instructions are distributed both at configuration and at run time. At configuration time, the memories, which act as control stores, are loaded with a set of instructions. At run time the addresses for all of the control stores are broadcast globally, and each of these local instruction memories retrieves its own local instruction for use by the local execution unit. In operation, communication between processing elements is data driven, and the processing elements act on data according to their local instructions.
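The two-phase instruction distribution described above can be illustrated with a minimal Python sketch. It is not taken from the PADDI papers: the `ExecutionUnit` class, the `broadcast` helper and the three-entry operation table are hypothetical names chosen for illustration, and the real 53 bit instruction encoding and data-driven handshaking are not modelled. Only the 16 bit operating width and the global-address, local-lookup scheme come from the description.

```python
from typing import Callable, Dict, List, Tuple

WIDTH_MASK = 0xFFFF  # execution units operate at 16 bit width

# Hypothetical operation table; it stands in for the real instruction set.
OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: (a + b) & WIDTH_MASK,
    "sub": lambda a, b: (a - b) & WIDTH_MASK,
    "pass": lambda a, b: a & WIDTH_MASK,
}


class ExecutionUnit:
    """One execution unit with its own local control store."""

    def __init__(self, control_store: List[str]) -> None:
        # Configuration time: the control store is loaded with instructions.
        self.control_store = control_store

    def step(self, address: int, a: int, b: int) -> int:
        # Run time: the globally broadcast address selects a local instruction.
        operation = self.control_store[address]
        return OPS[operation](a, b)


def broadcast(units: List[ExecutionUnit], address: int,
              operands: List[Tuple[int, int]]) -> List[int]:
    """Send one global address to every unit; each looks up its own instruction."""
    return [unit.step(address, a, b) for unit, (a, b) in zip(units, operands)]


units = [ExecutionUnit(["add", "sub"]), ExecutionUnit(["pass", "add"])]
cycle0 = broadcast(units, 0, [(3, 4), (10, 1)])  # unit 0 adds, unit 1 passes
cycle1 = broadcast(units, 1, [(3, 4), (10, 1)])  # unit 0 subtracts, unit 1 adds
```

Because a single address drives every control store, all units advance through their instruction sequences together, even though each may perform a different operation at that address.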
Another alternative architecture is MATRIX, developed at the Massachusetts Institute of Technology and described in Ethan Mirsky and André deHon, "MATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resources", FCCM '96, IEEE Symposium on FPGAs for Custom Computing Machines, Apr. 17-19, 1996, Napa, Calif., USA, and in more detail in André deHon, "Reconfigurable Architectures for General-Purpose Computing", pages 257 to 296, Technical Report 1586, MIT Artificial Intelligence Laboratory. MATRIX is a coarse-grained structure, in which an array of identical 8-bit functional units is interconnected with a configurable network. Each functional unit contains a 256×8-bit memory, an 8-bit ALU with addressable input registers, an output register and a multiplier, and control logic. This architecture is relatively versatile, as it provides the decentralisation of processing of an FPGA while providing a broader datapath and the scope to adjust the instruction stream to what is required for a given application.
The MATRIX structure has advantageous aspects, but the coarse grain size means that it consumes more silicon than a conventional FPGA structure and is likely to be less efficient for tasks which are regular from cycle to cycle. It would therefore be desirable to develop further reconfigurable structures which combine as far as possible the advantages of both MATRIX and conventional FPGAs.
Accordingly, the invention provides a reconfigurable device comprising: a plurality of processing devices; a connection matrix providing an interconnect between the processing devices; and means to define the configuration of the connection matrix; wherein each of the processing devices comprises an arithmetic logic unit adapted to perform a function on input operands and produce an output, wherein said input operands are provided as inputs to the arithmetic logic unit from the interconnect on the same route in each cycle, and wherein means are provided to route the output of a first one of the processing devices to a second one of the processing devices to determine the function performed by the second one of the processing devices.
Unlike MATRIX, this approach involves no addressable input register (and hence no input register file), because input operands are provided from the interconnect on the same route in each cycle. This requires that individual processing devices are used as part of a processing pipeline (a processing device can conceivably return instructions to itself, but only through the interconnect). An individual processing device in MATRIX is thus capable of a fuller range of function than an individual processing device in the reconfigurable device according to the invention. However, this is more than compensated for by the increased number of processing devices for a given area of silicon.
The present approach also does not involve the sacrifice of considerable silicon area to form the control store memory needed for the PADDI architecture: this control store needs to be of significant size in PADDI, and the execution units of PADDI will be of much larger size than those of the present invention for equivalent functionality. The control store will also often be redundant in the PADDI architecture (if the execution unit is only required to perform the same instruction on every cycle). The requirement in PADDI that all control stores are addressed by a single global address prevents different parts of the machine being sequenced in data-dependent ways, or operating on different threads of computation: in the PADDI arrangement, all the execution units must execute in synchronism.
It should be noted that input registers are not necessarily absent from architectures of this type: input registers which are not addressable are consistent with the invention (as input operands are still received on the same route in each cycle and the ALUs must be used in a processing pipeline). However, in a preferred embodiment none of the processing devices contains an input register of any kind, so input operands are received directly from the interconnect by the arithmetic logic unit.
The processing devices need configuration to perform appropriate functions, and at least some measure of dynamic instruction provision is to be provided. An advantageous solution is that each of the processing devices has a first plurality of configuration bits which can be determined by the output of another one of the processing devices and a second plurality of configuration bits which cannot be determined by the output of another one of the processing devices.
In a preferred embodiment, each of the processing devices has a first operand input, a second operand input, a function result output, a carry input and a carry output, wherein the first operand input, the second operand input and the function result output are n-bit, where n is an integer greater than 1, and the carry input and the carry output are 1-bit. A particularly good design solution is found when n is equal to 4.
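The interface described above (two n-bit operand inputs, an n-bit result, and 1-bit carry in and out, with n equal to 4) can be illustrated by a small Python sketch of an addition. The function names `alu_add` and `add_wide` are hypothetical, and chaining the carry output of one device into the carry input of the next is an assumed use of the carry ports to build a wider datapath, not a statement of how the device is actually wired.

```python
N = 4                      # operand width of one processing device
MASK = (1 << N) - 1        # 0xF for n = 4


def alu_add(a: int, b: int, carry_in: int) -> tuple:
    """Add two n-bit operands plus a 1-bit carry; return (n-bit result, 1-bit carry)."""
    total = (a & MASK) + (b & MASK) + (carry_in & 1)
    return total & MASK, total >> N


def add_wide(a: int, b: int, devices: int) -> tuple:
    """Chain several n-bit devices through their carry ports (assumed wiring)."""
    result, carry = 0, 0
    for i in range(devices):
        a_slice = (a >> (N * i)) & MASK
        b_slice = (b >> (N * i)) & MASK
        partial, carry = alu_add(a_slice, b_slice, carry)
        result |= partial << (N * i)
    return result, carry


single = alu_add(0xC, 0xB, 0)       # 12 + 11 = 23: result 0x7, carry out 1
wide = add_wide(0x9C, 0x7B, 2)      # 0x9C + 0x7B = 0x117: result 0x17, carry out 1
```

With two chained devices the sketch performs an 8-bit addition, which shows why a 1-bit carry path is sufficient alongside the 4-bit operand paths.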
In a preferred embodiment the mechanism for dynamic instruction is that each of the processing devices is adapted to receive, for determination of its function, an n-bit instruction input from another of the processing devices.
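As a rough illustration of this mechanism, the Python sketch below lets the 4-bit output of one processing device serve as the instruction input of another. The opcode assignments in `FUNCTIONS` and the function name `alu` are hypothetical and chosen only for the example; the source does not specify an instruction encoding.

```python
MASK = 0xF  # n = 4

# Hypothetical 4-bit opcode table; the actual encoding is not given in the text.
FUNCTIONS = {
    0x0: lambda a, b: a & b,   # AND
    0x1: lambda a, b: a | b,   # OR
    0x2: lambda a, b: a ^ b,   # XOR
    0x3: lambda a, b: a + b,   # ADD (result truncated to n bits)
}


def alu(instruction: int, a: int, b: int) -> int:
    """Apply the function selected by the n-bit instruction input."""
    return FUNCTIONS[instruction & MASK](a & MASK, b & MASK) & MASK


# The first device computes a value that is routed, via the interconnect,
# to the instruction input of the second device.
first_output = alu(0x0, 0x3, 0x2)                # AND: 0x3 & 0x2 = 0x2
second_output = alu(first_output, 0b1100, 0b1010)  # 0x2 selects XOR
```

Here the result computed by the first device (0x2) is interpreted by the second device as "XOR", so the function of the second device depends on data produced elsewhere in the array.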
A further advantageous way to provide dynamic instruction is by provision of means to allow the carry input to one of the processing devices to change the function of the arithmetic logic unit of that processing device (for example to allow the carry input to change the function of the arithmetic logic unit to its logical complement). However, for versatile operation, it is also advantageous that means are provided for each of the processing devices to hold the carry input as a constant value. A further advantageous approach is for a first one of the processing devices to be usable to multiplex between two values of an instruction input to a second one of the processing devices according to the value of the carry input of the first of the processing devices, optionally also such that the carry input of the first of the processing devices can be propagated through the first of the processing devices to the carry input of the second of the processing devices.
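The three carry-based options in this paragraph can be sketched in Python as follows. The names `alu`, `instruction_mux`, `carry_complements` and `held_carry` are hypothetical, and the behaviour shown (bitwise complement of the result when the carry is set, a configuration value overriding the incoming carry, and a pass-through carry on the multiplexer) is one plausible reading of the text rather than a description of the actual circuit.

```python
MASK = 0xF  # n = 4


def alu(func, a, b, carry_in, carry_complements=False, held_carry=None):
    """Evaluate func; optionally let the carry turn it into its logical complement.

    held_carry, when not None, models holding the carry input at a constant
    value, so that the incoming carry is ignored.
    """
    carry = carry_in if held_carry is None else held_carry
    out = func(a, b) & MASK
    if carry_complements and carry:
        out = ~out & MASK
    return out


def instruction_mux(carry_in, instr_if_0, instr_if_1):
    """First device: select one of two instruction values by its carry input.

    The carry is also passed on so it can reach the second device's carry input.
    """
    selected = instr_if_1 if carry_in else instr_if_0
    return selected, carry_in


and_fn = lambda a, b: a & b
plain = alu(and_fn, 0b1100, 0b1010, carry_in=0, carry_complements=True)
complemented = alu(and_fn, 0b1100, 0b1010, carry_in=1, carry_complements=True)
held = alu(and_fn, 0b1100, 0b1010, carry_in=1, carry_complements=True, held_carry=0)
selected, carry_out = instruction_mux(1, 0x2, 0x5)
```

In this sketch a carry of 1 turns AND into NAND, while holding the carry at 0 restores a fixed AND regardless of what arrives on the carry input.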
It is also advantageous that each of the processing devices contains a latchable output register for the function output. This is useful for constructing a "deep" pipeline, where for example it is necessary to perform a number of operations in parallel and synchronise the provision of output from different ALUs.
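A minimal Python sketch of a latchable output register follows. The class name `OutputRegister` and its `enabled` flag are hypothetical, and the model (a one-cycle delay when latching is enabled, a direct pass-through otherwise) is an assumption about what "latchable" implies, made only to show how outputs of different ALUs could be aligned in time.

```python
class OutputRegister:
    """Output stage of one processing device (illustrative model only)."""

    def __init__(self, enabled: bool) -> None:
        self.enabled = enabled   # assumed configuration choice: latch or bypass
        self.state = 0           # value currently held in the register

    def clock(self, alu_output: int) -> int:
        """Advance one cycle and return the value seen on the function output."""
        if not self.enabled:
            return alu_output            # bypass: output follows the ALU directly
        previous, self.state = self.state, alu_output
        return previous                  # latched: output lags by one cycle


latched = OutputRegister(enabled=True)
bypassed = OutputRegister(enabled=False)
stream = [1, 2, 3]
latched_out = [latched.clock(v) for v in stream]
bypassed_out = [bypassed.clock(v) for v in stream]
```

A result that would otherwise arrive one cycle early on a short path can be latched in this way so that it reaches a downstream device in the same cycle as the result from a longer path.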
To allow an individual device to accept or reject dynamic instructions, it is desirable to provide for each of the processing devices a dynamic enable gate to determine whether instructions to determine the function of the arithmetic logic unit are to be accepted dynamically from the interconnect or are to be provided from configuration memory in the processing device. A further advantageous feature for each processing device is a dynamic instruction mask whereby application of the dynamic instruction mask to an instruction received by the processing device enables the instruction to provide both an instruction input to the arithmetic logic unit for determining the function of the arithmetic logic unit and a peripheral circuitry instruction input for control of peripheral circuitry in the processing device.
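The interaction of the dynamic enable gate and the dynamic instruction mask can be sketched in Python as below. The bit-level behaviour is an assumption made for illustration: masked-in bits of the received instruction replace the corresponding stored bits for the arithmetic logic unit, and the remaining received bits are handed to the peripheral circuitry. The function name `resolve_instruction` and its parameters are hypothetical.

```python
MASK = 0xF  # n = 4 instruction width


def resolve_instruction(dynamic_enable: bool, dynamic_instr: int,
                        stored_instr: int, instr_mask: int) -> tuple:
    """Return (ALU instruction, peripheral instruction bits).

    Assumed semantics: with the enable gate closed, the ALU uses only the
    instruction held in configuration memory. With it open, bits selected by
    instr_mask come from the interconnect and the rest from configuration
    memory, while the unselected received bits drive peripheral circuitry.
    """
    if not dynamic_enable:
        return stored_instr & MASK, 0
    alu_instr = (dynamic_instr & instr_mask) | (stored_instr & ~instr_mask & MASK)
    peripheral_bits = dynamic_instr & ~instr_mask & MASK
    return alu_instr, peripheral_bits


static_only = resolve_instruction(False, 0b1010, 0b0101, 0b1100)
split = resolve_instruction(True, 0b1010, 0b0101, 0b1100)
```

With the mask 0b1100, the upper two received bits steer the arithmetic logic unit and the lower two are free to control peripheral circuitry, so one 4-bit instruction serves both purposes.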