The present invention relates to implementation of multipliers in programmable arrays, more particularly in a reconfigurable processor device.
A commercially successful form of reconfigurable device is the field-programmable gate array (FPGA). These devices consist of a collection of configurable processing elements embedded in a configurable interconnect network. Configuration memory is provided to describe the interconnect configurationxe2x80x94often SRAM is used. These devices have a very fine-grained structure: typically each processing element of an FPGA is a configurable gate. Rather than being concentrated in a central ALU, processing is thus distributed across the device and the silicon area of the device is used more effectively. An example of a commercially available FPGA series is the Xilinx 4000 series.
Such reconfigurable devices can in principle be used for any computing application for which a processor or an ASIC is used. However, a particularly suitable use for such devices is as a coprocessor to handle tasks which are computationally intensive, but which are not so common as to merit a purpose built ASIC. A reconfigurable coprocessor could thus be programmed at different times with different configurations, each adapted for execution of a different computationally intensive task, providing greater efficiency than for a general purpose processor alone without a huge increase in overall cost. In recent FPGA devices, scope is provided for dynamic reconfiguration, wherein partial or total reconfiguration can be provided during the execution of code so that time-multiplexing can be used to provide configurations optimised for different subtasks at different stages of execution of a piece of code.
FPGA devices are not especially suitable for certain kinds of computational tasks. As the individual computational elements are very small, the datapaths are extremely narrow and many of them are required, so a large number of operations are required in the configuration process. Although these structures are relatively efficient for tasks which operate on small data elements and are regular from cycle to cycle, they are less satisfactory for irregular tasks with large data elements. Such tasks are also often not well handled by a general purpose processor, yet may be of considerable importance (such as in, for example, image processing).
Alternative reconfigurable architectures have been proposed. One example is the PADDI architecture developed by the University of California at Berkeley, described in D. Chen and J. Rabaey, xe2x80x9cA Reconfigurable Multiprocessor IC for Rapid Prototyping of Real Time Data Pathsxe2x80x9d, ISSCC, February 1992 and A. Yeung and J. Rabaey, xe2x80x9cA Data-Driven Architecture for Rapid Prototyping of High Throughput DSP Algorithmsxe2x80x9d, IEEE VLSI Signal Processing Workshop, October 1992. Another alternative architecture is MATRIX, developed at the Massachussetts Institute of Technology and described in Ethan Mirsky and Andrxc3xa9 deHon, xe2x80x9cMATRIX: A Reconfigurable Computing Architecture with Configurable Instruction Distribution and Deployable Resourcesxe2x80x9d, FCCM ""96xe2x80x94IEEE Symposium on FPGAs for Custom Computing Machines, Apr. 17-19, 1996, Napa, Calif., USA, and in more detail in Andrxc3xa9 deHon, xe2x80x9cReconfigurable Architectures for General-Purpose Computingxe2x80x9d, pages 257 to 296, Technical Report 1586, MIT Artificial Intelligence Laboratory. The MATRIX structure has advantageous aspects, but the coarse grain size means that it consumes more silicon than a conventional FPGA structure and is likely to be less efficient for tasks which are regular from cycle to cycle. It would therefore be desirable to develop further reconfigurable structures which combine as best possible the advantages of both MATRIX and of conventional FPGAs.
A further development of the present applicants, described in European Patent Application No. 97310220.5, filed on Dec. 17, 1997 (with an overall architecture termed xe2x80x9cCHESSxe2x80x9d, and described in International Patent Application No. GB 98/00248, filed on Jan. 28, 1997) describes a reconfigurable device comprising: a plurality of processing devices; a connection matrix providing an interconnect between the processing devices; and means to define the configuration of the connection matrix; wherein each of the processing devices comprises an arithmetic logic unit adapted to perform a function on input operands and produce an output, wherein said input operands are provided as inputs to the arithmetic logic unit from the interconnect on the same route in each cycle, and wherein means are provided to route the output of a first one of the processing devices to a second one of the processing devices to determine the function performed by the second one of the processing devices.
In a preferred form of CHESS, each of the processing devices has a first operand input, a second operand input, a function result output, a carry input and a carry output, wherein the first operand input, the second operand input and the function result output are n-bit, where n is an integer greater than 1, and the carry input and the carry output are 1-bit. A particularly good design solution is found when n is equal to 4. The mechanism used for dynamic instruction is that each of the processing devices is adapted to receive, for determination of its function, an n-bit instruction input from another of the processing devices.
It is also advantageous that each of the processing devices contains a latchable output register for the function output. This is useful for constructing a xe2x80x9cdeepxe2x80x9d pipeline, where for example it is necessary to perform a number of operations in parallel and synchronise the provision of output from different ALUs.
A particularly important issue for all of the architectures described above is the implementation of a multiplier. Multipliers are a key element for many calculations, and many of the applications most suitable for use in an ASIC or coprocessor will contain large number of multiplication operations. A conventional approach to implementation of a multiplier is now described.
A combinational multiplier is commonly built as a repetitive array of core cells, in which each cell multiplies some bits (say M bits) from the multiplicand A with some bits (say N bits) from the multiplier B, to produce an (M+N) bit partial product. To allow a complete multiplier to be built, each core cell also needs to be able to add two additional inputs to the partial product, i.e. to compute the function ((A*B)+C+D). The D input is used to sum all of the partial products of the same significance, and the C input is used to add carries from less significant partial products. The (M+N) bit result from each core cell is divided into two portions:
1. The least significant M bits are fed to the D input of the adjacent core cell that produces a result of the same arithmetic significance.
2. The most significant N bits are fed to the C input of the adjacent core cell that produces a result M bits more significant.
The core cell of a 1 bit*1 bit multiplier can be implemented in one of three ways:
1. As two of 4-input lookup tables (LUTs), each with A, B, C, D as inputs, and each producing one of the two bits of output.
2. As a two input AND gate to compute (A*B), feeding a full adder that adds the result to C and D. This requires one 2-input LUT and two 3-input LUTs.
3. As a full adder to compute (A+C+D), feeding a multiplexer that selects either this result or D to be fed to the output, under the control of B.
Either of these solutions costs more resources than would be needed simply to perform the full addition. Multipliers are hence costly (in terms of silicon area, and hence in terms of actual cost) in FPGA structures. Any approach in devices of this general type which can increase the density of multipliers in a processing array will be highly advantageous in reducing cost.
Accordingly, the invention provides a method of multiplying a first number by a second number by use of an array of processing devices, each of said processing devices having a plurality of data inputs, a plurality of data outputs, and an instruction input for control of the function of the processing device, wherein said processing devices and an input for the first number and an input for the second number are interconnected by a freely configurable interconnect, and wherein each processing device calculates a partial product for multiplication of one or more bits of the first number with one or more bits of the second number, and for each processing device: the value received at the instruction input is determined by one or more bits of the first number; data inputs are provided by m bits of the second number, a summation input to sum all the partial products of the same significance, and, if appropriate, a carry input to add a carry from a less significant partial product; and data outputs are provided as a summation output containing the least significant m bits of the partial product and a carry output containing any more significant bits of the partial product.
The present invention relates to devices and architectures which include a freely connectable interconnect, where it is generally possible (except possibly for certain special cases) for any input and output to be connected to each other. The commonest form of architecture of this general type is a Field Programmable Gate Array (FPGA). Architectures in which certain limited choices of connection between inputs and outputs, or certain inputs and outputs, are available are not considered here to fall within the scope of the term xe2x80x9cfreely connectablexe2x80x9d.
It has been appreciated by the present inventors that for the conventional multiplier design, the required result from each core cell is either (A+C+D) or D, dependent on the value of B. (If B is zero, then (A*B) is zero, and also C becomes zero because B is also zero in the core cell that produces C). In other words a full adder would be sufficient to compute the required result, if its functionality could be controlled according to the value of B. In a conventional FPGA, the function of each logic unit is fixed at configuration time, and such data-dependent control of functionality is not possible. With the structure indicated above, such control is possible, and a dense multiplier is achieved.