Known digital processors for convolution and correlation implemented as bit-level systolic arrays are described in British Patent Application No. 2106287A published on Apr. 7, 1983 (Reference 1), which corresponds to U.S. Pat. No. 4,533,393, and U.S. Pat. No. 4,639,857, in which FIGS. 15 to 20 refer to a convolver. This device consists of a rectangular array of gated full adder logic cells arranged in rows and columns. Each cell is connected to its immediate row and column neighbours only, i.e. each cell is connected to a maximum of four other cells. Cell operation is controlled by clocked latches which effect movement of data, coefficient, carry and sum bits through the array. Each cell evaluates the product of an input data bit and an input coefficient bit received from left-hand and right-hand neighbours respectively, and adds the product to input carry and cumulative sum bits received from the right and above respectively. New carry and cumulative sum bits are generated for output to the left and below, and the input data and coefficient bits pass to the right and left respectively. Each coefficient word is circulated bit serially through a respective array row. Each data word passes through each row in succession and in effect spirals (strictly speaking zig-zags) up the array. Successive carries move with coefficient bits, and successive cumulative sums move down array columns. Data moves in counterflow with respect to both cumulative sum generation and coefficient and carry propagation. Cumulative sum generation is cascaded down array columns to produce partial sums output from the array. Partial sums of like bit significance emerge from the same array column in succession, and are accumulated to form convolution results by full adders arranged for output sum feedback.
A disadvantage of the processors described in Reference 1 is that data and coefficient words must be interspersed with zero bits to avoid generation of unwanted partial products. At any time at least half, and in one case three quarters, of the array cells compute zero partial products. The array is therefore inefficient, and much larger than would be required if interspersed zero bits could be avoided.
A further bit-level systolic array is described in British Patent Application No. 2144245A published Feb. 27, 1985 (Reference 2), which corresponds to U.S. Pat. No. 4,686,645. This relates to an array similar to that of Reference 1 for multiplying two matrices having multi-bit coefficients. This provides for row elements of one matrix to propagate along array rows in counter-flow with column elements of the other, carry bits being recirculated on each cell rather than moving along rows. The use of so-called "guard bands" is described, this being the extension of coefficient words with zero bits to provide for word growth of accumulating results.
British Patent Application No. 2147721A published May 15, 1985 (Reference 3), which corresponds to U.S. Pat. No. 4,701,876, relates to a further bit-level systolic array for matrix-vector multiplication. Improved array efficiency is obtained in two ways. Firstly, array output accumulation is arranged such that parts of the array corresponding to inactive regions in Reference 1 contribute to convolution results. Secondly, the need for zeros between data and coefficient bits is avoided by complex clocking arrangements effecting bit movement in adjacent rows on alternate clock cycles. As in References 1 and 2, multiplicand bits move in counterflow in array rows. As in Reference 2, carry bits are recirculated on each cell and word extension with guard bands is employed.
In the GEC Journal of Research, Vol. 2, No. 1, 1984, R. B. Urquhart and D. Wood introduce the concept of using static coefficients in bit-level systolic arrays. Each cell of an array is associated with a respective single bit of a coefficient, and a coefficient word is associated with a corresponding array row. The cells are arranged for carry bit recirculation. Data is input to each array row and moves along it. Cumulative sum generation is cascaded down array columns and guard bands provide for word growth. Partial sums of like bit significance emerge from different array columns either with relative delays or synchronously according to whether input data meets coefficient bits in ascending or reverse order of bit significance. This arrangement provides 100% cell utilisation or array efficiency without requiring complex clocking arrangements.
Each cell computes products on every clock cycle, and all latches are clocked in the same way. Unfortunately, however, array accumulation as described cannot provide correct convolution or correlation results, since the scheme proposed would produce wrongful accumulation of partial sums and carry bits corresponding to different results.
In the art of digital arithmetic circuits, it is important to provide for standardisation of components if at all possible. This is greatly facilitated if integrated circuits designed for small calculations can be linked together or cascaded in an array to perform a much larger calculation. It is also important, although very rarely achievable, to provide for some degree of fault tolerance in such an array of integrated circuits, in order that a comparatively small fault might not render the array entirely useless. This is of particular importance in the developing field of wafer scale integration, in which wafer yields can be virtually zero without some degree of fault tolerance.
It is an object of the present invention to provide a digital processor for convolution or correlation capable of being cascaded to form a fault tolerant assembly.
The present invention provides a digital processor suitable for performing an N point convolution or correlation of coefficient words of q bits with data words guard band extended to p bits, and wherein:
(1) the processor includes an array of N rows and q columns of one-bit logic cells;
(2) each logic cell is arranged to:
(a) input data, carry and cumulative sum bits,
(b) output the data bit,
(c) produce output cumulative sum and carry bits corresponding to addition of the input cumulative sum and carry bits to the product of the input data bit with a respective stationary coefficient bit, and
(d) recirculate the output carry bit on the respective cell to provide an input carry bit to a succeeding computation;
(3) cell interconnection lines are arranged to provide for propagation of data bits along array rows and for cascading cumulative sum generation down array columns;
(4) the cell interconnection lines include clock-activated first delaying means arranged to effect data bit movement along array rows at the same rate as cumulative sum bit movement down array columns;
(5) the array has a first row, first column cell for data input to the array word and bit serially, least significant bit leading;
(6) last column cells in the first to (N-1)th rows each have a data bit output connected via respective second delaying means to a data bit input of a first column cell in a respective adjacent or second to Nth row to provide for data to move along successive rows down the array, the first and second delaying means being arranged to delay successive data bits as appropriate for cascading cumulative sum generation down each array column; and
(7) the processor includes accumulating means arranged to add cumulative sums of like bit significance output from different array columns and comprising contributions to like convolution or correlation terms.
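The behaviour of a single logic cell as set out in item (2) can be illustrated with a minimal behavioural sketch. It assumes unsigned bits, a carry that is zero at the start of each word, and no latch timing at all. The names `LogicCell`, `step` and `serial_multiply_add` are hypothetical and do not appear in the specification.

```python
class LogicCell:
    """Behavioural model of one cell holding a stationary coefficient bit."""

    def __init__(self, coeff_bit):
        self.coeff_bit = coeff_bit  # stationary coefficient bit
        self.carry = 0              # carry recirculated on the cell

    def step(self, data_bit, sum_in):
        """One clock cycle: returns (data bit passed on, cumulative sum bit out)."""
        product = data_bit & self.coeff_bit
        total = product + sum_in + self.carry  # full adder on three one-bit inputs
        self.carry = total >> 1                # kept for the succeeding computation
        return data_bit, total & 1


def to_bits(value, width):
    """Least significant bit first, matching LSB-leading serial input."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def serial_multiply_add(coeff_bit, data_word, sum_word, width):
    """Feed one data word and one cumulative sum word through a single cell."""
    cell = LogicCell(coeff_bit)
    out = [cell.step(d, s)[1]
           for d, s in zip(to_bits(data_word, width), to_bits(sum_word, width))]
    return from_bits(out)


print(serial_multiply_add(1, 5, 9, 8))  # coefficient bit 1: 9 + 5
print(serial_multiply_add(0, 5, 9, 8))  # coefficient bit 0: 9 unchanged
```

In this model a cell with coefficient bit 1 adds the data word to the incoming cumulative sum word bit serially, and a cell with coefficient bit 0 passes the cumulative sum through unchanged, provided the word width is large enough to hold the result.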
The processor of the invention has the advantage of being cascadable to form a device suitable for performing a larger convolution or correlation calculation. Cascading may be achieved simply by connecting last row data and cumulative sum outputs of one array to corresponding first row inputs of another. In this case, guard bands require additional extension to accommodate formation of larger partial sums. Furthermore, a cascaded assembly of processors of the invention exhibits a degree of fault tolerance. If one processor in the assembly is faulty, it can be bridged out by connecting data and cumulative sum outputs from one neighbouring processor to corresponding inputs of the other via clock-activated latches. This permits fault tolerant designs in which a cascaded assembly includes additional or redundant processors brought into play in the event that the array includes faulty devices. The bridging latches avoid the penalty of reduction in operating speed arising from long bypass lines. Such latches cannot be employed in prior art processors in which data and results propagate in counterflow, since they would disrupt calculation timing.
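The need for additional guard band extension on cascading can be illustrated by a short word-growth calculation. This is a sketch under stated assumptions, not the sizing rule of the specification: it assumes unsigned data words and that a column partial sum adds up to one data word per array row, so that a sum of `n_terms` words needs extra leading bits. The function names are hypothetical.

```python
def guard_bits(n_terms):
    """Extra bits needed so a sum of n_terms unsigned words cannot overflow.

    Equals log2(n_terms) rounded up to a whole number, for n_terms >= 1.
    """
    return (n_terms - 1).bit_length()


def extended_word_length(data_bits, n_terms):
    """Word length after guard band extension (assumed unsigned data)."""
    return data_bits + guard_bits(n_terms)


# One array with 4 rows versus two such arrays cascaded (8 rows in total).
print(extended_word_length(8, 4))
print(extended_word_length(8, 8))
```

Under these assumptions, doubling the number of cascaded rows adds one further guard bit.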
The logic cell array produces partial sums each arising from a number of bit-level partial products. These partial sums are accumulated in accordance with bit significance by the accumulating means, which is preferably arranged to sum contributions to successive convolution or correlation results in separate adding means. This avoids errors due to addition of terms attributable to different results. In the case of a processor arranged for input data to meet coefficient bits in descending order of significance, the accumulating means comprises a plurality of adder chains each with a two clock cycle delay between adjacent adders. For input data meeting coefficient bits in ascending order of bit significance, a plurality of adder trees is employed. Switching between adder trees or chains may be effected by the use of means responsive to a least significant bit flag to indicate emergence of partial sums attributable to a new convolution or correlation result from the array. The switching means may include a D-type flip-flop arranged for clocking by successive least significant bit flags. Each flag changes two flip-flop output states to effect switching.
When arranged to compute an N-point convolution or correlation, each data bit delaying means provides a delay of (1 + log₂N + p - q) clock cycles, where non-integer values of log₂N are rounded up to a whole number.
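The delay expression can be evaluated as follows. This is a minimal sketch; the function name `data_delay_cycles` and the sample parameter values are hypothetical, and the rounding of log₂N is done with integer arithmetic to avoid floating-point error.

```python
def data_delay_cycles(n_points, p, q):
    """Delay of 1 + log2(N) + p - q clock cycles, with log2(N) rounded up.

    n_points: N, the number of points (array rows)
    p: data word length after guard band extension
    q: coefficient word length (array columns)
    """
    if n_points < 1:
        raise ValueError("n_points must be at least 1")
    log2_rounded_up = (n_points - 1).bit_length()
    return 1 + log2_rounded_up + p - q


print(data_delay_cycles(4, 8, 4))  # log2(4) is exactly 2
print(data_delay_cycles(5, 8, 4))  # log2(5) is rounded up to 3
```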
In a preferred embodiment, a processor of the invention includes a switchable accumulator with each adder chain connected to a first input of a respective full adder arranged for carry recirculation. Each full adder has a second input connected via delaying means to a respective convolution sum input, and also has an output connected to a convolution or correlation sum output of the processor. In this embodiment the processor may be cascaded as part of a chain of like processors connected output to neighbouring input. Each processor is then suitable for accumulating a respective part of a convolution or correlation result. Successive parts are appropriately delayed and added together in the aforesaid full adders. Cascading accordingly only requires passage of results, data and least significant bit flags via delays to successive processors. This avoids the need for inter-column connection between adjacent processors. To provide for fault tolerance without operating speed penalties, the processor may include latched bypass lines for results, data and lsb flags, the bypass lines being implemented in the event of the processor being faulty.
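The cascading and bypass arrangement can be illustrated with a word-level sketch. It assumes unsigned values, models each processor as computing the part of a result belonging to its own block of coefficients, and treats a bypassed processor as simply passing data and results on unchanged. Clock-level delays are not modelled, and the names `cascaded_convolution`, `block_size`, `n_processors` and `faulty` are hypothetical.

```python
def direct_convolution(coeffs, data, n):
    """Reference result: sum over i of coeffs[i] * data[n - i]."""
    return sum(a * data[n - i] for i, a in enumerate(coeffs))


def cascaded_convolution(coeffs, data, n, block_size, n_processors,
                         faulty=frozenset()):
    """Split the coefficients over a chain of processors, skipping faulty ones.

    Returns the accumulated result and the list of processor slots used.
    """
    blocks = [coeffs[k:k + block_size]
              for k in range(0, len(coeffs), block_size)]
    working = [slot for slot in range(n_processors) if slot not in faulty]
    if len(working) < len(blocks):
        raise ValueError("not enough working processors in the chain")

    result = 0   # convolution sum passed from processor to processor
    offset = 0   # data delay seen by the current processor, in words
    used = []
    for slot, block in zip(working, blocks):
        # Each working processor adds its own part of the result.
        result += sum(a * data[n - offset - i] for i, a in enumerate(block))
        offset += len(block)
        used.append(slot)
    return result, used


coeffs = [3, 5, 2, 7, 1, 4]
data = [1, 4, 2, 6, 3, 5, 2, 8]
total, used = cascaded_convolution(coeffs, data, 7, 2, 4, faulty={1})
assert total == direct_convolution(coeffs, data, 7)
print(used)  # slot 1 is bypassed; the redundant slot 3 takes part instead
```

In this model the result is the same whichever processor in the chain is bypassed, so long as enough working processors remain to hold all the coefficient blocks.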