This invention relates to a pipeline active filter of the type commonly used as a convolver or correlator for image enhancement, data filtering, correlation, pattern extraction, Synthetic Aperture Radar (SAR) data processing, and the like.
These applications for an active filter can all use the same general group of convolution operations, namely summation of the weighted values of an input data stream of picture elements (pixels) representing (usually) a two-dimensional image. Weighting is accomplished by multiplying each pixel value by a set of 35 by 35 weighting factors, for example, to create a new output value for each pixel.
In conventional processing, using a standard digital computer or an array processor, the data are processed sequentially, using a repetition of multiplication and summation operations on each pixel value. Thus, on an image of 1000 by 1000 pixels, filtered by a 35 by 35 weight mask (a realistic requirement), the data must be accessed and multiplied 1000 × 1000 × 35 × 35 times (that is, 1.225 billion times) to produce one full image. Processing on this scale is very slow and expensive, and thus greatly limits application of convolution processing. When compared to the speed of acquisition of the data, even from a spacecraft transmitting slow-scan television frames, the disparity is seen to be great.
One solution is to provide more than one multiplier, and to process the data in a pipeline fashion, thereby arranging to hold the input data stream access requirements to a minimum. If the process were embodied in dedicated VLSI hardware, rather than in software (computer program), this solution could be more readily accomplished, and produced in quantities at a reasonable cost. In processing a 1000 by 1000 pixel image by a kernel of 35 by 35 weights, each pixel need only be accessed once and multiplied by all weights simultaneously, with the result that the entire image processing operation requires only 1000 by 1000 successive accesses--a saving of 1225 to one. (For two-dimensional convolution the incoming pixels are delayed for one image line length between each row of the kernel).
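By way of illustration only, the access counts in the foregoing example may be checked with the short Python sketch below. The function and constant names are hypothetical and form no part of the disclosure; the sizes are those of the example.

```python
# Sizes taken from the example in the text.
IMAGE_ROWS, IMAGE_COLS = 1000, 1000
KERNEL_ROWS, KERNEL_COLS = 35, 35


def sequential_accesses(rows, cols, krows, kcols):
    """Sequential processing: one access and multiply per (pixel, weight) pair."""
    return rows * cols * krows * kcols


def pipelined_accesses(rows, cols):
    """Pipeline processing: each pixel is accessed once and meets all weights at once."""
    return rows * cols


sequential = sequential_accesses(IMAGE_ROWS, IMAGE_COLS, KERNEL_ROWS, KERNEL_COLS)
pipelined = pipelined_accesses(IMAGE_ROWS, IMAGE_COLS)
saving = sequential // pipelined
print(sequential, pipelined, saving)
```

The ratio of the two counts is simply the number of weights in the kernel, 35 × 35 = 1225.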
One approach suggested by Professor Carver Mead of California Institute of Technology to the present invention is to use the modular algorithm described by Danny Cohen, "Mathematical Approach to Computational Networks," Information Sciences Institute, U.S.C. ISI/RR-78-73 ARPA Order No. 2223, November 1978. That algorithm is diagrammed in FIG. 1. The pixel data (typically 8 bits per pixel) is input at X for multiplication by weights W₁, W₂, W₃ and W₄. Each section adds the new product (temporary product) to the output sum, S, of the previous section, indicated by a plus sign in a circle. The product sum from the previous section is passed through a unit time delay Z. Note that no delays are needed in the "upper" line.
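The behavior of the section chain just described may be sketched in Python as follows. This is a minimal simulation under stated assumptions, not the circuit of FIG. 1 itself: the names `pipeline_filter` and `direct_filter` are hypothetical, the first section is assumed to receive a zero sum, and all delays are assumed to start cleared.

```python
def pipeline_filter(pixels, weights):
    """Simulate the chain of sections, one pixel per clock period.

    Every section sees the same pixel at the same time (no delay in the
    pixel line) and adds its product to the delayed sum of the previous
    section.
    """
    delayed = [0] * len(weights)      # delayed[k]: output of the Z delay feeding section k
    outputs = []
    for x in pixels:
        sums = [w * x + d for w, d in zip(weights, delayed)]
        outputs.append(sums[-1])      # the last section delivers the filter output
        delayed = [0] + sums[:-1]     # each sum reaches the next section one clock later
    return outputs


def direct_filter(pixels, weights):
    """Reference: the same weighted sum computed sequentially."""
    n = len(weights)
    result = []
    for t in range(len(pixels)):
        total = 0
        for k, w in enumerate(weights):
            index = t - (n - 1 - k)   # weight k meets the pixel that is (n-1-k) clocks old
            if index >= 0:
                total += w * pixels[index]
        result.append(total)
    return result


pixels = [1, 2, 3, 4, 5]
weights = [1, 2, 3, 4]                # W1..W4 in section order
print(pipeline_filter(pixels, weights))
```

Both functions yield the same sequence, but the pipelined form touches each pixel only once.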
In a digital processor, multiplication is carried out by a succession of additions, and when carried out in a digital computer using binary arithmetic, carry operations usually take up most of the operating time. Consequently much effort has gone into the design of addition/carry algorithms and circuits to reduce the carry time in digital processors. The need to propagate the carry can be made to occur less frequently than the remaining internal addition operations if some additions are carried out with carry-save operations so that only one carry-propagate operation occurs per several addition operations. The digital processor will then be more efficient.
A suitable logic design for the multipliers is shown in FIGS. 2a and 2c which can be implemented in VLSI chips as described by Rodney T. Masumoto in a thesis for an Electrical Engineer degree at California Institute of Technology, May 18, 1978. The logic design implements the special case of ternary multiplication often referred to as a modified Booth algorithm summarized in the following truth table, wherein the columns headed Yᵢ₊₁, Yᵢ and Yᵢ₋₁ represent three successive bits of a multiplier, and the respective notations 1X and 2X mean one times and two times a multiplicand.
______________________________________
BOOTH ALGORITHM TRUTH TABLE
Yᵢ₊₁  Yᵢ   Yᵢ₋₁   ADD  SUB  1X   2X   Operation
______________________________________
0     0    0      1    0    0    0    ADD ZERO
0     0    1      1    0    1    0    ADD 1X
0     1    0      1    0    1    0    ADD 1X
0     1    1      1    0    0    1    ADD 2X
1     0    0      0    1    0    1    SUB 2X
1     0    1      0    1    1    0    SUB 1X
1     1    0      0    1    1    0    SUB 1X
1     1    1      0    1    0    0    SUB ZERO
______________________________________
The column on the right is an interpretation of the operation to be executed in view of the outputs in the four columns headed ADD, SUB, 1X and 2X.
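The use of the truth table in a multiplication may be sketched in Python as follows. This is an illustrative model only, under these assumptions: the multiplier is treated as a two's-complement number of even width `bits` (an unsigned 8-bit pixel can be handled by zero-extending it, for example with `bits=10`), the bit below the least significant bit is taken as 0, and the names `BOOTH_TABLE`, `booth_digit` and `booth_multiply` are hypothetical.

```python
# (Y_{i+1}, Y_i, Y_{i-1}) -> (ADD, SUB, 1X, 2X), copied from the truth table.
BOOTH_TABLE = {
    (0, 0, 0): (1, 0, 0, 0),  # ADD ZERO
    (0, 0, 1): (1, 0, 1, 0),  # ADD 1X
    (0, 1, 0): (1, 0, 1, 0),  # ADD 1X
    (0, 1, 1): (1, 0, 0, 1),  # ADD 2X
    (1, 0, 0): (0, 1, 0, 1),  # SUB 2X
    (1, 0, 1): (0, 1, 1, 0),  # SUB 1X
    (1, 1, 0): (0, 1, 1, 0),  # SUB 1X
    (1, 1, 1): (0, 1, 0, 0),  # SUB ZERO
}


def booth_digit(y_next, y_cur, y_prev):
    """Turn one table row into a signed multiple of the multiplicand (-2..+2)."""
    add, sub, one_x, two_x = BOOTH_TABLE[(y_next, y_cur, y_prev)]
    magnitude = one_x + 2 * two_x
    return -magnitude if sub else magnitude


def booth_multiply(multiplicand, multiplier, bits=8):
    """Multiply by examining the multiplier three bits at a time, two bits per step."""
    y = multiplier & ((1 << bits) - 1)   # two's-complement bit pattern of the multiplier
    product = 0
    y_prev = 0                           # assumed value of the bit below bit 0
    for i in range(0, bits, 2):
        y_cur = (y >> i) & 1
        y_next = (y >> (i + 1)) & 1
        product += (booth_digit(y_next, y_cur, y_prev) * multiplicand) << i
        y_prev = y_next                  # the top bit of this set is the bottom bit of the next
    return product


print(booth_multiply(1234, -57))
```

An 8-bit multiplier is consumed in four steps rather than eight, each step contributing zero, one or two times the multiplicand with either sign.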
A constant shift of two bits of the multiplier Y occurs between examinations of the multiplier bit sets Yᵢ₊₁, Yᵢ, Yᵢ₋₁. After each shift, the logic looks at the present two multiplier bits Yᵢ and Yᵢ₊₁ and the previous bit Yᵢ₋₁. (In conventional multipliers, the multiplier bits are examined one at a time). The multiplication action, controlled by the logic diagram of FIG. 2a acting through the logic diagram of FIG. 2c, consists merely of shifting or not shifting the multiplicand bit under 2X or 1X control, and inverting or not inverting it under ADD or SUB control. The resulting bit is added in a carry-save adder shown in FIG. 3a before the next bits of the multiplier (pixel) are examined.
The logic circuit for decoding the set of three multiplier (pixel) bits shown in FIG. 2a may be implemented with FET NOR gates as shown in FIG. 2b. The logic requires an exclusive-OR gate 1 to form the command 1X = Yᵢ ⊕ Yᵢ₋₁, which adds or subtracts the multiplicand depending on whether Yᵢ₊₁ is a bit 0 or a bit 1, and an exclusive-OR gate 2 followed by an AND function gate 3 to form the command 2X = NOT(Yᵢ ⊕ Yᵢ₋₁)·(Yᵢ₊₁ ⊕ Yᵢ), which adds or subtracts twice the multiplicand, again depending on whether Yᵢ₊₁ is a bit 0 or a bit 1. The complement on the first factor keeps 2X inactive whenever 1X is active, as the truth table requires. The logic symbols employed are conventional, with a small circle at the output signifying an inverting logic element.
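The decoder equations may be checked exhaustively with the Python sketch below. It assumes, as the truth table requires, that the 2X command is asserted only when Yᵢ and Yᵢ₋₁ are equal; the function name `decode` is hypothetical, and the sketch models the logic equations only, not the NOR-gate circuit of FIG. 2b.

```python
from itertools import product


def decode(y_next, y_cur, y_prev):
    """Gate-level form of the decoder: returns (ADD, SUB, 1X, 2X)."""
    one_x = y_cur ^ y_prev                          # exclusive-OR gate 1
    two_x = (1 - (y_cur ^ y_prev)) & (y_next ^ y_cur)  # exclusive-OR gate 2, then AND gate 3
    sub = y_next                                    # subtract when Y_{i+1} is a bit 1
    add = 1 - y_next                                # add when Y_{i+1} is a bit 0
    return add, sub, one_x, two_x


# For every bit set, the decoded commands must encode the signed multiple
# -2*Y_{i+1} + Y_i + Y_{i-1} of the multiplicand.
checked = 0
for y_next, y_cur, y_prev in product((0, 1), repeat=3):
    add, sub, one_x, two_x = decode(y_next, y_cur, y_prev)
    magnitude = one_x + 2 * two_x
    digit = -magnitude if sub else magnitude
    assert digit == -2 * y_next + y_cur + y_prev
    assert not (one_x and two_x)                    # the two commands never coincide
    checked += 1
print(checked, "bit sets checked")
```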
It should be noted that if a pixel is to be multiplied by a set of weights simultaneously by an array of multipliers using the modified Booth decoder, only one decoder is required, but a separate shifter/inverter circuit is required for each weight. Such a circuit defined by the logic diagram in FIG. 2c can be implemented with FET devices as shown in FIG. 2d.
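The function of one shifter/inverter may be sketched in Python as follows. This is a word-level model under stated assumptions, not the circuit of FIG. 2d: the names `shift_invert` and `as_signed` and the parameter `width` are hypothetical, and subtraction is assumed to be completed by supplying a carry-in of 1 along with the inverted bits (ordinary two's-complement negation), a detail the text above does not spell out.

```python
def shift_invert(multiplicand, sub, one_x, two_x, width):
    """Return (bits, carry_in) for one Booth step on a `width`-bit word.

    `width` must be large enough to hold twice the multiplicand.
    """
    mask = (1 << width) - 1
    if two_x:
        selected = multiplicand << 1   # shift: twice the multiplicand
    elif one_x:
        selected = multiplicand        # no shift: the multiplicand itself
    else:
        selected = 0                   # neither command: zero
    selected &= mask                   # keep `width` two's-complement bits
    if sub:
        return (~selected) & mask, 1   # invert; the +1 enters as a carry (assumption)
    return selected, 0


def as_signed(bits, width):
    """Interpret `width` bits as a two's-complement integer."""
    return bits - (1 << width) if bits >> (width - 1) else bits


WIDTH = 18
bits, carry_in = shift_invert(-1234, sub=1, one_x=0, two_x=1, width=WIDTH)
print(as_signed((bits + carry_in) & ((1 << WIDTH) - 1), WIDTH))
```

Because the same decoded commands drive every shifter/inverter, one call per weight with shared `sub`, `one_x` and `two_x` values models the array of multipliers fed by a single decoder.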
One advantage of the circuits in FIGS. 2b and 2d is that they can be implemented with n-MOS integrated circuit techniques for a very large scale integrated (VLSI) circuit, together with the carry-save adder of FIG. 3. The main advantage of this modified Booth decoder and shifter/inverter for multiplication, as used in the present invention, is that it substantially reduces the number of addition operations, and the time required for those operations. That is because carries are saved until a final product sum is to be formed in a full adder, at which time carry propagation is allowed while the next pixel is being multiplied by the same weight. FIG. 4 illustrates a full adder which can be implemented with n-MOS integrated circuit techniques as described by Masumoto, supra.
An entire array of multiplier/adder circuits tends to defeat the desire for providing many multipliers in a small area, but if each of the multipliers uses only one row of adders in the add-shift manner of the modified Booth algorithm, much space is saved, and the operation will still be much faster than a conventional computer process. This speed is made possible by a carry-save adder at each multiplier position implemented as shown in FIG. 3 and described by Masumoto, supra. The following is the truth table of the carry-save adder.
______________________________________
Sᵢ   X    Cᵢ   Sₒ   Cₒ
______________________________________
0    0    0    0    0
0    0    1    1    0
0    1    0    1    0
0    1    1    0    1
1    0    0    1    0
1    0    1    0    1
1    1    0    0    1
1    1    1    1    1
______________________________________
While the truth table of the carry-save (half) adder is identical to that of a full adder, there are subtle circuit and operational differences. A full adder propagates carries; a carry-save adder defers the carry propagation to the next adder cycle. Thus, there is no carry assimilation delay.
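The deferral of carry propagation may be illustrated with the Python sketch below. It is a word-level model only: each bit position is treated as an independent adder cell obeying the truth table above, the saved carries are held in a second word, and a single ordinary addition at the end stands in for the full adder. The names `carry_save_add` and `accumulate` are hypothetical.

```python
def carry_save_add(s, c, x):
    """One carry-save step: add the new word x to the saved sum s and saved carry c.

    No carry ripples from bit to bit; each carry is merely moved one
    position up and saved for the next step.
    """
    sum_out = s ^ c ^ x                               # S_o column of the truth table
    carry_out = ((s & c) | (s & x) | (c & x)) << 1    # C_o column, weighted one bit higher
    return sum_out, carry_out


def accumulate(addends):
    """Add several words with carry-save steps and one final carry-propagate addition."""
    s, c = 0, 0
    for x in addends:
        s, c = carry_save_add(s, c, x)
    return s + c                                      # the only carry propagation


print(accumulate([13, 7, 250, 99, 1]))
```

After every step the saved sum and saved carry together still represent the running total, which is why only one carry-propagate addition is needed however many words are accumulated.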
FIG. 3b illustrates a conventional logic diagram of an adder that uses two exclusive-NOR gates. FIG. 4 illustrates an equivalent logic diagram using only NOR gates which, although it requires more gates, can be implemented more easily with n-MOS integrated circuit techniques. Both of these may be used to implement a carry-save adder or a full adder. The difference is only in how they are used. In a carry-save adder, the carry and sum outputs Cₒ and Sₒ are both saved in storage devices, and the carry is not propagated. Instead, both the carry and sum are added to a new bit, indicated as X in FIG. 3b, during a following bit multiplication cycle. In a full adder shown in FIG. 4, there is only one output, Sₒ. The carry, Cₒ, is not an output of the adder; instead it is an internal signal that is propagated as Cᵢ to a stage of higher order. The carry from the next lower adder is shown as Cᵢ in FIG. 4. The other two inputs, S and C, are stored sum and carry bits from a carry-save adder.
Both the carry-save adder and the full adder thus implemented in VLSI chips are used by Masumoto, supra, differently. In a 16 by 16 bit multiplier accumulator using a modified Booth decoder to sum successive products, Masumoto uses eight cascaded carry-save adders and one final adder with carry propagation, a full adder. The present invention utilizes only one carry-save adder for forming a product by successive additions of the multiplicand without carry propagation, and a full adder to assimilate the separately stored carry with the sum for a binary number sometimes referred to hereinafter as a "temporary product." The full adder is then used a second time to add this temporary product to a sum of products, thus allowing successive products to be summed in pipeline multiplying units while processing a stream of data, such as pixels in successive lines of a stored frame of pixels.
The carry-save adder as used in this invention accepts three signals: the local binary product X, a carry signal Cᵢ₊₁ from the next more significant adder, and a sum signal Sᵢ₊₂ from the carry-save adder of the second next more significant digit. The carry and sum have been calculated during the previous clock cycle, so that in the logic diagram of FIG. 3a the inputs Cᵢ and Sᵢ for the outputs Sₒ and Cₒ are actually Cᵢ₊₁ and Sᵢ₊₂. In contrast, a full adder immediately adds the carry that is produced to the sum of its next more significant neighbor to produce the correct number.
Since the correct product must ultimately be produced, both carry-save adders and full adders are required by the present invention. As just noted above, however, only one carry-save adder and one full adder are needed for each binary digit of the product. A unit to multiply an 8-bit pixel by a 16-bit weight therefore requires an array of 25 multipliers (24 bits plus carry) or, for a truncated product, a lesser number such as 22, each multiplier being comprised of a common modified Booth decoder, a shifter/inverter, a carry-save adder and a full adder. These multiplier units use a triple-bit examination approach to reduce the multiply/add operations (and circuits) to half of the conventional equivalent. Together with the reduced ratio of carry operations to sum operations, this makes the process quite fast in this active filter.