1. Field of the Invention
The present invention generally relates to multipliers for digital computing and control systems and, more particularly, to a low-latency two's complement bit-serial multiplier with high throughput.
2. Description of the Prior Art
There has been increasing interest over the last few years in the bit-serial approach to digital integrated circuit (IC) design. The major advantages which this approach offers are the limited interconnect requirement and the small size of computational elements compared with those needed for parallel computation. The bit-serial approach was advocated by Leland B. Jackson, James F. Kaiser and Henry S. McDonald in their article entitled "An Approach to the Implementation of Digital Filters", IEEE Transactions on Audio and Electroacoustics, vol. AU-16, no. 3, September 1968, pages 413 to 421, as offering savings in routing and computational hardware as compared with parallel architecture. It has further been demonstrated by a number of silicon compilers that it is also particularly suited to automatic chip generation. See, for example, Peter Denyer and David Renshaw, VLSI Signal Processing, A Bit-Serial Approach, Addison-Wesley (1985); Jeffrey R. Jasica, Sharbel Noujaim, Richard Hartley, and Michael J. Hartman, "A Bit-Serial Silicon Compiler", Proceedings of the IEEE International Conference on Computer-Aided Design, pages 91 to 93 (1985); Fathy F. Yassa, Jeffrey R. Jasica, Richard I. Hartley, and Sharbel E. Noujaim, "A Silicon Compiler for Digital Signal Processing: Methodology, Implementation and Applications", Proceedings of the IEEE, Special Issue on Hardware and Software for Digital Signal Processing, vol. 75, no. 9, September 1987, pages 1272 to 1282; and Rajeev Jain, Francky Catthoor, Jan Vanhoof, Bart J.S. De Loore, Gert Goossens, Nelson F. Goncalvez, Luc J.M. Calesen, Johan K.J. Van Ginderdeuren, Joos Vandewalle, and Hugh J. De Man, "Custom Design in a VLSI PCM-FDM Transmultiplexer from System Specifications to Circuit Layout Using a Computer-Aided Design System", IEEE Journal of Solid-State Circuits, vol. SC-21, no. 1, February 1986, pages 73 to 85.
The area saving achieved by bit-serial computation is particularly important in the case of multipliers, since parallel multipliers can use up large amounts of chip area. Using bit-serial multipliers allows the designer to place a large number of multipliers on a single chip. In order to describe the characteristics which a bit-serial multiplier should have for use in a bit-serial design, it is necessary to describe the bit-serial architecture. In a bit-serial circuit, data flows from one computational element to another along serial lines. The steady stream of bits is divided into words of a fixed number of bits in length. Arithmetic data values are represented in two's complement format and are passed least significant bit first. Separate words of data follow each other directly with no idle bits separating them. Each computational element receives a synchronized control signal (if needed) to indicate to it where one word ends and the next starts. This signal may be synchronized with the most significant bit (sign bit) of each word. Since data flows least significant bit first, the sign bit is the last bit of the word. The other obvious alternative is to synchronize the control signal with the first bit of each word. The control signal is typically used for such purposes as resetting registers and carries.
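The word format just described can be illustrated with a minimal software sketch. It is only a behavioral model, not a circuit description; the function names `to_bit_serial` and `from_bit_serial` are hypothetical, and the control signal is assumed to be synchronized with the sign bit, which is one of the two alternatives mentioned above.

```python
def to_bit_serial(words, n):
    """Serialize integers as n-bit two's complement words, LSB first.

    Returns (bits, ctrl). ctrl[i] is 1 exactly when bits[i] is the sign
    bit (the last bit) of a word; words follow each other with no gaps.
    """
    bits, ctrl = [], []
    for w in words:
        if not -(1 << (n - 1)) <= w < (1 << (n - 1)):
            raise ValueError(f"{w} does not fit in {n} bits")
        u = w & ((1 << n) - 1)  # two's complement encoding of w
        for i in range(n):
            bits.append((u >> i) & 1)
            ctrl.append(1 if i == n - 1 else 0)
    return bits, ctrl


def from_bit_serial(bits, ctrl):
    """Recover the signed words, using ctrl to find the word boundaries."""
    words, acc, pos = [], 0, 0
    for b, c in zip(bits, ctrl):
        if c:
            acc -= b << pos  # the sign bit carries weight -2**pos
            words.append(acc)
            acc, pos = 0, 0
        else:
            acc += b << pos
            pos += 1
    return words


bits, ctrl = to_bit_serial([3, -2], 4)
print(bits)  # [1, 1, 0, 0, 0, 1, 1, 1]
print(ctrl)  # [0, 0, 0, 1, 0, 0, 0, 1]
```

In this model the control signal plays the role described in the text: it tells the receiver where one word ends, so that per-word state (here `acc` and `pos`) can be reset.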
Each operator in a bit-serial circuit has a fixed latency, which is the number of cycles that elapse between the time that the first bit of the input arrives and the time that the first bit of the corresponding output is available. Since each operator may have a different latency, it is necessary to insert delays (implemented as shift registers) into the circuit in order to synchronize the different inputs to an operator.
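The delay insertion described above amounts to padding every input of an operator up to the arrival time of the latest one. A minimal sketch follows; the function name `balance_inputs` and the latency figure in the example are assumptions chosen for illustration only.

```python
def balance_inputs(arrival_times):
    """Return the shift-register delay (in cycles) needed on each input.

    arrival_times maps an input name to the cycle at which its first bit
    arrives. Every input is delayed to match the latest arrival.
    """
    latest = max(arrival_times.values())
    return {name: latest - t for name, t in arrival_times.items()}


# Hypothetical example: computing (a*b) + c with a multiplier whose
# latency is assumed to be 3 cycles. The product reaches the adder at
# cycle 3 and c at cycle 0, so c needs a 3-stage shift register.
print(balance_inputs({"a*b": 3, "c": 0}))  # {'a*b': 0, 'c': 3}
```

The sum of the returned delays is a rough measure of the shift-register hardware that synchronization costs, which is why operators with large latencies are expensive.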
The latency of a single-precision output multiplier depends strongly on which n bits of the result are provided at the output because the low order bits are naturally available before the higher order bits. In analyzing the latency of such multipliers, the following definition of latency will be used. If bit k of the double precision product is available at the output at time M after the start of the calculation, then the multiplier is defined to have latency M-k. Note that this definition is consistent, since bit k+1 will normally be provided at the output at time M+1.
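The definition can be restated as a small sketch. The function names are hypothetical, and the numbers in the example are illustrative rather than taken from any particular multiplier.

```python
def latency(M, k):
    """Latency per the definition above: bit k of the product appears at time M."""
    return M - k


def output_time(lat, k):
    """Cycle at which bit k of the double precision product appears."""
    return lat + k


n = 8
# With latency n, the first bit of the high order word (bit n) appears
# at cycle 2n; with latency 0 it appears at cycle n.
print(output_time(n, n))  # 16
print(output_time(0, n))  # 8
# Consistency: observing bit k+1 one cycle later gives the same latency.
print(latency(16, 8) == latency(17, 9))  # True
```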
In an important paper on bit-serial multipliers by R.F. Lyon entitled "Two's Complement Pipeline Multipliers", IEEE Transactions on Communications, vol. COM-12, April 1976, pages 418 to 425, a number of bit-serial multipliers which carry out multiplication of two's complement numbers were proposed. Lyon's basic two's complement multiplier shown in FIG. 5 of the paper has a latency of n for the multiplication of two n-bit quantities. It will accept a new value every n+1 cycles, which means one stuff bit must be inserted. As presented by Lyon, it will provide at the output only the high order word of the product. However, it is an easy matter to modify the circuit so that the lower order part of the product is also made available. It is also possible to modify the multiplier in such a way that the latency is reduced to zero by removing delays between stages as in FIG. 4 of Lyon's paper. This, however, means that the sum output must ripple through n adders at each cycle. This will have a deleterious effect on clock rate and is contrary to the general bit-serial architectural style.
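For orientation, the arithmetic that any such two's complement serial multiplier must perform can be sketched as a shift-and-add loop. This is a purely behavioral model under stated assumptions; it is not Lyon's circuit and does not model pipelining, stuff bits, or adder ripple. The name `serial_multiply` is hypothetical. One operand is held in parallel while the other is consumed least significant bit first, and the partial product for the sign bit is subtracted because that bit carries negative weight.

```python
def serial_multiply(a, b, n):
    """Multiply n-bit two's complement values, consuming b LSB first.

    One product bit becomes final per input bit; after n steps the
    accumulator holds the high order word as a signed value.
    """
    mask = (1 << n) - 1
    acc = 0
    low_bits = []
    for i in range(n):
        bit = ((b & mask) >> i) & 1
        term = a * bit
        if i == n - 1:
            term = -term  # the sign bit of b has weight -2**(n-1)
        acc += term
        low_bits.append(acc & 1)  # product bit i can no longer change
        acc >>= 1                 # arithmetic shift keeps the sign
    low = sum(bit << i for i, bit in enumerate(low_bits))
    return (acc << n) | low       # high order word joined to low order word


print(serial_multiply(3, -5, 4))  # -15
```

The model makes visible why the low order bits are available before the high order bits: bit i of the product is fixed as soon as bit i of the serial operand has been processed.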
Fully pipelined low latency bit-serial multipliers are described by I-Ngo Chen and Robert Willoner in "An O(n) Parallel Multiplier with Bit-Sequential Input and Output", IEEE Transactions on Computers, vol. C-28, no. 10, October 1979, pages 721 to 727, and by Noel R. Strader and V. Thomas Rhyne in "A Canonical Bit-Sequential Multiplier", IEEE Transactions on Computers, vol. C-31, no. 8, August 1982, pages 791 to 795. These multipliers have two main drawbacks. Firstly, they are not easily extended to two's complement calculation, operating only on unsigned integers. Secondly, they can accept new input data only every 2n cycles.
The design of the Chen et al. and Strader et al. multipliers was modified as described by Joseph T. Scanlon and W. Kent Fuchs in "High Performance Bit-Serial Multiplication", Proceedings of the IEEE International Conference on Computer Design, pages 114 to 117 (1986). This modification allowed new input data every n+1 cycles. This was accomplished by means of an ingenious, but somewhat cumbersome, bidirectional array of multiplier slices. The design is easily further modified to allow new samples every n cycles by the addition of one extra bit slice. The main drawback of this multiplier, however, is that it does not handle two's complement numbers easily. Furthermore, the external control circuitry required is complex, since control signals and input data must be fed to alternate ends of the multiplier array. In addition, the output data of consecutive calculations come from alternate ends of the multiplier array. Extensive multiplexing is necessary to make this behaviour transparent.
The multiplier proposed by Scanlon et al. relies on the observation that the individual cells in the Chen et al. and Strader et al. arrays are underused, being used on the average only half of the time. The bidirectional scheme which Scanlon et al. propose takes advantage of this. However, their solution still under-utilizes the computational elements, since data furnished to one of the inputs of the adders is fixed as zero during the entire second phase of the multiplication.
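The input acceptance intervals quoted above can be compared as a fraction of the ideal rate of one n-bit word every n cycles. The sketch below only restates those figures; the function name `throughput` and the choice n = 16 are assumptions for illustration.

```python
from fractions import Fraction


def throughput(n, cycles_per_product):
    """Fraction of the ideal rate (one n-bit word accepted every n cycles)."""
    return Fraction(n, cycles_per_product)


n = 16
print(throughput(n, 2 * n))  # 1/2   new data every 2n cycles
print(throughput(n, n + 1))  # 16/17 new data every n+1 cycles
print(throughput(n, n))      # 1     new data every n cycles, no gaps
```

The value 1/2 for the 2n-cycle case matches the observation that the cells of those arrays are in use only about half of the time.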
The main problem to be solved is how to make a bit-serial multiplier with low latency handle a continuous stream of input data without gaps, using two's complement operations, at a reasonable hardware cost. This is easy with high latency multipliers, but no solution for low latency multipliers has been previously proposed. Small latency is desirable for three reasons. First, many applications require semiconductor chips to have a small latency in order to meet exacting timing schedules at the chip level, and a small latency multiplier will cut down the latency of a complete chip. Second, within a single chip, small latency computational elements lead to area savings because of the reduced number of delays needed in the chip. If latencies are large, it will often become necessary to delay signals by large amounts in order to synchronize the circuit. This delay represents a hardware cost, since it is usually implemented using a shift register. Third, small latencies are desirable for feedback loops. For instance, in the implementation of an expression such as y(t)=y(t-1)*m(t), where t represents time, it is necessary that the latency of the multiplication be at most equal to one sample period in order that the result, y(t), be ready by the time the next input data, m(t+1), is available.
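The feedback constraint can be expressed as a simple check. The sketch assumes that one sample period equals one word time of n cycles, as in a continuous stream without gaps; the function name `loop_padding` is hypothetical.

```python
def loop_padding(multiplier_latency, word_length):
    """Delay (in cycles) to add to the loop y(t) = y(t-1) * m(t).

    Assumes one sample period is word_length cycles. The loop is only
    realizable if the multiplier latency does not exceed that period;
    any slack is filled with shift-register delay.
    """
    if multiplier_latency > word_length:
        raise ValueError("latency exceeds one sample period; loop cannot close")
    return word_length - multiplier_latency


print(loop_padding(0, 16))   # 16: a zero latency multiplier needs full padding
print(loop_padding(16, 16))  # 0: latency exactly fills the sample period
```

A multiplier whose latency exceeds the sample period cannot be used in such a loop at all, which is why the latency bound matters independently of area.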
Another attribute of a multiplier which may be required in the bit-serial compiler context is the capability of providing a full-precision product. The product of two n bit two's complement numbers is a 2n-1 bit quantity. For such applications as accumulation of partial products, it is often desirable to retain full precision from the output data. There is another reason for retaining both words of precision in a multiplier, even if the required output data is only single precision. Depending on the position of the implied binary point in the data word, a different set of bits will be required in the output data. In general, if two numbers of n bits with k fractional binary places (call this n.k format) are multiplied, then by retaining bits <n+k-1:k> of the full precision product, the correct product in the same n.k format is obtained. By providing all 2n bits of the result, the multiplier has the capacity of doing fixed point multiplication of numbers with any number of fractional places.
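The bit selection for n.k format can be sketched as follows. The function name `fixed_mul` is hypothetical, operands are the raw integers of the n.k representation, and results that overflow n bits wrap around in this model.

```python
def fixed_mul(a, b, n, k):
    """Multiply two n.k fixed point values given as raw signed integers.

    Keeps bits <n+k-1:k> of the full precision product and reinterprets
    that n-bit window as a two's complement number.
    """
    full = a * b                           # full precision product
    window = (full >> k) & ((1 << n) - 1)  # bits n+k-1 down to k
    if window & (1 << (n - 1)):            # sign bit of the window is set
        window -= 1 << n
    return window


# 8.4 format: 1.5 is 24 and -2.25 is -36; the product -3.375 is -54.
print(fixed_mul(24, -36, 8, 4))  # -54
# 8.7 format: 0.5 is 64; 0.5 * 0.5 = 0.25 is 32.
print(fixed_mul(64, 64, 8, 7))   # 32
```

The same full precision product serves every value of k, which is the point made above: a multiplier that exposes all of its product bits lets the position of the binary point be chosen afterwards.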