A. Field of the Invention
This invention relates to a multibit multiplier which increases performance for the multiply operations of a large scale processor by providing a building block which may be used in a multiplication pipeline of arbitrary size.
B. Prior Art
In the past, multiplication was most often performed in digital data processors by repetitively adding the multiplicand to itself the number of times indicated by the multiplier. As speed increased, this repetitive addition by the machine's adder gave way to the use of individual multiplier circuits. Today, most contemporary large scale processors utilize specialized networks to accomplish faster multiply operations. These networks are designed to simultaneously process several multiplier bits. Further, a number of specialized algorithms have been developed to simplify the logic used in the building of these arrays. Generally, this simplification involves the decoding of multiplier groups and interaction between the multiplier groups to control the array.
This approach, of course, complicates the control aspect, while simplifying the array. Such arrays are commonly built of a combination of adders and multiplexers, with multiplier decoders used for control. The adders used may be of the conventional type or they may take the carry/save form.
Another common feature in past digital data processors has been the inclusion of error detecting networks in all data paths of the machine. However, present day large scale integrated (LSI) devices do not provide for the inclusion of these error detecting networks in the usual manner. This leaves duplication and result comparison as the only available method of error detection for these LSI devices.
One specific earlier example of a fast multiplier was suggested by C. S. Wallace in an article entitled "A Suggestion for a Fast Multiplier", published in IEEE TRANSACTIONS ON ELECTRONIC COMPUTERS on pages 14-17 of the February, 1964 issue. The article suggests that the economics of large scale scientific computers of that day could benefit from greater investment in hardware to mechanize multiplication than was common in computers of a still earlier vintage.
Multiplication of binary fractions is normally implemented as the addition of a number of summands, each some simple multiple of the multiplicand, chosen from a limited set of available multiples on the basis of one or more multiplier digits. No good reason is known to depart from this general scheme. Acceleration of the process must then be based on one or more of the following expedients: (1) reduction in the number of summands; (2) acceleration of the formation of summands; (3) acceleration of the addition of summands. Acceleration of addition will be discussed initially.
The basic addition processes usually employed in computers add two numbers together. The possibility exists of adding together more than two numbers in a single adder to produce a single sum. However, the logical complexity of the adder required appears to grow quite disproportionately to the resulting increase in speed; consequently, there appears to be no advantage in trying to sum even three numbers at a time into a single sum.
Another expedient, now quite commonly used, is to employ a pseudoadder which adds together three numbers, but rather than producing a single sum, produces two numbers whose sum equals that of the original three. In the context of the basic problem of adding together many summands, one pass through such an adder reduces the number of summands left to be summed by one, as does a pass through a conventional adder. The advantage of the pseudoadder is that it can operate without carry propagation along its digital stages and hence is much faster than the conventional adder. Thus, rather than allowing the possibility of a carry propagation from the lowest-order digit to the highest-order digit, carry propagation is allowed only to the next-highest-order digit. A simple form for such an adder is a string of full adder circuits of the normal sort, where the carry inputs are used for the third input number, and the carry outputs for the second output number. In multiplication, one pseudoadder is usually used, and storage is provided for two numbers. On each pass through the adder, the two stored numbers and one multiple of the multiplicand are added, and the resulting two numbers returned to storage.
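The pseudoadder described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming non-negative integers stand in for the binary words; the function names `pseudo_add` and `accumulate` are hypothetical and do not come from the Wallace article.

```python
def pseudo_add(a, b, c):
    """Reduce three non-negative integers to two numbers with the same total."""
    # Per-digit sum output of a string of full adders: no carry propagation.
    sum_word = a ^ b ^ c
    # Per-digit carry output, handed only to the next-highest-order digit.
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


def accumulate(summands):
    """One pseudoadder plus storage for two numbers, used once per summand."""
    stored_sum, stored_carry = 0, 0
    for multiple in summands:
        stored_sum, stored_carry = pseudo_add(stored_sum, stored_carry, multiple)
    # A single conventional (carry-propagating) addition finishes the job.
    return stored_sum + stored_carry
```

Only the final addition in `accumulate` needs carries to ripple along the full word; every earlier pass is a digit-by-digit operation.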
In any scheme employing pseudoadders, the number of adder passes occurring in a multiplication before the product is reduced to the sum of two numbers will be two less than the number of summands, since each pass through an adder converts three numbers to two, reducing the count of numbers by one. To improve the speed of the multiplication, one must arrange many of these passes to occur simultaneously by providing several pseudoadders.
Assuming that all summands are generated simultaneously, the best possible first step is to group the summands into threes, and introduce each group into its own pseudoadder, thus reducing the count of numbers by a factor of 1.5 (or a little less, if the number of summands is not a multiple of three). The best possible second step is to group the numbers resulting from the first step into threes and again add each group in its own pseudoadder. By continuing such steps until only two numbers remain, the addition is completed in a time proportional to the logarithm of the number of summands.
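The grouping-into-threes procedure can be checked by simple counting. The following Python sketch is an illustration under the assumption that every complete group of three is reduced in each step and any leftover numbers are passed through unchanged; the name `reduction_schedule` is hypothetical.

```python
def reduction_schedule(n_summands):
    """Return (steps, passes) needed to reduce n_summands numbers to two."""
    count, steps, passes = n_summands, 0, 0
    while count > 2:
        groups = count // 3   # each complete group of three feeds one pseudoadder
        count -= groups       # every pseudoadder turns three numbers into two
        passes += groups
        steps += 1
    return steps, passes
```

For 9 summands the count falls 9, 6, 4, 3, 2, giving 4 steps and 7 pseudoadder passes; for 32 summands the sketch gives 8 steps and 30 passes. In both cases the number of passes is two less than the number of summands, while the number of steps grows only logarithmically.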
Successive steps may use the same set of pseudoadders (using progressively fewer of the set in each step) by using temporary storage registers for the outputs of the pseudoadders. However, if separate adders are used, the equipment cost is increased little, if at all, since the additional pseudoadders required will not need many more components than the flip-flop registers eliminated, and the control circuitry is greatly simplified. Further consideration of this approach will reveal that the purely combinational adder would have a considerable speed advantage.
In the simplest form of multiplication, there are as many summands as multiplier digits, each either 0 or 1 times the multiplicand. A wide range of schemes involving recoding the multiplier into a new (possibly redundant) form using some negative digits have been developed to reduce the number of summands. Since all summands are to be generated simultaneously, and then summed very quickly, it is desirable that the recoding scheme used should (1) require only multiples of the multiplicand obtainable by shifting and complementing, and (2) be a local recoding in which each recoded digit depends only on a small group of original multiplier digits. The best system found gives base-four recoded multiplier digits which can be +2, +1, 0, -1 or -2, and each is determined entirely by three adjacent original binary multiplier digits. Considering the process as a base-four recoding, digits 0, 1, 2, 3 are recoded into digits 0, 1, -2, -1, respectively, if the next less significant original base-four digit is 0 or 1, and into 1, 2, -1, 0 if the next less significant original digit is 2 or 3. The number of summands is half the number of binary multiplier digits. Attempts to reduce the number further appear to require multiples not obtainable by shifting.
Some complications arise in the pseudoadder structure because of the negative multiplier digits, which, in a two's complement system, require correction digits to be added in. However, detailed examination shows that these problems are surmountable without loss of speed or undue circuit cost.
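The base-four recoding rule can be illustrated with a short Python sketch. It assumes a two's complement integer multiplier with an even number of bits rather than the binary fractions of the article, and it lets Python's signed integers absorb the negative multiples, so the correction digits mentioned above are not modeled. The names `recode_base4` and `multiply_recoded` are hypothetical.

```python
def recode_base4(multiplier, n_bits):
    """Recode an n_bits-wide two's complement multiplier (n_bits even) into
    base-four digits drawn from -2..+2, least significant digit first."""
    digits = []
    below = 0  # binary digit just below the current pair (0 below the lowest pair)
    for i in range(0, n_bits, 2):
        low = (multiplier >> i) & 1
        high = (multiplier >> (i + 1)) & 1
        # Three adjacent binary digits (high, low, below) fix the recoded digit.
        digits.append(low + below - 2 * high)
        below = high
    return digits


def multiply_recoded(multiplicand, multiplier, n_bits):
    """Sum one summand per recoded digit; each is 0, +/-1 or +/-2 times the
    multiplicand, shifted two binary places per digit position."""
    return sum(d * multiplicand * 4 ** k
               for k, d in enumerate(recode_base4(multiplier, n_bits)))
```

For a 4-bit multiplier of 6 (binary 0110) the recoded digits are -2 and +2, which represent -2 + 2 x 4 = 6 using two summands instead of four.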
Additional schemes for parallel multipliers, which are based on different principles and have some advantages over those presented above by C. S. Wallace, have been proposed by L. Dadda in his article entitled "Some Schemes for Parallel Multipliers", published in ALTA FREQUENZA in March 1965 at pages 349 to 356.
In his conclusions, Mr. L. Dadda suggests that, having established the possibility of a parallel digital multiplier, some considerations can now be made about the important aspects of speed and cost that can be encountered in a practical design.
He suggests that if one assumes that a third of all arithmetic operations in scientific computers are multiplications and that these, at present, take about four times as long as additions, the use of a fast multiplier allowing a multiplication in a memory cycle time, would approximately double the speed of computation.
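The "approximately double" figure follows from a simple weighted average. The Python sketch below reproduces it under two assumptions that are not stated in the article: every non-multiply operation costs one addition time, and one memory cycle equals one addition time. The function name `speedup` is hypothetical.

```python
from fractions import Fraction


def speedup(mult_fraction, mult_cost, fast_mult_cost=1, other_cost=1):
    """Ratio of average operation time before and after adopting a fast multiplier.
    Costs are in units of one addition time (assumed equal to one memory cycle)."""
    before = (1 - mult_fraction) * other_cost + mult_fraction * mult_cost
    after = (1 - mult_fraction) * other_cost + mult_fraction * fast_mult_cost
    return before / after
```

With one third of operations being multiplications at four addition times each, the average operation time is 2/3 + 4/3 = 2 units before and 2/3 + 1/3 = 1 unit after, so `speedup(Fraction(1, 3), 4)` evaluates to 2.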
There is therefore a chance that a parallel multiplier could become a convenient means to improve the value of a computer, owing to the fact that its cost can be shown to be only a few percent of the total computer cost.
Let us first note that the total multiplication time is composed of two parts: the first is the time elapsed from the application of the signals representing the two factors to the inputs of the multiplier, to the availability of the inputs to the carry-propagating adder; the second part is the delay proper of the adder, mainly consisting in the carry propagation delay.
In the design of a practical multiplier, one can assume as a goal to obtain a total delay equal to or less than the cycle time of the high-speed memory, so that the computer can work at its maximum speed, limited only by the memory speed. The choice of the type of circuits depends therefore on the memory cycle time of present day computers. The increasing use of large scale integration appears to suggest it as the prime path for multipliers as well.
Among the different types of full adders, the most suitable for the application in parallel multipliers, from the point of view of economy and speed, are those which require input variables of one form only (natural or complemented), so that output variables of the same form only must be generated. If such a condition is satisfied, outputs of one stage can be used directly as inputs to the next stages, without the need of inverters, leading in general to a considerable saving in components and to a reduction of stage delay.
It must be noted that the above restriction can be partially relieved by allowing the use of counters producing outputs of only one form but different from the input's form.
Another similar multiplier circuit is disclosed in co-pending U.S. patent application Ser. No. 387,644, now U.S. Pat. No. 4,523,210 entitled "Fast Error Checked Multibit Multiplier", by Glen R. Kregness filed June 8, 1982 and assigned to the present assignee. The differences and improvements in the present application will be more clearly understood as one reads and understands the present description. However, generally that earlier application is primarily concerned with the use of a gated carry/save network. Through the use of that network, the need for the decoding of the multiplier bits was eliminated. Although the array itself is increased in complexity, the multiplexers required by earlier systems have been eliminated. In addition, interaction between multiplier groups is no longer required. The net effect is a reduction in logic with faster operation because of the elimination of the decoding.
Even though it is well known that many important features of today's fast computers depend on the fact that, during operations that last longer than one memory cycle (typically multiplication or division), memory can be made available for other operations (e.g., input-output), it is still desirable to process in parallel. Thus, an approach for the construction of an LSI multiplier will now be described which provides a unique logic block for use in creating a multiplication pipeline of arbitrary size.