The present invention relates to a multiplier and a method for multiplication, and more particularly to a multiplier and a multiplication method suitable for a vector processor which produces an operation result in each machine cycle.
A multiplier which overlap-scans the bits of a multiplier to generate multiples of a multiplicand is known from "The IBM System/360 Model 91: Floating-Point Execution Unit" by Anderson et al., IBM Journal, pages 34-53, January 1967. In this multiplier, the multiplier is divided into portions, and multiples of the multiplicand are generated from one portion in each iteration. The half-carry and the half-sum are shifted right by a number of bits equal to the number of multiplier bits used in the multiple generation, and are then added to the next multiple. The half-carry and half-sum bits which spill from the carry-save adder tree as a result of the right shift are summed by a low order bit adder. When all of the multiplier bits have been consumed by the multiple generation, the half-carries and the half-sums are summed by a carry propagation adder to produce the product. The two outputs of a carry-save adder are commonly known as the carry and the sum. In the following description, they are called the half-carry and the half-sum, respectively.
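The overlapped scanning described above can be illustrated by a minimal Python sketch. It assumes the common radix-4 form, in which each group of three multiplier bits (two new bits plus one bit shared with the neighboring lower-order group) selects -2, -1, 0, +1 or +2 times the multiplicand. The function names `overlap_scan` and `multiply_by_scan` are hypothetical, bit positions in the code count from the least significant bit, and the code is not taken from the Anderson article.

```python
def overlap_scan(multiplier: int, nbits: int) -> list[int]:
    """Recode an unsigned nbits-wide multiplier into digits in {-2, ..., +2}.

    Digits are returned least significant first; digit j has weight 4**j.
    """
    if multiplier < 0 or multiplier >> nbits:
        raise ValueError("multiplier must fit in nbits as an unsigned value")
    nbits += nbits % 2              # pad to an even width
    padded = multiplier << 1        # implied 0 below the lowest bit
    digits = []
    # One extra group at the top absorbs the highest bit of an unsigned value.
    for i in range(0, nbits + 2, 2):
        low = (padded >> i) & 1         # bit shared with the lower-order group
        mid = (padded >> (i + 1)) & 1
        high = (padded >> (i + 2)) & 1
        digits.append(mid + low - 2 * high)
    return digits


def multiply_by_scan(multiplicand: int, multiplier: int, nbits: int) -> int:
    """Sum the selected multiples, each shifted to the weight of its digit."""
    return sum(d * multiplicand * 4**j
               for j, d in enumerate(overlap_scan(multiplier, nbits)))
```

Because every digit needs only a shift and an optional negation of the multiplicand, a 16-bit portion of the multiplier yields roughly half as many multiples as it has bits, which is what makes the scheme attractive for a carry-save adder tree.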
In recent vector processors, the so-called supercomputers, the multiplier is frequently required to produce one product in each machine cycle. The multiplier described in the above article, or a similar multiplier in which the multiplier bits are divided and the divided bits are supplied over several cycles to a common multiple generator that is used iteratively to produce the final product, cannot produce one product in each machine cycle.
U.S. patent application Ser. No. 653,053, which discloses a vector multiplier capable of producing one product in each machine cycle, mainly discusses a sign generation system of a carry-save adder (CSA) and omits explanation of those portions which are not directly related to the sign generation of the carry-save adder. FIG. 1 shows those portions which are related to the operation of the vector multiplier and omits those which are not directly related to it.
In FIG. 1, numeral 1 denotes a multiplicand register and numerals 2 to 4 denote multiplicand delay registers which hold the content of the multiplicand register with predetermined time delays. Each of the registers 1 to 4 has a 64-bit length. Numeral 5 denotes a multiplier register and numerals 6 to 8 denote multiplier delay registers which hold the content of the multiplier register with predetermined time delays. The registers 5 to 8 have 64-bit, 49-bit, 33-bit and 17-bit lengths, respectively. Numerals 9 to 12 denote CSA trees. Numerals 13 to 18 denote CSA's (carry-save adders), numerals 19, 21, 23 and 25 denote half-carry registers (HC), numerals 20, 22, 24 and 26 denote half-sum registers (HS), numerals 27 to 29 denote spill adders (SPA), numerals 30 to 33 denote spilled bit sum registers (SPAL), numeral 34 denotes a carry propagation adder (CPA) and numeral 35 denotes a carry propagation sum register (CPAL). FIG. 2 shows the CSA tree in detail. Numerals 201 to 208 denote multiple generators, and numerals 209 to 214 denote carry-save adders. The operation of the CSA tree is described in the Anderson article and in U.S. Ser. No. 653,053, and hence it is not explained here.
The operation of the vector multiplier shown in FIG. 1 is now explained with reference to the time chart of FIG. 3, in which Ai and Bi denote the i-th multiplicand and the i-th multiplier, respectively.
First, A1 and B1 are set in a first machine cycle into the multiplicand register (MCAND-1) 1 and the multiplier register (MPLIR-1) 5, respectively. Then, A1 (64 bits) is multiplied by bits 48 to 63 (16 bits) of B1 in the CSA tree 9. The mutually overlapping bit groups of the multiplier, namely bits 48-50, bits 50-52, bits 52-54 and so on, are applied to the second inputs of the multiple generators 201-208 of FIG. 2, and the multiplicand is applied to the first inputs. Each of the multiple generators generates a multiple and sends it to one of the CSA's 209-214 of FIG. 2, where the multiples are carry-save added. The resulting half-carry and half-sum are set into the half-carry register (HC-1) 19 and the half-sum register (HS-1) 20, respectively, while all 64 bits of A1 are set into the multiplicand delay register (MCAND-2) 2 and the high order 49 bits of B1 are set into the multiplier delay register (MPLIR-2) 6. At the same time, the next vector elements A2 and B2 are set into the multiplicand register (MCAND-1) 1 and the multiplier register (MPLIR-1) 5, respectively, as shown in FIG. 3, and the operation carried out for A1 and B1 in the previous machine cycle is carried out for A2 and B2.
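The carry-save addition applied to the multiples can be sketched as follows. This is a minimal Python model, not the circuit of FIG. 2: the names `carry_save_add` and `reduce_multiples` are hypothetical, the order in which operands are combined is an arbitrary choice rather than the wiring of the CSA's 209-214, and Python's unbounded integers stand in for sign-extended hardware words. Each 3-input carry-save adder replaces three operands by two, so reducing eight multiples to one half-carry and one half-sum takes six such adders.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """Return (half_carry, half_sum) with half_carry + half_sum == x + y + z."""
    half_sum = x ^ y ^ z                              # per-bit sum, no carry propagation
    half_carry = ((x & y) | (y & z) | (x & z)) << 1   # per-bit majority, moved up one place
    return half_carry, half_sum


def reduce_multiples(multiples: list[int]) -> tuple[int, int, int]:
    """Reduce shifted multiples to (half_carry, half_sum, adders_used)."""
    operands = list(multiples)
    while len(operands) < 2:        # pad so that two values always remain
        operands.append(0)
    adders_used = 0
    while len(operands) > 2:
        x, y, z = operands[:3]
        operands = operands[3:] + list(carry_save_add(x, y, z))
        adders_used += 1
    half_carry, half_sum = operands
    return half_carry, half_sum, adders_used
```

No carry ripples through a word in this loop; the only full-width carry propagation is the single addition of the final half-carry and half-sum.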
Then, in the CSA tree 10, the 64 bits of A1 are multiplied by bits 32 to 48 (17 bits) of B1. The half-carry and the half-sum of the CSA tree 10 are added by the CSA's 13 and 14 to the partial product of the 64 bits of A1 and the low order 16 bits of B1, and the results are set into the half-carry register (HC-2) 21 and the half-sum register (HS-2) 22, respectively. The low order 16 bits which were spilled in the shifting for summation are summed by the spill adder (SPA) 27, and the sum is set into the spilled bit sum register (SPAL-1) 30. At the same time, the 64 bits of A1 are set into the multiplicand delay register (MCAND-3) 3 and the high order 33 bits of B1 are set into the multiplier delay register (MPLIR-3) 7.
In the CSA tree 11, A1 is multiplied by bits 16 to 32 (17 bits) of B1. The half-carry and the half-sum of the CSA tree 11 are added by the CSA's 15 and 16 to the partial product of the 64 bits of A1 and the low order 32 bits of B1, and the results are set into the half-carry register (HC-3) 23 and the half-sum register (HS-3) 24, respectively. The low order 16 bits which were spilled in the shifting for generating the sum are summed in the SPA 28, to which the latched carry from the SPA 27 is applied as a carry from the low order, although this is not shown in FIG. 1 to avoid complexity. The output of the SPA 28 is combined with the data in the SPAL-1 30, and the result is set into the SPAL-2 31. At the same time, the 64 bits of A1 are set into the multiplicand delay register (MCAND-4) 4 and the high order 17 bits of B1 are set into the multiplier delay register (MPLIR-4) 8.
In the CSA tree 12, A1 is multiplied by bits 0 to 16 (17 bits) of B1. The half-carry and the half-sum of the CSA tree 12 are added by the CSA's 17 and 18 to the partial product of the 64 bits of A1 and the low order 48 bits of B1, and the results are set into the half-carry register (HC-4) 25 and the half-sum register (HS-4) 26, respectively. The low order 16 bits which were spilled in the shifting for generating the sum are summed in the SPA 29, to which the latched carry from the SPA 28 is applied as a carry from the low order, although this is not shown in FIG. 1 to avoid complexity. The output of the SPA 29 is combined with the data in the SPAL-2 31, and the result is set into the SPAL-3 32.
The half-carry in the HC-4 25 and the half-sum in the HS-4 26 are summed in the carry propagation adder (CPA) 34, to which the latched carry from the SPA 29 is applied as a carry from the low order. The resulting sum is set into the CPAL 35, and the data in the SPAL-3 32 is transferred to the SPAL-4 33.
In this manner, the 64 × 64-bit product of A1 and B1 is produced in the CPAL 35 and the SPAL-4 33.
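The data flow through the four stages and the final carry propagation addition can be checked with a simplified Python model. Several simplifications are assumptions of this sketch and not part of FIG. 1: the multiplier is cut into plain unsigned 16-bit groups instead of the overlapped 17-bit groups, each half-carry/half-sum pair is collapsed into a single integer, the operands are treated as unsigned, and the names `staged_multiply` and `assemble_product` are hypothetical. Under these assumptions three 16-bit spills accumulate into a 48-bit low-order part, and the remaining high-order part plays the role of the carry propagation sum.

```python
CHUNK_BITS = 16
CHUNK_MASK = (1 << CHUNK_BITS) - 1
STAGES = 4                      # one stage per 16-bit group of a 64-bit multiplier


def staged_multiply(a: int, b: int) -> tuple[int, int]:
    """Return (high_part, spilled_low_part) of the product a * b.

    a and b are unsigned 64-bit values; groups of b are consumed low order first.
    """
    chunks = [(b >> (CHUNK_BITS * k)) & CHUNK_MASK for k in range(STAGES)]
    acc = a * chunks[0]         # first stage: nothing to accumulate yet
    spilled = 0
    for stage, chunk in enumerate(chunks[1:], start=1):
        # The low 16 bits leave the main data path and are collected separately.
        spilled |= (acc & CHUNK_MASK) << (CHUNK_BITS * (stage - 1))
        # The rest is shifted right and added to the next partial product.
        acc = (acc >> CHUNK_BITS) + a * chunk
    return acc, spilled


def assemble_product(high_part: int, spilled_low_part: int) -> int:
    """Concatenate the high part with the 48 spilled low-order bits."""
    return (high_part << (CHUNK_BITS * (STAGES - 1))) | spilled_low_part
```

For example, `assemble_product(*staged_multiply(a, b))` equals `a * b` for any unsigned 64-bit `a` and `b`. In the hardware each loop iteration is a separate stage with its own registers, so a new operand pair can enter the first stage while earlier pairs occupy the later ones.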
However, in a vector multiplier which divides the multiplier bits and produces a product in each machine cycle by essentially serially connecting the CSA trees as shown in FIG. 1, multiplicand and multiplier delay registers are required, and the proportion of the multiplier occupied by the delay registers is not negligible. In addition, the data travel time from the setting of the multiplicand and the multiplier into the operand registers to the output of the product is long.
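The register cost can be tallied from the widths given for FIG. 1. The short Python sketch below only adds up the stated lengths of the operand registers 1 and 5 and of the delay registers 2 to 4 and 6 to 8; the constant and function names are hypothetical, and the half-carry, half-sum and spill registers are left out of the count.

```python
MULTIPLICAND_DELAY_BITS = [64, 64, 64]   # delay registers 2, 3 and 4
MULTIPLIER_DELAY_BITS = [49, 33, 17]     # delay registers 6, 7 and 8
OPERAND_REGISTER_BITS = 64 + 64          # registers 1 and 5


def delay_register_bits() -> int:
    """Total number of bits held only to delay the operands."""
    return sum(MULTIPLICAND_DELAY_BITS) + sum(MULTIPLIER_DELAY_BITS)


def delay_to_operand_ratio() -> float:
    """Delay-register bits relative to the bits of the two operand registers."""
    return delay_register_bits() / OPERAND_REGISTER_BITS
```

With these widths the delay registers hold 291 bits, more than twice the 128 bits of the two operand registers.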