This invention relates to the field of data processing systems. More particularly, this invention relates to data processing systems of the type that perform multiplication operations.
There are a number of ways in which multiplication of two W bit numbers M and N may be performed. For example, a W×W bit multiplier may be provided for producing the multiplication result M×N directly. However, the larger the multiplier circuit is, then generally the more power and circuit area it will consume, and accordingly in applications where reduction in power and circuit area are of importance, it is known to provide a W×W/2 multiplier that can perform two separate multiplications which are then summed together to produce the result M×N. Hence, the multiplication M×N is performed as follows:
Mlower×N + Mupper×N
In the above equation, Mlower indicates the least significant W/2 bits of M, whereas Mupper indicates the most significant W/2 bits of M. The first multiplication above will be referred to as the lower multiplication, whilst the second multiplication will be referred to as the upper multiplication.
Both of the above multiplications produce a result which is 3W/2 bits wide, but the upper multiplication result is shifted to be W/2 bits more significant than the lower multiplication result. Hence, when the two multiplication results are added, the final multiplication result will be 2W bits wide as indicated below:
  Lower Product          A  A  A
+ Upper Product       B  B  B
                      C  C  C  C
(where each capital letter represents a W/2 bit number)
Such a multiplication is typically achieved by calculating the lower multiplication result Mlower×N first, and then recirculating part of the result for accumulation into the upper multiplication Mupper×N. It should be noted that the least significant W/2 bits of the final multiplication result are identical to the least significant W/2 bits of the lower product, but the same does not apply for the most significant W/2 bits of the final multiplication result when compared with the most significant W/2 bits of the upper product, because a carry may propagate up the chain.
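By way of illustration only, the lower-first decomposition described above can be sketched in Python. The function name `split_multiply_lower_first` and the choice of W = 32 are assumptions made for this example and do not appear in the description; the sketch models only the arithmetic, not the circuit.

```python
def split_multiply_lower_first(m: int, n: int, w: int = 32) -> int:
    """Form m * n from two (W/2 x W) products, lower product first."""
    h = w // 2
    half_mask = (1 << h) - 1
    m_lower = m & half_mask      # least significant W/2 bits of M
    m_upper = m >> h             # most significant W/2 bits of M
    lower = m_lower * n          # lower product, fits in 3W/2 bits
    upper = m_upper * n          # upper product, fits in 3W/2 bits
    # The upper product is W/2 bits more significant than the lower product.
    return lower + (upper << h)


m, n = 0xFFFF_FFFF, 0xFFFF_FFFF
result = split_multiply_lower_first(m, n)
```

The least significant W/2 bits of `result` depend only on `lower`, which is why the lower-first approach can emit them early, whereas the upper bits cannot be taken directly from `upper` because of carries out of the lower product.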
Although the upper and lower products are shown offset from each other by W/2 bits, they are produced in the same bit positions in the final adder of the multiplier. This has the consequence that the least significant W/2 bits of the final multiplication result, which are available once the lower product has been calculated, must be stored immediately, because they will be overwritten by the rest of the multiplication result after the upper product has been calculated and added to the relevant bits of the lower product. Extra logic then needs to be provided to produce the final multiplication result from the two intermediate results, i.e. the previously stored least significant W/2 bits of the multiplication result and the rest of the multiplication result subsequently output by the final adder of the multiplier. In addition, further logic is also required to allow full carry propagation when performing an accumulation of the two separate multiplication results as discussed above.
Generally, it is desirable to reduce power consumption and circuit complexity wherever possible, and accordingly it would be desirable to provide a technique which enables two W bit data words to be multiplied together using a multiplying circuit that is arranged to perform a multiplication of a W/2 bit data value by a W bit data value whilst enabling reduction in the power consumption and complexity of the multiplying circuit in relation to the above discussed prior art.
Viewed from a first aspect, the present invention provides apparatus for processing data, said apparatus comprising: a multiplying circuit for performing a multiplication of a W/2 bit data value by a W bit data value; an instruction decoder responsive to a multiply instruction to control said multiplying circuit to generate a multiplication result for the computation M×N, where M and N are W bit data words, the multiplying circuit being arranged to execute a first operation in which the data word N is multiplied by the most significant W/2 bits of the data word M to generate a first intermediate result having 3W/2 bits, and to then execute a second operation in which the data word N is multiplied by the least significant W/2 bits of the data word M to generate a second intermediate result having 3W/2 bits, the first intermediate result being shifted by W/2 with respect to the second intermediate result and added to the second intermediate result to generate the multiplication result.
In accordance with the present invention, a multiply instruction is provided which causes the multiplying circuit to perform the two constituent multiplication operations in reverse order to that performed in the earlier-described prior art approach. Since the first operation is used to multiply the data word N with the most significant W/2 bits of the data word M, this first operation will not directly produce any bits of the multiplication result, and accordingly any final adder circuitry provided within the multiplying circuit can be turned off when the first operation is executing, thereby reducing power consumption. Further, since none of the bits of the multiplication result are produced by the first operation, the multiplying circuit will not output any bits after execution of the first operation which require storing, and further there is no need for any extra logic as was required in the prior art approach to concatenate a data value output after execution of the first operation with a data value produced in a subsequent operation.
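As a minimal sketch of the reversed ordering, assuming W = 32 and a hypothetical function name `multiply_upper_first`, the two operations may be modelled as follows. The sketch is arithmetic only; the final adder, its power gating and the pipeline are not modelled.

```python
def multiply_upper_first(m: int, n: int, w: int = 32) -> int:
    """Form m * n with the upper product computed first."""
    h = w // 2
    half_mask = (1 << h) - 1

    # First operation: N times the most significant W/2 bits of M.
    # Every bit of this value sits at weight 2**(W/2) or above and can still
    # be altered by a carry from the lower product, so nothing is output yet.
    first = (m >> h) * n

    # Second operation: N times the least significant W/2 bits of M.
    second = (m & half_mask) * n

    # The first intermediate result is shifted by W/2 and added to the second.
    return second + (first << h)
```

Because no result bits are final after the first operation in this model, there is nothing to store between the two operations, which mirrors the reasoning given above.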
The prior art approach, whereby the least significant W/2 bits of the multiplicand are multiplied by the multiplier, and then the upper W/2 bits of the multiplicand are multiplied by the multiplier, with the appropriately shifted results then being summed to produce the final multiplication result, is the most intuitive approach, as it appears in keeping with the requirement to propagate a carry from the least significant bit to the most significant bit where necessary. Further, this prior art approach would appear to provide good processing speed in certain instances, since considering the example where a 2W bit result is to be produced, the least significant W/2 bits of the result are generated from the lower product and the remaining 3W/2 bits are generated from the upper product, i.e. only two operations seem necessary.
However, in practice, the perceived speed of the prior art approach is often adversely affected, since, for example, the register bank into which the result needs to be placed may comprise W bit registers, and may only have one write port. In such situations it takes two cycles to write to the register bank the 3W/2 bits of the result produced by the upper product.
In contrast to the prior art approach, the approach of the present invention, whereby the two operations are reversed, is entirely counterintuitive, but has been found to produce the above-described surprising benefits of reducing the overall complexity of the data processing apparatus, and facilitating reduction in power consumption.
In accordance with a first embodiment, the multiply instruction specifies a W bit multiplication result, and the second operation is further arranged to cause the multiplying circuit to sum the least significant W bits of the first and second intermediate result to generate a third intermediate result having 3W/2 bits, the multiplication result being given by the least significant W bits of the third intermediate result. In accordance with this embodiment, the W bit multiplication result is produced in one go at the end of the second operation. It will be seen that when compared with the standard prior art approach, where the least significant W/2 bits are produced after execution of the first operation, the most significant W/2 bits are produced after execution of the second operation, and then extra logic is provided to concatenate together the two separate parts of the results, the technique of the preferred embodiment of the present invention enables the complexity of the data processing apparatus to be significantly reduced, by avoiding the need for such extra logic. Further, as mentioned earlier, since no part of the multiplication result is output by the multiplying circuit after the first operation, any final adder circuitry within the multiplying circuit can be turned off during execution of the first operation, thereby conserving power.
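The W bit result case can be sketched as below, assuming W = 32. The name `multiply_w_bit_result` is hypothetical, and truncating the third intermediate value to 3W/2 bits is an assumption made to match the width stated above; it does not affect the least significant W bits.

```python
def multiply_w_bit_result(m: int, n: int, w: int = 32) -> int:
    """Return the least significant W bits of m * n, upper product first."""
    h = w // 2
    half_mask = (1 << h) - 1
    word_mask = (1 << w) - 1
    three_half_mask = (1 << (3 * h)) - 1

    first = (m >> h) * n           # first intermediate result
    second = (m & half_mask) * n   # second intermediate result

    # Least significant W bits of the first result, shifted by W/2, are summed
    # with the second result; the sum is kept to 3W/2 bits (assumption).
    third = (((first & word_mask) << h) + second) & three_half_mask

    # The whole W bit result is available in one go after the second operation.
    return third & word_mask
```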
In accordance with a second embodiment of the present invention, the multiply instruction specifies a 2W bit multiplication result, the second operation is further arranged to cause the multiplying circuit to sum the least significant W bits of the first and second intermediate result to generate a third intermediate result having 3W/2 bits, and the multiplying circuit is further arranged to execute a third operation in which the most significant W bits of the third intermediate result and the most significant W/2 bits of the first intermediate result are summed to generate a fourth intermediate result having 3W/2 bits, the multiplication result being given by the least significant W bits of the third intermediate result and the most significant W bits of the fourth intermediate result.
Hence, in preferred embodiments, to produce a 2W bit multiplication result, three separate operations are required, the least significant W bits of the multiplication result being available after execution of the second operation, and the most significant W bits of the multiplication result being available after execution of the third operation. However, as mentioned earlier, the multiplying circuit does not output any data value when executing the first operation, and accordingly any final adder circuitry within the multiplying circuit can be turned off when executing the first operation.
Further, in preferred embodiments, the complexity is also reduced, since the result is written to two W bit registers, the least significant W bits being generated from the third intermediate result, and the most significant W bits being generated from the fourth intermediate result. This should be contrasted with the prior art approach where extra logic is needed to concatenate the least significant W/2 bits of the result with the next W/2 bits of the result generated by the subsequent operation, prior to the value being written to a W bit register.
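The three-operation sequence for a 2W bit result can be sketched as follows, assuming W = 32. The function name `multiply_2w_bit_result` is hypothetical. One further assumption is labelled in the code: the sketch retains any carry out of the third intermediate value above 3W/2 bits, since dropping it would lose part of the product; how an actual circuit carries that bit forward is not modelled.

```python
def multiply_2w_bit_result(m: int, n: int, w: int = 32):
    """Return (high_word, low_word) of the 2W bit product m * n."""
    h = w // 2
    half_mask = (1 << h) - 1
    word_mask = (1 << w) - 1

    # First operation: upper product, nothing output.
    first = (m >> h) * n
    # Second operation: lower product plus shifted low W bits of `first`.
    second = (m & half_mask) * n
    third = ((first & word_mask) << h) + second   # assumption: carry-out kept
    low_word = third & word_mask                  # low W bits of the result

    # Third operation: the bits of `third` above its least significant W/2
    # bits are summed with the most significant W/2 bits of `first`, which
    # sit W bits higher within the fourth intermediate value.
    fourth = (third >> h) + ((first >> w) << w)
    high_word = (fourth >> h) & word_mask         # top W bits of `fourth`

    return high_word, low_word


high, low = multiply_2w_bit_result(0xFFFF_FFFF, 0xFFFF_FFFF)
```

Each returned word is W bits wide, so in this model each can be written to a W bit register without concatenating values from different operations.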
It will be appreciated that the data words required by the multiplying circuit may be provided from any appropriate storage. However, in preferred embodiments, the apparatus further comprises: a register bank containing a plurality of registers for storing data words required by the multiplying circuit; wherein the multiplying circuit is a pipelined circuit comprising a partial product generating circuit provided in a first pipelined stage and an adder circuit provided in one or more subsequent pipelined stages for adding partial product values, wherein data words required for an operation at a particular pipelined stage are read from the register bank by the multiplying circuit before that operation enters that pipelined stage.
The use of a pipelined circuit provides a particularly efficient technique for executing the various operations that need to be performed by the multiplying circuit, whilst the use of a register bank provides a particularly efficient mechanism for making the data words available for the multiplying circuit as and when required.
Whilst the above described approach of preferred embodiments provides significant benefits over the prior art approach when solely performing a multiplication of two data words M and N, the benefits are particularly marked when performing multiply-accumulate operations. Accordingly, in preferred embodiments, the multiplying circuit is a multiply-accumulate circuit, and said multiply instruction is a multiply-accumulate instruction specifying at least one W bit accumulate data word O in addition to the data words M and N, the instruction decoder being responsive to the multiply-accumulate instruction to control said multiply-accumulate circuit to generate a multiply-accumulate result for the computation M×N+O, the multiply-accumulate circuit being arranged to execute the first operation to generate the first intermediate result having 3W/2 bits, and the second operation being further arranged to incorporate summation of the at least one accumulate data word O with the result of the multiplication of the data word N by the least significant W/2 bits of the data word M to generate a second intermediate result having 3W/2 bits, the first intermediate result being shifted by W/2 with respect to the second intermediate result and added to the second intermediate result to generate the multiply-accumulate result.
By the above approach, the accumulate data word O is not required until the second operation, and accordingly this provides additional time to prepare the accumulate data word O for inclusion in the multiply-accumulate operation. In certain implementations, this extra time can be particularly valuable, and can avoid the performance of the multiply-accumulate circuit being adversely affected by the need to include stall cycles whilst waiting for the accumulate data word O. For example, multiply instructions with accumulate are often used back-to-back, i.e. the next instruction uses the result of the previous instruction as its accumulate data word. With a pipelined processor, this can cause stall cycles to be inserted since, when using the prior art technique, the next instruction must wait for the previous instruction to complete before it can start, thereby reducing performance. However, in accordance with preferred embodiments of the present invention, where the multiplication is effectively performed in reverse, the accumulate data word is not actually required for the first operation, and hence the next instruction can actually begin before the previous instruction has completed, thereby enabling performance to be increased.
In a first embodiment, the multiply-accumulate instruction specifies a W bit multiply-accumulate result, and the second operation is further arranged to cause the multiply-accumulate circuit to sum the least significant W bits of the first and second intermediate result to generate a third intermediate result having 3W/2 bits, the multiply-accumulate result being given by the least significant W bits of the third intermediate result. Hence, as discussed earlier, the W bit multiply-accumulate result is produced in one go after completion of the second operation, thereby enabling the complexity of the circuitry to be reduced.
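A minimal sketch of the W bit multiply-accumulate case, assuming W = 32, is given below. The names `mac_w_bit`, `pairs` and `acc` are hypothetical. The loop illustrates back-to-back use, where each result becomes the accumulate word of the next instruction; because the accumulate word is consumed only in the second operation, the first operation of the next instruction does not depend on it. Pipeline timing itself is not modelled.

```python
def mac_w_bit(m: int, n: int, o: int, w: int = 32) -> int:
    """Return the least significant W bits of m * n + o, upper product first."""
    h = w // 2
    half_mask = (1 << h) - 1
    word_mask = (1 << w) - 1

    # First operation: the accumulate word O is not needed here.
    first = (m >> h) * n
    # Second operation: O is summed with the lower product.
    second = (m & half_mask) * n + o
    third = ((first & word_mask) << h) + second
    return third & word_mask


# Back-to-back accumulation: each result feeds the next instruction.
pairs = [(3, 5), (0xFFFF_FFFF, 2), (0x1234_5678, 0x9ABC_DEF0)]
acc = 0
for m, n in pairs:
    acc = mac_w_bit(m, n, acc)
```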
In accordance with the second embodiment, the multiply-accumulate instruction specifies a 2W bit multiply-accumulate result, the second operation is further arranged to cause the multiply-accumulate circuit to sum the least significant W bits of the first and second intermediate result to generate a third intermediate result having 3W/2 bits, and the multiply-accumulate circuit is further arranged to execute a third operation in which the most significant W bits of the third intermediate result and the most significant W/2 bits of the first intermediate result are summed to generate a fourth intermediate result having 3W/2 bits, the multiply-accumulate result being given by the least significant W bits of the third intermediate result and the most significant W bits of the fourth intermediate result.
It will be appreciated that when the multiply-accumulate instruction specifies a 2W bit multiply-accumulate result, there is no requirement that any accumulate data words are only W bits in length. Accordingly in one embodiment, the multiply-accumulate instruction specifies a 2W bit accumulate data value in two data words O and P, where data word O represents the most significant W bits of the accumulate data value and data word P represents the least significant W bits of the accumulate data value, the summation of data word O into the multiplication being performed by the first operation, and the summation of data word P into the multiplication being performed by the second operation.
When executing such a multiply-accumulate instruction, the data word O representing the most significant W bits of the accumulate data value needs to be available for use by the first operation, whereas the data word P representing the least significant W bits of the accumulate data value is not required until the second operation is executed.
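The 2W bit accumulate case can be sketched as follows, assuming W = 32. The function name `mac_2w_accumulate` is hypothetical, and two modelling choices are assumptions rather than details taken from the description: data word O is added into the first operation at an offset of W/2 bits (so that it lands at weight 2^W once the first intermediate value is shifted by W/2), and the result wraps modulo 2^(2W).

```python
def mac_2w_accumulate(m: int, n: int, o: int, p: int, w: int = 32):
    """Return (high_word, low_word) of m * n + (o << W) + p, modulo 2**(2W)."""
    h = w // 2
    half_mask = (1 << h) - 1
    word_mask = (1 << w) - 1

    # First operation: upper product plus the high accumulate word O
    # (assumption: O is aligned W/2 bits up within this operation).
    first = (m >> h) * n + (o << h)
    # Second operation: lower product plus the low accumulate word P.
    second = (m & half_mask) * n + p
    third = ((first & word_mask) << h) + second   # assumption: carry-out kept
    low_word = third & word_mask

    # Third operation: upper bits of `third` plus the remaining bits of `first`.
    fourth = (third >> h) + ((first >> w) << w)
    high_word = (fourth >> h) & word_mask

    return high_word, low_word
```

In this model O must be available for the first operation, whereas P is only consumed by the second, in line with the paragraph above.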
It will be appreciated that there is no requirement for the multiply-accumulate instruction to only specify a single accumulate data value, but rather a plurality of accumulate data values may be specified. In accordance with one embodiment, the multiply-accumulate instruction specifies two W bit accumulate data words O and P, the summation of both accumulate data words into the multiplication being performed by the second operation. Accordingly, such a multiply-accumulate instruction specifies a computation M×N+O+P.
Typically, such multiply-accumulate instructions which specify more than one accumulate data value can cause the multiply-accumulate circuit to introduce stall cycles if the interface with the memory storing the input data values does not allow all of those data values to be output at one time.
In preferred embodiments, the data words required by the multiplication circuit are stored within a register bank containing a plurality of registers, and the multiply-accumulate circuit is a pipelined circuit comprising a partial product generating circuit provided in a first pipelined stage and an adder circuit provided in one or more subsequent pipelined stages for adding partial product and accumulate values, and wherein data words required for an operation at a particular pipelined stage are read from the register bank by the multiply-accumulate circuit before that operation enters that pipelined stage.
If the prior art multiplication approach were employed, all of the accumulate data values would be required for use in the first operation, and hence in effect all of the data words M, N, O and P would have to be read from the register bank before the first operation could be executed. However, given cost and complexity considerations, a typical register bank will only be provided with a relatively small number of read ports, and hence the multiply-accumulate circuit may not be able to read all of the required data words at the same time. This can cause stall cycles to be inserted if the typical prior art multiplication approach is used, thereby adversely affecting performance.
In preferred embodiments, the register bank has three read ports. However, since the accumulate data words are not actually required for the first operation, this constraint does not adversely affect performance. Instead, in accordance with preferred embodiments, the multiply-accumulate circuit is arranged to read the first accumulate data word O from the register bank before the first operation enters the one or more subsequent pipelined stages, and is arranged to read the second accumulate data word P from the register bank before the second operation enters the one or more subsequent pipelined stages, whereby both the accumulate data words O and P are available to the multiply-accumulate circuit when the second operation enters the one or more subsequent pipelined stages. Hence, by the time the second operation enters the one or more subsequent pipeline stages that are used for adding partial products and the accumulate values, both of the accumulate data words O and P are available.
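The effect of the read-port limit can be illustrated with a deliberately simple toy model. Everything in it is an assumption made for this sketch: the function name `count_stalls`, the rule that at most `ports` words are read per cycle, and the rule that one operation may start per cycle provided all the words it needs have already been read. It does not represent the actual pipeline timing.

```python
from math import ceil


def count_stalls(words_needed_per_op, ports: int = 3) -> int:
    """Count stall cycles given the new register words each operation needs."""
    start = 0        # cycle in which the most recent operation started
    cumulative = 0   # total words that must have been read so far
    for words in words_needed_per_op:
        cumulative += words
        # An operation starts no earlier than one cycle after the previous
        # one, and not before all words needed so far have been read.
        start = max(start + 1, ceil(cumulative / ports))
    return start - len(words_needed_per_op)


# Prior art ordering: M, N, O and P are all needed by the first operation.
prior_art_stalls = count_stalls([4, 0])
# Reversed ordering: M and N for the first operation, O and P for the second.
reversed_stalls = count_stalls([2, 2])
```

Under these assumptions the prior art ordering incurs one stall cycle with three read ports, whereas the reversed ordering incurs none, since the fourth word can be read while the first operation is in progress.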
In preferred embodiments, the first pipeline stage further includes a multiplexer for receiving the accumulate data words O and P from the register bank and the most significant W/2 bits of the first intermediate result, and being arranged, prior to the third operation entering the one or more subsequent pipelined stages, to output the most significant W/2 bits of the first intermediate result for use by the adder circuit in generating the fourth intermediate result. Accordingly, this multiplexer can be controlled to output appropriate values for inputting to the adder circuit, depending on the operation about to be executed by the adder circuit.
In accordance with preferred embodiments of the present invention, it is required that some shifting of the first intermediate result relative to the second intermediate result be performed prior to the two intermediate results being added together. In preferred embodiments, the apparatus further comprises a conditional shift circuit for receiving the intermediate result of a previous operation and for outputting either the least significant W bits of that intermediate result over left-shifted data paths into the adder circuit or the most significant W bits of that intermediate result over non-shifted data paths into the adder circuit. Hence, this conditional shift circuit can be arranged such that when the second operation is to be executed by the adder circuit, the first intermediate result is passed over shifted data paths into the adder circuit thereby enabling the second operation to be performed by the adder circuit. Equally, when a third operation is to be executed by the adder circuit, as required for a 2W bit result, the conditional shift circuit can be arranged to select non-shifted paths.
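The selection performed by the conditional shift circuit can be sketched as below, assuming W = 32 and a left shift of W/2 bits on the shifted paths. The names `conditional_shift` and `multiply_2w` are hypothetical, and the non-shifted path in this sketch also passes any carry bit held above the 3W/2 bit intermediate value, which is an assumption needed for the arithmetic to be exact.

```python
def conditional_shift(intermediate: int, use_shifted_paths: bool, w: int = 32) -> int:
    """Select the adder input derived from a previous intermediate result."""
    h = w // 2
    word_mask = (1 << w) - 1
    if use_shifted_paths:
        # Least significant W bits over left-shifted paths (shift of W/2).
        return (intermediate & word_mask) << h
    # Upper bits (everything above the lowest W/2) over non-shifted paths.
    return intermediate >> h


def multiply_2w(m: int, n: int, w: int = 32):
    """Use the selector in the second and third operations of m * n."""
    h = w // 2
    half_mask = (1 << h) - 1
    word_mask = (1 << w) - 1
    first = (m >> h) * n
    second = (m & half_mask) * n
    third = conditional_shift(first, True, w) + second              # second operation
    fourth = conditional_shift(third, False, w) + ((first >> w) << w)  # third operation
    return (fourth >> h) & word_mask, third & word_mask
```

The same selector thus serves both operations: shifted paths when the first intermediate result is folded into the second operation, and non-shifted paths when the third intermediate result is fed into the third operation.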
Whilst it is possible that the data word or data words representing the multiply-accumulate result could be written into registers entirely separate to those storing the input data words, in preferred embodiments of the present invention, the registers that store the input data words O and P also serve to store the data words of the multiply-accumulate result. This feature helps to reduce the bit space required for operand specification within the instruction.
It will be appreciated that W may be any appropriate value. However, in preferred embodiments, W=32, and accordingly the input data words are 32 bits in length.
Viewed from a second aspect, the present invention provides a method of processing data within a data processing apparatus having a multiplying circuit for performing a multiplication of a W/2 bit data value by a W bit data value, the method comprising the steps of: responsive to a multiply instruction, controlling said multiplying circuit to generate a multiplication result for the computation M×N, where M and N are W bit data words by: (i) executing a first operation in which the data word N is multiplied by the most significant W/2 bits of the data word M to generate a first intermediate result having 3W/2 bits; (ii) executing a second operation in which the data word N is multiplied by the least significant W/2 bits of the data word M to generate a second intermediate result having 3W/2 bits; and (iii) shifting the first intermediate result by W/2 with respect to the second intermediate result and adding it to the second intermediate result to generate the multiplication result.
Viewed from a third aspect, the present invention provides a computer program product carrying a computer program for controlling a data processing apparatus in accordance with the method of the second aspect of the present invention.