1. Field of the Invention
The present invention is related to the field of hardware multiplication circuits. In particular, the present invention is related to an apparatus for performing Montgomery multiplication.
2. Description of the Related Art
Modular arithmetic, especially modular exponentiation, is an integral part of cryptographic algorithms. To achieve optimal system performance while maintaining system security, modular exponentiation is often implemented in hardware. Traditional modular exponentiation relies on repeated modular multiplication. Montgomery multiplication is used to implement modular exponentiation because it is often more easily implemented in hardware.
Pseudocode for a Montgomery multiplier of radix 2 is as follows:
    P = 0;
    For i = 0 to N−1
    {
        If mplier[i] == 1
            P = P + mpcand      [1]
        If P is odd             [2]
            P = P − modulus     [3]
        P = P/2                 [4]
    }
In the Montgomery multiplier pseudocode, [1] is referred to as the multiplicand add and [3] is referred to as the modulus-add. As illustrated in [2] in the pseudocode, the test for implementing the modulus-add is whether the partial remainder P is odd. However, this is only true for a Montgomery multiplier of radix 2.
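The radix-2 pseudocode above can be sketched in software as follows. This is a minimal illustration, not the claimed hardware; the function name and the `n_bits` parameter (the N of the pseudocode) are assumptions introduced here, and the final reduction into the range 0 to modulus−1 is added only so the result is easy to check.

```python
def montgomery_multiply_radix2(mplier: int, mpcand: int, modulus: int, n_bits: int) -> int:
    """Return mplier * mpcand * 2**(-n_bits) mod modulus for an odd modulus.

    Assumes 0 <= mplier < 2**n_bits. The bracketed numbers match the pseudocode.
    """
    if modulus % 2 == 0:
        raise ValueError("modulus must be odd")
    p = 0
    for i in range(n_bits):
        if (mplier >> i) & 1:   # [1] multiplicand add
            p += mpcand
        if p & 1:               # [2] is the partial remainder odd?
            p -= modulus        # [3] modulus-add (an odd modulus makes P even)
        p >>= 1                 # [4] P is even here, so the division is exact
    # Final reduction is an addition for this sketch, not part of the pseudocode.
    return p % modulus
```

Because each pass divides by 2, the returned value carries an extra factor of 2^(−N), which is why the result times 2^N is congruent to the ordinary product modulo the modulus.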
For higher order radices the Montgomery multiplier pseudocode has to be modified. For example, for a Montgomery multiplier of radix 4, at [2] an integral number of the modulus may have to be added or subtracted such that, after the addition or subtraction the partial remainder P must be evenly divisible by the radix. Also, [4] has to be modified such that the partial remainder P is divided by the radix. In addition, the for-loop is modified to account for the higher order radix.
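The radix-4 modifications described above can be sketched as follows. This is an illustrative assumption-laden version: the function name and `n_bits` are hypothetical, the multiplier is consumed as plain two-bit digits (no Booth recoding), and the multiple of the modulus is chosen from {0, 1, 2, 3} using additions only, whereas a hardware design may prefer an equivalent signed set.

```python
def montgomery_multiply_radix4(mplier: int, mpcand: int, modulus: int, n_bits: int) -> int:
    """Return mplier * mpcand * 2**(-n_bits) mod modulus, two multiplier bits per pass.

    Assumes an odd modulus, an even n_bits, and 0 <= mplier < 2**n_bits.
    """
    if modulus % 2 == 0:
        raise ValueError("modulus must be odd")
    if n_bits % 2:
        raise ValueError("n_bits must be even for radix 4")
    # For any odd modulus, modulus * modulus == 1 (mod 4), so the inverse of the
    # modulus modulo 4 is simply its two low-order bits.
    m_inv4 = modulus & 3
    p = 0
    for i in range(0, n_bits, 2):        # loop modified: step by two bits
        digit = (mplier >> i) & 3
        p += digit * mpcand              # multiplicand add
        q = (-p * m_inv4) & 3            # multiple that makes P divisible by 4
        p += q * modulus                 # modulus-add
        p >>= 2                          # exact division by the radix
    return p % modulus
```

The digit q = 3 plays the same role as −1, since both leave the same remainder modulo 4.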
Implementing the Montgomery multiplier in hardware implies that the redundant form of values can be used so that add/subtract operations can be performed in constant time. For a Montgomery multiplier of radix 2, the test to determine whether the partial remainder P is odd is easily implemented in hardware (e.g., using one or more exclusive-OR gates).
Montgomery multiplication introduces an additional term (i.e., a radix r, where r = 2^n) in the multiplication such that 2^n > M, where M in cryptography is a key. In addition, at the start of a modular exponentiation, a modular multiplication (i.e., r^2 mod M) is performed. However, in cryptography, the result of this modular multiplication only changes when the key M is changed, whereas many Montgomery multiplications will be performed for a given value of M. Therefore, the system resources expended in performing the modular multiplication r^2 mod M are negligible.
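The role of r^2 mod M can be illustrated with a short software sketch of modular exponentiation built on Montgomery multiplication. The helper names `mont_mul` and `mont_exp` are hypothetical, and `mont_mul` is a simple radix-2 loop that adds the modulus when the partial remainder is odd; it is assumed that the modulus is odd and satisfies 2^n > M.

```python
def mont_mul(a: int, b: int, modulus: int, n_bits: int) -> int:
    """Return a * b * 2**(-n_bits) mod modulus (radix-2 loop, a < 2**n_bits)."""
    p = 0
    for i in range(n_bits):
        if (a >> i) & 1:
            p += b
        if p & 1:
            p += modulus
        p >>= 1
    return p % modulus


def mont_exp(base: int, exponent: int, modulus: int, n_bits: int) -> int:
    """Return base**exponent mod modulus using only Montgomery multiplications."""
    # r^2 mod M with r = 2**n_bits: depends only on M, so it can be cached per key.
    r2 = pow(2, 2 * n_bits, modulus)
    x = mont_mul(base % modulus, r2, modulus, n_bits)  # base * r mod M
    acc = mont_mul(1, r2, modulus, n_bits)             # r mod M, i.e. "1" in Montgomery form
    for bit in bin(exponent)[2:]:                      # square-and-multiply, MSB first
        acc = mont_mul(acc, acc, modulus, n_bits)
        if bit == "1":
            acc = mont_mul(acc, x, modulus, n_bits)
    return mont_mul(acc, 1, modulus, n_bits)           # strip the factor r
```

In this sketch the single `pow` call stands in for the one-time r^2 mod M computation; every other step is a Montgomery multiplication, which is why the one-time cost is small relative to the whole exponentiation.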
FIG. 1 illustrates a block diagram of a prior art embodiment of a Montgomery multiplier. As FIG. 1 illustrates, the Montgomery multiplier 100 comprises a first set of one or more multiplexers 105A–N. A multiplier bit 127 controls each multiplexer 105A–N, such that when the multiplier bit is a 0, a 0 is output from the multiplexer, and when the multiplier bit is a 1, a multiplicand bit is output. Therefore, each of the multiplexers 105A–N has at its inputs a 0 bit and a different multiplicand bit. For the embodiment of FIG. 1, each multiplexer 105A–N is effectively an AND gate.
Each multiplexer 105A–N has its output coupled to one input of a corresponding CSA in a first set of one or more carry save adders (CSAs) 110A–N. Each CSA has three inputs and two outputs. The other two inputs of each CSA in the first set of CSAs 110A–N are selectively coupled to the outputs of a set of flip-flops 125A–M. Thus, for each intermediate cycle through the Montgomery multiplier, the partial remainder present at the output of the flip-flops 125A–M is recirculated back into CSAs 110A–N respectively. The multiplexers 105A–N and the CSAs 110A–N implement the multiplicand add [1] of the Montgomery multiplier pseudocode.
The two outputs of each CSA 110A–N comprise a sum output and a carry output. The sum output of each CSA in the first set of CSAs 110A–N is coupled to an input of a corresponding CSA in a second set of one or more CSAs 120A–N. The carry output of each CSA in the first set of CSAs 110A–N (with the exception of 110A) is coupled to the other input of a different CSA in the second set of CSAs 120A–N, namely the CSA to the immediate left of the CSA to which the sum output is connected. Thus, the sum output of CSA 110B is connected to one input of CSA 120B, and the carry output of CSA 110B is connected to one input of CSA 120A.
A second set of one or more multiplexers 115A–N is coupled to the input of a corresponding CSA in the second set of CSAs 120A–N. Each multiplexer in the second set of multiplexers 115A–N is controlled by a quotient decision circuit 150. The quotient decision circuit inspects the partial remainder output from the rightmost CSA (e.g., CSA 110N) and determines whether the partial remainder of the quotient is odd as illustrated in [2] of the Montgomery multiplier pseudocode. If the partial remainder of the quotient is odd, a corresponding modulus bit is added to the outputs from the first set of CSAs 110A–N. Otherwise, a 0 bit is added to the outputs from the first set of CSAs. Thus, each multiplexer in the second set of multiplexers 115A–N has a 0 bit and a different modulus bit at its inputs, and inputs one of these bits to the corresponding CSA in the second set of CSAs 120A–N depending on the output from the quotient decision circuit 150. In this way, the quotient decision circuit 150 determines the value of the quotient for the current cycle through the Montgomery multiplier. The second set of multiplexers 115A–N along with the second set of CSAs 120A–N perform the modulus-add [3] illustrated in the Montgomery multiplier pseudocode.
The sum and carry outputs from the second set of CSAs 120A–N are shifted right one bit and selectively re-circulated (i.e., to implement the loop) into inputs of the first set of CSAs 110A–N via a set of flip-flops 125A–M. For example, the sum output of CSA 120A is selectively coupled into CSA 110B via a flip-flop in the set of flip-flops 125A–M, while the carry output from CSA 120B is coupled to CSA 110B. Shifting of the output of each CSA 120A–N by one bit corresponds to the division of the partial remainder (i.e., the value present at the sum output of the CSAs 120A–N) by 2 as illustrated in [4] of the pseudocode. The sum output of the rightmost CSA 120N, in the second set of CSAs, is guaranteed to be ‘0’ by the quotient decision circuit 150 and is therefore ignored. Since the result of the Montgomery multiplier 100 is in redundant form, the result of the Montgomery multiplication is the sum of two vectors. In particular, the result of the Montgomery multiplication is the sum of the vector represented by the binary bits at the sum output of CSAs 120A–N (the sum vector), and the binary bits at the carry output of CSAs 120A–N (the carry vector). As will be further described later herein, the Montgomery multiplier of FIG. 1 contains a single stage. Since this stage processes one multiplier bit per cycle, the Montgomery multiplier of FIG. 1 as a whole processes one multiplier bit per cycle.
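The datapath of FIG. 1 can be modeled at word level to show how the redundant (sum, carry) form behaves. The following is a behavioral sketch under stated assumptions, not a gate-level description: the function names are hypothetical, each row of CSAs is modeled as one bitwise operation on integers, and the carry vector is given its weight by a left shift of one position.

```python
def csa(a: int, b: int, c: int) -> tuple[int, int]:
    """Model one row of 3-input carry save adders: a + b + c == sum_vec + carry_vec."""
    sum_vec = a ^ b ^ c
    carry_vec = ((a & b) | (a & c) | (b & c)) << 1   # carries move one position left
    return sum_vec, carry_vec


def montgomery_multiply_carry_save(mplier: int, mpcand: int, modulus: int,
                                   n_bits: int) -> tuple[int, int]:
    """Return (sum vector, carry vector) whose total is mplier*mpcand*2**(-n_bits) mod modulus."""
    s, c = 0, 0                                       # flip-flop contents
    for i in range(n_bits):
        addend = mpcand if (mplier >> i) & 1 else 0   # first multiplexer row
        s, c = csa(s, c, addend)                      # first CSA row: multiplicand add
        # The carry vector has a zero in bit 0, so the parity of s + c is bit 0 of s.
        odd = s & 1                                   # quotient decision
        s, c = csa(s, c, modulus if odd else 0)       # second CSA row: modulus-add
        # s + c is now even and c is even, so bit 0 of s is 0 and can be dropped.
        s >>= 1
        c >>= 1
    return s, c
```

No carry ever propagates across the full word inside the loop, which is the constant-time property of the redundant form; a conventional addition of the two returned vectors is needed only once, at the end.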
FIG. 5 illustrates a block diagram of a prior art embodiment of a Montgomery multiplier having two stages that each use Booth recoding of radix 4. As illustrated in FIG. 5, the Montgomery multiplier 500 comprises two stages (i.e., stage 582 and stage 583). Stage 582 comprises a first set of one or more multiplexers 506A–N (i.e., multiplicand-add multiplexers). Multiplier bits 571 control each multiplexer 506A–N. Each multiplexer 506A–N has the following inputs: −2× the multiplicand, −1× the multiplicand, 0, 1× the multiplicand and 2× the multiplicand.
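The five multiplexer inputs correspond to the digit set {−2, −1, 0, 1, 2} produced by radix-4 Booth recoding. As an illustration, the standard recoding rule can be sketched as follows; the function name is hypothetical, and the multiplier is assumed to be an unsigned value of at most `n_bits` bits.

```python
def booth_radix4_digits(mplier: int, n_bits: int) -> list[int]:
    """Recode an unsigned multiplier into radix-4 digits in {-2, ..., 2}, least significant first.

    Each digit is b[2k] + b[2k-1] - 2*b[2k+1], with b[-1] taken as 0.
    """
    digits = []
    prev = 0                              # bit just below the current pair
    for i in range(0, n_bits + 1, 2):     # extra pass absorbs the top carry-over
        b0 = (mplier >> i) & 1
        b1 = (mplier >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits
```

Each digit selects one of the five multiplexer inputs, so two multiplier bits are consumed per stage while only shifts and negations of the multiplicand are needed.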
Each multiplexer 506A–N has its output coupled to one input of a corresponding CSA in a first set of one or more carry save adders (CSAs) 507A–N. Each CSA has three inputs and two outputs. The other two inputs of each CSA in the first set of CSAs 507A–N are selectively coupled to the outputs of a set of flip-flops 515A–M. Thus, for each intermediate cycle through the Montgomery multiplier, the partial remainder present at the output of the flip-flops 515A–M is recirculated back into CSAs 507A–N. The multiplexers 506A–N and the CSAs 507A–N implement the multiplicand add of the Montgomery multiplier (similar to the multiplicand add of pseudocode [1], except that more bits are being processed because this stage of the Montgomery multiplier of FIG. 5 employs Booth recoding).
The two outputs of each CSA 507A–N comprise a sum output and a carry output. The sum output of each CSA in the first set of CSAs 507A–N is coupled to an input of a corresponding CSA in a second set of one or more CSAs 509A–N. The carry output of each CSA in the first set of CSAs 507A–N is coupled to the other input of a different CSA in the second set of CSAs 509A–N, namely the CSA to the immediate left of the CSA to which the sum output is connected. Thus, the sum output of CSA 507A is connected to one input of CSA 509A, and the carry output of CSA 507A is connected to one input of CSA 509B.
A second set of one or more multiplexers 511A–N (i.e., modulus-add multiplexers) is coupled to the input of a corresponding CSA in the second set of CSAs 509A–N. Each multiplexer in the second set of multiplexers 511A–N is controlled by a quotient decision circuit 535. The quotient decision circuit inspects the sum and carry outputs from CSA 507A, the sum output of CSA 507B, as well as the values of modulus bits 0 and 1 (modulus bit 0 is always a ‘1’), and determines what integer multiple of the modulus must be added to make the partial remainder of the quotient evenly divisible by the radix. One skilled in the art will appreciate that when Booth recoding of radix 4 is employed each modulus-add multiplexer has values of 0, 2× modulus, −1× modulus, and 1× the modulus value present at its inputs. Alternatively, each modulus-add multiplexer has values of 0, −2× modulus, −1× modulus, and 1× the modulus value present at its inputs. Thus, each multiplexer in the second set of multiplexers 511A–N inputs a bit to the corresponding CSA in the second set of CSAs 509A–N depending on the output from the quotient decision circuit 535. In this way, the quotient decision circuit 535 determines the value of the quotient for the current cycle through the Montgomery multiplier. The second set of multiplexers 511A–N along with the second set of CSAs 509A–N perform the modulus-add illustrated in the Montgomery multiplier pseudocode (similar to the modulus add of pseudocode [3], except that more bits are being processed because this stage of the Montgomery multiplier of FIG. 5 employs Booth recoding).
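The decision made by the quotient decision circuit can be illustrated with a small software sketch. The function names are hypothetical, the digit set {0, 1, 2, −1} is the first of the two alternatives named above, and the weighting of the three inspected bits (carry output of the rightmost CSA counted at weight 2) is an assumption made for illustration.

```python
def low_two_bits(sum_a: int, carry_a: int, sum_b: int) -> int:
    """Two low-order bits of the partial remainder from the three inspected bits.

    Assumes sum_a has weight 1 while carry_a and sum_b both have weight 2.
    """
    return (sum_a + 2 * (carry_a + sum_b)) % 4


def quotient_digit_radix4(p_low2: int, modulus_bit1: int) -> int:
    """Pick q in {0, 1, 2, -1} so that P + q * modulus is divisible by 4."""
    m_low2 = (modulus_bit1 << 1) | 1        # modulus bit 0 is always 1
    for q in (0, 1, 2, -1):
        if (p_low2 + q * m_low2) % 4 == 0:
            return q
    raise AssertionError("unreachable for an odd modulus")
```

For example, a partial remainder of 10 with a modulus of 239 gives `quotient_digit_radix4(10 & 3, (239 >> 1) & 1)`, which selects the multiple 2, and 10 + 2 × 239 = 488 is divisible by 4. Only two bits of the partial remainder and one bit of the modulus matter, but the lookup still sits on the critical path ahead of the modulus-add.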
As illustrated in FIG. 5, stage 582 of the Montgomery multiplier is duplicated to form second stage 583. The output from stage 582 is input into stage 583 as follows: The sum and carry outputs from CSAs 509B–N are shifted right two bit positions and input into CSAs 513A–N respectively. One skilled in the art will appreciate that the sum outputs from the first stage (i.e., from CSAs 509A–N) are right shifted two bit positions and input into CSAs 513A–N because Booth recoding of radix 4 is employed. Due to the shift, three bits (i.e., the sum and carry outputs of CSA 509A, and the sum output of CSA 509B) are shifted off the right edge of the multiplier circuit and are ignored. The right shifting of the output of each CSA 509A–N by two bit positions is equivalent to dividing the partial remainder of the Montgomery multiplier by 4.
The output from stage 583 is selectively fed back into stage 582 via flip-flops 515A–M. In particular, the sum and carry outputs from the second set of CSAs 514B–N in stage 583 are shifted right two bit positions and selectively re-circulated (i.e., to implement a loop) into inputs of the first set of CSAs 507A–N via a set of flip-flops 515A–M. Due to the shift, three bits (i.e., the sum and carry outputs of CSA 514A, and the sum output of CSA 514B) are shifted off the right edge of the multiplier circuit and are ignored. Thus, the carry output of CSA 514A along with the sum and carry outputs from CSAs 514B–N are selectively re-circulated to CSAs 507A–N. Shifting of the output of each CSA 514A–N by two bit positions corresponds to the division of the partial remainder by 4.
Since the result of the Montgomery multiplier 500 is in redundant form, the result of the Montgomery multiplication is the sum of two vectors. In particular, the result of the Montgomery multiplication is the sum of the vector represented by the binary bits at the sum output of CSAs 514A–N (the sum vector), and the binary bits at the carry output of CSAs 514A–N (the carry vector).
In the example of FIG. 5, each cycle through the Montgomery multiplier processes 4 multiplier bits (i.e., each of the two stages processes two bits). Therefore, one skilled in the art will appreciate that the number of bits processed per cycle depends on the number of stages in the Montgomery multiplier, and the number of bits processed per stage.
The determination of what integer multiple of the modulus must be added to make the partial remainder of the quotient evenly divisible by the radix is time-consuming, and requires at least a few gate delays. Moreover, broadcasting the decision from the quotient decision circuit to control the multiplexers in the second set of multiplexers, and the subsequent calculation of the modulus-add result, increase the time needed to calculate the result of the Montgomery multiplication. This is especially true when Booth recoding is employed and multiple multiplier bits are processed during each cycle through the Montgomery multiplier.