1. Field
The present disclosure pertains to the field of data processing, and more particularly, to the field of cryptography.
2. Description of Related Art
The computations required to implement many cryptographic protocols, such as the Rivest-Shamir-Adleman and Diffie-Hellman algorithms for public-key cryptography, include modular exponentiation. A straightforward approach to modular exponentiation requires dividing each multiplication product by a modulus to find the remainder. The modulus is typically a large number, so the division operation may be very slow. Therefore, a technique known as Montgomery multiplication may be used to perform modular exponentiation more efficiently.
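For illustration only, the straightforward approach may be sketched as follows. This is a minimal Python sketch of square-and-multiply exponentiation, not a method taken from the present disclosure; the function name modexp_with_division and its parameter names are hypothetical. Each `%` operation stands for the division by the modulus that Montgomery multiplication is intended to avoid.

```python
def modexp_with_division(base, exponent, modulus):
    """Compute base**exponent mod modulus by square-and-multiply.

    Hypothetical helper for illustration: every multiplication product
    is reduced with a division by the modulus (the % operator).
    """
    result = 1 % modulus          # handles the degenerate modulus == 1 case
    base %= modulus
    while exponent > 0:
        if exponent & 1:
            # Multiply step, followed by a division to find the remainder.
            result = (result * base) % modulus
        # Square step, again followed by a division.
        base = (base * base) % modulus
        exponent >>= 1
    return result
```

For a 1,024-bit modulus, this loop performs on the order of a thousand or more such reductions, which is why the cost of each division matters.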
Montgomery multipliers perform modular exponentiation using a transformation that allows the division operation to be replaced by shift operations. For example, the following radix-2 Montgomery multiplication algorithm computes the result, Z, from multiplier X, multiplicand Y, and modulus M, by generating and right shifting a running sum for each bit x_i of the n-bit multiplier:
Z = 0
for i = 0 to n−1
    Z = Z + x_i Y
    if Z is odd then Z = Z + M
    Z = Z/2
if Z ≥ M then Z = Z − M
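The radix-2 algorithm above may be sketched in software as follows. This is a minimal Python sketch under stated assumptions: M is odd, X and Y are non-negative and less than M, and M is less than 2^n. The function name montgomery_multiply_radix2 is hypothetical. Under these assumptions the value returned is X·Y·2^(−n) mod M, i.e., the product in the transformed (Montgomery) domain rather than the ordinary product X·Y mod M.

```python
def montgomery_multiply_radix2(x, y, m, n):
    """Radix-2 Montgomery product: returns x * y * 2**(-n) mod m.

    Assumptions (not checked): m is odd, 0 <= x, y < m < 2**n.
    """
    z = 0
    for i in range(n):
        x_i = (x >> i) & 1      # bit i of the multiplier
        z += x_i * y            # Z = Z + x_i * Y
        if z & 1:               # if Z is odd, adding the odd modulus makes it even
            z += m
        z >>= 1                 # Z = Z / 2, an exact right shift
    if z >= m:                  # single final correction step
        z -= m
    return z
```

As a usage example, with n = 4, M = 13, X = 7, and Y = 5, the running sum takes the values 9, 7, 6, and 3, so the function returns 3, which satisfies 3·2^4 ≡ 7·5 (mod 13).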
Some Montgomery multipliers include a linear systolic array, i.e., a chain of identical processing elements (“PE”s), where each PE works on a portion of the computation (e.g., four bits of a 1,024-bit number) at the same time. For example, in the following algorithm, the n-bit result Z, multiplicand Y, and modulus M are each broken down into e words, to allow each PE to work on a different word of the computation at the same time. In this algorithm, word positions are represented using superscripts, bit positions are represented using subscripts, and the concatenation of bits and/or words A and B is represented as (A,B).
Z = 0
for i = 0 to n−1
    (Ca, Z^0) = x_i Y^0 + Z^0
    odd = Z^0_0
    if odd then (Cb, Z^0) = M^0 + Z^0
    for j = 1 to e
        (Ca, Z^j) = Ca + x_i Y^j + Z^j
        if odd then (Cb, Z^j) = Cb + M^j + Z^j
        Z^(j−1) = (Z^j_0, Z^(j−1)_(w−1:1))
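The word-based algorithm above may be sketched in software as follows. This is a minimal Python sketch, not a description of any particular hardware, and it relies on several assumptions that the pseudocode does not state: the word width is w bits with w ≥ 2 and w dividing n; words 0 through e−1 hold the operands, and word e is an extra zero guard word for Y and M that absorbs the final carries; the guard word of Z is itself shifted right at the end of each i-iteration; and a final conditional subtraction is applied as in the radix-2 algorithm. The names split_words, join_words, and montgomery_multiply_words are hypothetical.

```python
def split_words(value, w, count):
    """Split value into `count` words of w bits, least significant word first."""
    mask = (1 << w) - 1
    return [(value >> (w * j)) & mask for j in range(count)]


def join_words(words, w):
    """Reassemble an integer from w-bit words, least significant word first."""
    return sum(word << (w * j) for j, word in enumerate(words))


def montgomery_multiply_words(x, y, m, n, w):
    """Word-serial radix-2 Montgomery product: x * y * 2**(-n) mod m.

    Assumptions (not checked): m odd, 0 <= x, y < m < 2**n, w >= 2, n % w == 0.
    """
    e = n // w
    mask = (1 << w) - 1
    # Assumption: index e is a zero guard word for Y and M.
    Y = split_words(y, w, e + 1)
    M = split_words(m, w, e + 1)
    Z = [0] * (e + 1)

    for i in range(n):
        x_i = (x >> i) & 1
        t = x_i * Y[0] + Z[0]                  # (Ca, Z^0) = x_i Y^0 + Z^0
        ca, Z[0] = t >> w, t & mask
        odd = Z[0] & 1                         # odd = Z^0_0
        cb = 0
        if odd:
            t = M[0] + Z[0]                    # (Cb, Z^0) = M^0 + Z^0
            cb, Z[0] = t >> w, t & mask
        for j in range(1, e + 1):
            t = ca + x_i * Y[j] + Z[j]         # (Ca, Z^j) = Ca + x_i Y^j + Z^j
            ca, Z[j] = t >> w, t & mask
            if odd:
                t = cb + M[j] + Z[j]           # (Cb, Z^j) = Cb + M^j + Z^j
                cb, Z[j] = t >> w, t & mask
            # Z^(j-1) = (Z^j_0, Z^(j-1)_(w-1:1)): one-bit right shift across words
            Z[j - 1] = ((Z[j] & 1) << (w - 1)) | (Z[j - 1] >> 1)
        Z[e] >>= 1                             # assumption: shift the guard word too

    z = join_words(Z, w)
    if z >= m:                                 # assumption: final correction step
        z -= m
    return z
```

The two carries Ca and Cb are kept separate so that each word-level addition produces at most a one-bit carry. Because word j−1 can only be completed once the least significant bit of word j is known, the shift in the last step of the j-loop is what creates the dependency between adjacent words.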
In a typical hardware implementation of this algorithm, the iterations of the j-loop would occur sequentially in one PE, but the iterations of the i-loop would occur in different PEs, allowing the i-loop iterations to overlap. For example, the first PE would work on x_0, Y^0, M^0, and Z^0 during the first clock cycle; x_0, Y^1, M^1, and Z^1 during the second clock cycle; x_0, Y^2, M^2, and Z^2 during the third clock cycle; and so on. The second PE would work on x_1, Y^0, M^0, and Z^0; x_1, Y^1, M^1, and Z^1; x_1, Y^2, M^2, and Z^2; and so on, sequentially, but could not start until the third clock cycle. The reason for this delay is that the value of the first word of the running sum based on x_0 is not available until the first PE has worked with x_0, Y^1, M^1, and Z^1 in the second clock cycle, because the least significant bit of Z^1 must be shifted to the most significant bit of Z^0 in the last step of the j-loop for j=1.
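The timing described above may be tabulated with a short sketch. The following Python sketch is illustrative only; the function name pe_schedule and its parameters are hypothetical. It assumes that PE number p handles multiplier bit x_(p−1), that each PE processes one word per clock cycle, and that every PE starts two clock cycles after the preceding PE. The two-cycle delay is stated above only for the first and second PEs; extending it to later PEs is an assumption made here.

```python
def pe_schedule(num_pes, num_words, num_cycles):
    """Return {cycle: [(pe, bit_index, word_index), ...]} for a linear PE chain.

    Cycles and PE numbers are 1-based; bit and word indices are 0-based.
    Assumption: PE p begins at cycle 1 + 2 * (p - 1).
    """
    schedule = {}
    for cycle in range(1, num_cycles + 1):
        active = []
        for pe in range(num_pes):
            start = 1 + 2 * pe          # two-cycle stagger between adjacent PEs
            word = cycle - start        # word index this PE handles in this cycle
            if 0 <= word < num_words:
                active.append((pe + 1, pe, word))
        schedule[cycle] = active
    return schedule


if __name__ == "__main__":
    for cycle, work in pe_schedule(num_pes=2, num_words=4, num_cycles=6).items():
        text = ", ".join(f"PE{p}: x_{i}, word {j}" for p, i, j in work)
        print(f"cycle {cycle}: {text}")
```

With two PEs, the third clock cycle is the first in which both are active: the first PE works on word 2 with x_0 while the second PE works on word 0 with x_1.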