Several conventional encryption protocols require modular multiplication of extremely long numbers (e.g., 1024 bits or more) using an arbitrary modulus. Performed directly, this operation requires division by the large modulus, and may therefore consume significant computing resources. Montgomery multiplication is a known method for replacing this division with additions and shifts, which can be implemented using dedicated hardware accelerators. FIG. 1 illustrates a Montgomery multiplication algorithm to generate output Z based on n-bit multiplier X, multiplicand Y and modulus M.
According to the algorithm, each w-bit word of Y is multiplied by a bit of X and added to the running sum held in the corresponding w bits of Z. If the least-significant bit of Z is 1 (i.e., the running sum is odd), the corresponding w bits of M are also added to the running sum. The process is repeated until every w-bit word of Y has been multiplied by every bit of X.
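The word-by-word procedure described above may be sketched in software as follows. This is a minimal illustration only, not a reproduction of FIG. 1. It assumes that M is odd, that X and Y are smaller than M, that the running sum is shifted right by one bit after each bit of X (as in a radix-2 Montgomery multiplier), and that one spare bit is allocated so that the running sum, which stays below 2M, fits in e words. The names montgomery_multiply, to_words and from_words are hypothetical.

```python
def to_words(value, w, count):
    """Split an integer into `count` little-endian words of w bits each."""
    mask = (1 << w) - 1
    return [(value >> (w * j)) & mask for j in range(count)]


def from_words(words, w):
    """Reassemble little-endian w-bit words into an integer."""
    return sum(word << (w * j) for j, word in enumerate(words))


def montgomery_multiply(x, y, m, n, w):
    """Return x * y * 2**(-n) mod m, processing y and m in w-bit words.

    Assumes m is odd, m < 2**n, and 0 <= x, y < m.
    """
    if m % 2 == 0:
        raise ValueError("Montgomery multiplication needs an odd modulus")
    e = -(-(n + 1) // w)          # ceil((n + 1) / w): assumed word count
    mask = (1 << w) - 1
    y_words = to_words(y, w, e)
    m_words = to_words(m, w, e)
    z = [0] * e                   # running sum Z, kept as w-bit words

    for i in range(n):            # outer loop: one bit of X per iteration
        x_i = (x >> i) & 1
        # Parity of Z + x_i * Y decides whether M must be added.
        q = (z[0] + x_i * y_words[0]) & 1
        carry = 0
        for j in range(e):        # inner loop: one w-bit word per iteration
            total = z[j] + x_i * y_words[j] + q * m_words[j] + carry
            z[j] = total & mask
            carry = total >> w
        # The sum is now even; shift it right by one bit across all words.
        for j in range(e - 1):
            z[j] = (z[j] >> 1) | ((z[j + 1] & 1) << (w - 1))
        z[e - 1] = (z[e - 1] >> 1) | (carry << (w - 1))

    result = from_words(z, w)
    # The running sum stays below 2 * m, so one subtraction is enough.
    return result - m if result >= m else result
```

Under these assumptions, the returned value Z satisfies Z·2^n ≡ X·Y (mod M). The bit of each word that crosses into the preceding word during the shift is the dependency that a pipelined hardware implementation must respect between adjacent words.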
FIG. 2 illustrates pipeline timings 200 and 250 to implement the algorithm according to conventional systems. Pipeline timing 200 represents a scenario in which the number p of w-bit processing elements (PEs) is small compared to the total number of words e to be processed (e = n/w). As shown, pipeline timings 200 and 250 parallelize the outer loop of the algorithm (i.e., i=0, 1, . . . , n−1) by simultaneously operating on adjacent bits of X using adjacent PEs. However, due to read-after-write hazards at bits w-1, 2w-1, 3w-1, etc., stalls are inserted between successive iterations of the outer loop. For example, PE2 does not begin processing until t=3. Moreover, as shown in pipeline timing 250, kernel stalls must be inserted between iterations of the inner loop (i.e., j=1, 2, . . . , e) in a case where p is not small compared to e. Such stalls compromise the performance of conventional Montgomery multiplier implementations.