1. Field of the Invention
The invention pertains generally to computers. In particular, it pertains to processing elements used in a Montgomery multiplier.
2. Description of the Related Art
A number of applications, including public key data encryption/decryption such as the Rivest-Shamir-Adleman (RSA) algorithm, employ algorithms that use a combination of multiplication and modular reduction on large numbers. In particular, the RSA algorithm repeatedly implements the operation X Y mod Z (the remainder of X times Y divided by Z). When dealing with large numbers (such as 1024-bit numbers which are common in RSA-based data security), this algorithm includes at least two time-consuming operations (a multiplication and a division), and detection of a remainder. Montgomery multipliers are frequently used to implement this algorithm. Montgomery multipliers use a transform to perform X Y mod Z in a single operation by transforming X, Y, and Z into other numbers such as A, B, and M, performing a transformed operation, and reverse-transforming the result. With a long sequence of multiplies, if the transformed operation is less time consuming than the actual operation, the total time to determine a final answer can be significantly less with the Montgomery multiplier.
Some Montgomery multipliers use a linear systolic array, i.e., a chain of identical processing elements (PEs), with each PE working on a portion (for example four bits) of each of the numbers involved. The chain contains enough PEs to hold the largest of the numbers involved, including interim results. Interim information is fed in both directions between adjacent PEs during the operation.
In a conventional linear systolic array Montgomery multiplier (LSAMM), two of the three parameters (typically B and M), are pre-loaded into the PEs through buses connected to every PE. The connections needed to route multi-bit buses to hundreds of PEs adds considerable complexity to the circuit. A conventional LSAMM also incorporates a decoding circuit in every PE to decode a multi-bit carry operation. Each PE must include circuitry dedicated to performing the bus interface and carry decode operations, so the total circuitry required for bus operation and multi-bits carries can be determined by multiplying this extra logic by the number of PEs in the LSAMM. Since Montgomery multipliers are frequently implemented in a field programmable gate array (FPGA) with a fixed amount of available circuitry, all that extra logic consumes many gates that could otherwise be used to increase the number of PEs, or could be devoted to other functions.
Multiples of at least one parameter must be loaded into a conventional Montgomery multiplier before a Montgomery multiplication begins. Pre-calculating these multiples of very large numbers in software is a time-consuming process that slows down any operation requiring Montgomery multiplications.