The present invention relates to the field of computers. More specifically, the present invention relates to computer arithmetic. Modular multiplication and squaring of large integers are the most performance-critical operations for the Rivest, Shamir, Adleman (RSA) algorithm and elliptic curve cryptography (ECC). Therefore, high-performance implementations focus specifically on optimizing these operations. On many processors, multiple-precision multiplication of large integers not only involves arithmetic operations but, due to limited register space, also requires a significant amount of data transport to and from memory. Assuming schoolbook multiplication of m-bit integers on a device with a word size of k bits, m-bit integers have to be divided into n-word operands, where
n = ⌈m/k⌉. To compute an m×m-bit multiplication, the number of k×k-bit multiplication operations is fixed to n^2 and possible reduction of the number of additions is limited. Therefore, computation time can mainly be optimized by reducing the number of non-arithmetic operations and specifically memory operations.
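As a brief illustration of the operand sizing described above, the following Python sketch computes the word count n = ⌈m/k⌉ and the resulting n^2 word multiplications. The function names `word_count` and `schoolbook_mul_count` are hypothetical and chosen only for this example.

```python
def word_count(m_bits, k_bits):
    """Number of k-bit words needed to hold an m-bit integer: ceil(m / k)."""
    # Integer ceiling division; avoids floating-point rounding.
    return -(-m_bits // k_bits)


def schoolbook_mul_count(m_bits, k_bits):
    """Number of k x k-bit word multiplications in an m x m-bit schoolbook product."""
    n = word_count(m_bits, k_bits)
    return n * n


if __name__ == "__main__":
    # Example parameter choices (assumptions for illustration only).
    for m, k in [(1024, 32), (160, 8), (163, 32)]:
        print(m, k, word_count(m, k), schoolbook_mul_count(m, k))
```

For example, a 1024-bit operand on a 32-bit device occupies 32 words and requires 1024 word multiplications, while a 160-bit operand on an 8-bit device occupies 20 words and requires 400.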
Conventional techniques for performing multiple-precision multiplication include row-wise multiplication and column-wise multiplication. When multiplying two multi-word integers A=(a_{n−1}, . . . , a_1, a_0) and B=(b_{n−1}, . . . , b_1, b_0), row-wise multiplication keeps the multiplier b_i constant and multiplies it with the entire multiple-precision multiplicand (a_{n−1}, . . . , a_1, a_0) before moving to the next multiplier b_{i+1}. Partial products are summed up in an accumulator consisting of n registers (r_{n−1}, . . . , r_1, r_0), each of bit width k. Upon completion of a row, the last register of the accumulator (r_0 for the first row) can be stored to memory as part of the final result and can be reused for accumulation of the next row. Two registers are required to store the constant b_i and one variable a_j. In the described implementation, row-wise multiplication requires n+2 registers and performs n^2+3n memory accesses. That is, for each k×k multiplication one memory load operation is needed. On processor architectures that do not have sufficient register space for the accumulator, up to n^2+1 additional load and n^2−n additional store operations are required. On the other hand, processors that can hold both the accumulator and the entire multiplicand in register space can perform row-wise multiplication with 2n+1 registers and only 4n memory accesses. In addition to memory accesses, pointers to multiplicand, multiplier and result may have to be adjusted on implementations using indexed addressing. If multiplicand and multiplier are indexed, one pointer increment/decrement is needed for each load operation. FIG. 4 depicts an illustrative example of row-wise multiplication with a multiplicand 401 and a multiplier 402.
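The row-wise scheme described above can be sketched in Python as follows. This is a minimal model under stated assumptions, not a definitive implementation: the word size K = 32, the helper names `to_words` and `from_words`, and the way memory accesses are tallied (one load per b_i, one load per a_j, one store per result word) are all choices made for this example. Under that tally the count comes to n^2+3n.

```python
K = 32                      # assumed word size k in bits
MASK = (1 << K) - 1


def to_words(x, n):
    """Split a non-negative integer into n little-endian K-bit words."""
    return [(x >> (K * i)) & MASK for i in range(n)]


def from_words(words):
    """Recombine little-endian K-bit words into one integer."""
    return sum(w << (K * i) for i, w in enumerate(words))


def rowwise_mul(a, b):
    """Row-wise product of two n-word operands.

    Returns (list of 2n result words, modeled memory-access count).
    """
    n = len(a)
    result = [0] * (2 * n)
    acc = [0] * n           # models the n accumulator registers
    mem_ops = 0
    for i in range(n):
        bi = b[i]           # load the multiplier word, kept constant for the row
        mem_ops += 1
        carry = 0
        for j in range(n):
            aj = a[j]       # load one multiplicand word per word multiplication
            mem_ops += 1
            t = acc[j] + aj * bi + carry   # always fits in 2K bits
            acc[j] = t & MASK
            carry = t >> K
        result[i] = acc[0]  # lowest accumulator word is final: store it
        mem_ops += 1
        acc = acc[1:] + [carry]            # reuse the freed register for the next row
    for j in range(n):      # store the remaining n accumulator words
        result[n + j] = acc[j]
        mem_ops += 1
    return result, mem_ops


if __name__ == "__main__":
    x, y, n = 0xFFFFFFFFFFFFFFFFFFFFFFFF, 0x123456789ABCDEF0FEDCBA98, 3
    words, ops = rowwise_mul(to_words(x, n), to_words(y, n))
    print(from_words(words) == x * y, ops)
```

For n = 3 the sketch performs 18 modeled memory accesses, which agrees with n^2+3n.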
Column-wise multiplication sums up columns of partial products a_j*b_i, where i+j=ν for column ν. A description of column-wise multiplication can be found in "Exponentiation Cryptosystems on the IBM PC," IBM Systems Journal, Vol. 29, Issue 4, pages 526-538 (1990) by P. G. Comba. FIG. 6 depicts an illustrative example of column-wise multiplication with a multiplicand 601 and a multiplier 602. At the end of each column, one k-bit word is stored as part of the final multiplication result. Column-wise multiplication requires 4+⌈log_2(n)/k⌉ registers and can thus be implemented on a platform with limited register space. However, 2n^2+2n memory operations have to be performed, which corresponds to approximately two memory load operations per k×k-bit multiplication, because multiplicand a_j and multiplier b_i each have to be loaded once for every k×k-bit multiplication.
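The column-wise scheme described above can likewise be sketched in Python. The word size K = 32 and the memory-access tally (two loads per word multiplication, one store per result word) are assumptions of this example. The single Python integer `acc` stands in for the small multi-word column accumulator of a register-level implementation. Under that tally the count comes to 2n^2+2n.

```python
K = 32                      # assumed word size k in bits
MASK = (1 << K) - 1


def columnwise_mul(a, b):
    """Column-wise product of two n-word little-endian operands.

    Returns (list of 2n result words, modeled memory-access count).
    """
    n = len(a)
    result = [0] * (2 * n)
    acc = 0                 # models the column accumulator (about 2K + log2(n) bits)
    mem_ops = 0
    for v in range(2 * n - 1):
        # Sum all partial products a_j * b_i with i + j = v.
        for j in range(max(0, v - n + 1), min(v, n - 1) + 1):
            aj = a[j]       # both operand words are loaded for every
            bi = b[v - j]   # word multiplication
            mem_ops += 2
            acc += aj * bi
        result[v] = acc & MASK   # one k-bit word is final at the end of each column
        mem_ops += 1
        acc >>= K                # the remainder carries into the next column
    result[2 * n - 1] = acc & MASK   # top word of the product
    mem_ops += 1
    return result, mem_ops


if __name__ == "__main__":
    a = [0xFFFFFFFF, 0x89ABCDEF, 0x01234567]
    b = [0x0F0F0F0F, 0xFFFFFFFF, 0xDEADBEEF]
    words, ops = columnwise_mul(a, b)
    print([hex(w) for w in words], ops)
```

For n = 3 the sketch performs 24 modeled memory accesses, which agrees with 2n^2+2n and is higher than the 18 accesses obtained by applying n^2+3n to row-wise multiplication at the same operand size.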