1. Field of the Invention
The present invention relates to processors and in particular to processors having long number calculating units, such as those required, for example, for cryptographic applications.
2. Description of the Related Art
DE 3631992 C2 discloses a method in which the modular multiplication over Z/NZ may be accelerated using a multiplication look-ahead method and a reduction look-ahead method. The method described in DE 3631992 C2 is also referred to as the ZDN method and is described in more detail referring to FIG. 6. After an initial step 900 of the algorithm the global variables M, C and N are initialized. The object is to calculate the following modular multiplication: Z=C*M mod N.
M is referred to as the multiplier, while C is referred to as the multiplicand. Z is the result of the modular multiplication, while N is the modulus.
Next, different local variables are initialized, which are not explained in more detail for the time being. Then two look-ahead methods are applied. In the multiplication look-ahead method GEN_MULT_LA, a multiplication shift value sz and a multiplication look-ahead parameter a are calculated using different look-ahead rules (910). Thereupon, the current content of the Z register is shifted to the left by sz digits (920).
Basically in parallel to this, a reduction look-ahead method GEN_Mod_LA (930) is performed to calculate a reduction shift value sN and a reduction parameter b. In a step 940 the current content of the modulus register, i.e. N, is then shifted by sN digits to generate a shifted modulus value N′. The central three operand operation of the ZDN method is performed in a step 950. Here, the intermediate result Z′ obtained after step 920 is added to the multiplicand C multiplied by the multiplication look-ahead parameter a, and to the shifted modulus N′ multiplied by the reduction look-ahead parameter b. Depending on the current situation, the look-ahead parameters a and b may have a value of +1, 0 or −1.
One case is that the multiplication look-ahead parameter a is +1 and that the reduction look-ahead parameter b is −1, so that the multiplicand C is added to the shifted intermediate result Z′ and the shifted modulus N′ is subtracted from it. Among other cases, a will have a value of 0 when the multiplication look-ahead method would allow more than a predetermined number of individual left-shifts, i.e. when sz is larger than the maximum admissible value of sz, which is also referred to as k. For the case of a being 0 and of Z′ still being relatively small due to the preceding modular reduction, i.e. the preceding subtraction of the shifted modulus, and in particular smaller than the shifted modulus N′, no reduction needs to take place, so that the parameter b is equal to 0.
The steps 910 to 950 are performed until all digits of the multiplicand have been processed, i.e. until m is equal to 0, and until a parameter n is also equal to 0. The parameter n indicates whether the shifted modulus N′ is still larger than the original modulus N, or whether, despite the fact that all digits of the multiplicand have already been processed, further reduction steps have to be performed by subtracting the modulus from Z.
Finally, it is determined whether Z is smaller than 0. If this is the case, the modulus N must be added to Z as a final reduction, so that the correct result Z of the modular multiplication is obtained. In a step 960 the modular multiplication using the ZDN method is terminated.
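The loop structure described above (shift Z, three operand addition, final correction) can be illustrated with a minimal Python sketch. It is not the ZDN method of DE 3631992 C2: the look-ahead rules are not reproduced here, so the shift value is fixed at 1, the parameter a is simply the current multiplier bit, and the modulus is left unshifted. The function names `three_operand_add` and `modmul_sketch` are hypothetical and chosen for illustration only.

```python
def three_operand_add(z_shifted, c, n_shifted, a, b):
    """Central operation of step 950: Z = Z' + a*C + b*N'."""
    assert a in (-1, 0, 1) and b in (-1, 0, 1)
    return z_shifted + a * c + b * n_shifted


def modmul_sketch(c, m, n):
    """Simplified bit-serial computation of Z = C*M mod N.

    Assumption: stand-ins replace the look-ahead methods. The shift
    value is always 1, a is the current bit of M, N is never shifted.
    """
    z = 0
    for i in reversed(range(m.bit_length())):
        z_shifted = z << 1                    # step 920 with sz fixed at 1
        a = (m >> i) & 1                      # stand-in for GEN_MULT_LA
        # Stand-in for GEN_Mod_LA: subtract N once if the sum reaches N.
        b = -1 if z_shifted + a * c >= n else 0
        z = three_operand_add(z_shifted, c, n, a, b)
        # One subtraction may not be enough here; reduce until Z < N.
        while z >= n:
            z = three_operand_add(z, c, n, 0, -1)
    if z < 0:                                 # final correction step; never
        z += n                                # triggers in this simplified variant
    return z


print(modmul_sketch(7, 13, 11))  # 7*13 = 91, and 91 mod 11 = 3
```

In this simplified variant the three operand addition is the only arithmetic step inside the loop, which is the property the long number calculating unit described below is built to exploit.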
The multiplication shift value sz and the multiplication parameter a, which are calculated in step 910 by the multiplication look-ahead algorithm, result from the topology of the multiplier and from the applied look-ahead rules, which are described in DE 3631992 C2.
The reduction shift value sN and the reduction parameter b are determined by comparing the current content of the Z register to a value of ⅔ times N, as is also described in DE 3631992 C2. The ZDN method takes its name from this comparison (ZDN = two thirds N).
The ZDN method, as illustrated in FIG. 4, reduces the modular multiplication to a three operand addition (block 950 in FIG. 4), wherein the multiplication look-ahead method and, along with it, the reduction look-ahead method may be used to increase the calculation time efficiency. Compared to the Montgomery reduction for Z/NZ, a calculation time advantage by a factor in the range of 3 may be achieved.
For performing the three operand addition in block 950 of FIG. 4, for example the calculating unit illustrated in FIG. 3a may be used. The calculating unit shown in FIG. 3a includes a first partial calculating unit 30a and a second partial calculating unit 30b. The partial calculating units 30a, 30b consist of bit-slices stacked on top of each other, which are typically all implemented identically. A carry output of the topmost bit-slice of the first partial calculating unit 30a is connected to the least significant bit-slice of the second partial calculating unit 30b via a carry output line 31 and a configuration switch 32. If the configuration switch is closed, i.e. the carry output of the topmost bit-slice of the first partial calculating unit 30a is fed into a carry input of the least significant bit-slice of the second partial calculating unit 30b, the calculating unit arrangement of FIG. 3a functions as a complete calculating unit whose length is equal to the number of bit-slices of the first partial calculating unit 30a plus the number of bit-slices of the second partial calculating unit 30b.
It is to be noted that the arrangement of the two partial calculating units 30a and 30b in FIG. 3a may correspond to the geometrical arrangement of the bit-slices on an integrated circuit. Every bit-slice is set up identically, as shown in FIG. 3b. In particular, every bit-slice includes a register location Ci for storing the corresponding bit of the multiplicand C, a register location of a modulus register for storing the corresponding bit of the modulus N, a register location for storing the corresponding bit i of the intermediate result Z, and two auxiliary registers CR1 and CR2, wherein, when one bit-slice is considered, each register is “represented” by the bit of the corresponding order i. Finally, each bit-slice also includes an arithmetic unit AUi, which comprises a carry input to obtain a carry from the lower stage and a carry output to output a carry to the next stage. The carry input is also referred to as carry-in and the carry output is also referred to as carry-out. The line 31 is therefore on the one hand the carry output of the most significant bit-slice of the first partial calculating unit 30a and on the other hand the carry input of the least significant bit-slice of the second partial calculating unit 30b.
By stacking the bit-slices shown in FIG. 3b on top of each other, a long number calculating unit with for example more than 2,048 bit-slices therefore results, wherein the long number calculating unit comprises the registers C, N, Z, CR1 and CR2 slice-internally, which are each more than 2,048 bits long.
Therefore, the long number calculating unit shown in FIG. 3a is able to quickly perform the three operand addition shown in block 950. For a multi-functional processor, which may for example be used as a crypto-coprocessor in a smart card, it is desirable that it should also be able to perform an RSA calculation or an elliptic curve cryptography calculation for significantly shorter numbers. For example, an RSA calculation with a key length of 1,024 bits offers lower security than an RSA calculation with 2,048 bits. In cases in which the security of 1,024 bits is nevertheless satisfactory, the calculating unit shown in FIG. 3a should also be usable efficiently. For this purpose, the configuration switch 32 is provided to interrupt the carry path 31. In this case the full-length calculating unit yields two partial calculating units 30a, 30b, which may operate independently, for example to perform in parallel those calculations of a cryptographic algorithm that can be parallelized. It is thereby achieved that two parallel 1,024 bit calculations may be performed using the calculating unit.
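The effect of the configuration switch on the carry path can be modeled with a short Python sketch. It is a behavioral illustration only, assuming plain two operand addition and a small width for readability; the function names `add_n_bits` and `split_adder` and the parameter `switch_closed` are hypothetical and do not come from the figures.

```python
def add_n_bits(x, y, carry_in, width):
    """Add two width-bit values; return (sum mod 2**width, carry_out)."""
    total = x + y + carry_in
    return total & ((1 << width) - 1), total >> width


def split_adder(x_lo, y_lo, x_hi, y_hi, width, switch_closed):
    """Model of two partial units joined by a configuration switch.

    switch_closed=True : the carry of the lower unit feeds the upper
                         unit, so the pair acts as one 2*width-bit adder.
    switch_closed=False: the carry path is interrupted and the two
                         units add independently of each other.
    """
    sum_lo, carry_lo = add_n_bits(x_lo, y_lo, 0, width)
    carry_into_hi = carry_lo if switch_closed else 0
    sum_hi, _ = add_n_bits(x_hi, y_hi, carry_into_hi, width)
    return sum_lo, sum_hi


# 4-bit units: closed switch computes 0x2F + 0x31 = 0x60 as one 8-bit sum.
print(split_adder(0xF, 0x1, 0x2, 0x3, 4, switch_closed=True))   # (0, 6)
# Open switch: 0xF + 0x1 and 0x2 + 0x3 are two separate 4-bit sums.
print(split_adder(0xF, 0x1, 0x2, 0x3, 4, switch_closed=False))  # (0, 5)
```

The only difference between the two modes is whether the carry of the lower unit reaches the upper unit, which matches the role the text assigns to the switch 32 on the line 31.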
Frequently, an algorithm requires that the contents of the registers C, N, Z, CR1, CR2 be copied from one partial calculating unit 30a into the other partial calculating unit 30b or vice versa. For this purpose, an internal bus 33 is provided, which provides, for each bit-slice of the first partial calculating unit 30a, a connection line to the bit-slice of the corresponding order of the other partial calculating unit. Thus, the complete content of a register in the first partial calculating unit 30a may be written quickly into a register of the second partial calculating unit 30b. For example, a 1,024 bit number in a register of the partial calculating unit 30a may be written into a register of the partial calculating unit 30b in one cycle. In addition, a data bus 34 with a width n is provided, wherein the width n of the data bus is smaller than the number of bit-slices in a partial calculating unit 30a or 30b, and is in particular much smaller, comprising e.g. only 8 bits. The data bus 34 is implemented, however, to be able to access all bit-slices and in particular all registers of the bit-slices of both partial calculating units 30a, 30b via multiplexers 35a, 35b.
The importance of the internal bus 33 on the one hand and of the data bus 34 on the other hand for a data transfer is illustrated using a small numerical example. It is assumed that each partial calculating unit 30a, 30b comprises 1,024 bit-slices. If register data of one partial calculating unit is written via the 8 bit wide data bus 34 into a register of the other partial calculating unit, then 128 cycles are required for this, which is definitely not acceptable, as a simple load command (load), a store command (store) or a register exchange command (exchange) would require an immense period of time. Therefore, the internal bus 33 is provided, through which the 1,024 bits may be transmitted in one cycle from one partial calculating unit into the other.
The calculating unit shown in FIG. 3a therefore allows either one calculating unit of full length or two parallel partial calculating units to be operated independently of each other, wherein a fast register exchange between the two partial calculating units 30a and 30b is additionally possible.
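The cycle count in this example follows from dividing the register length by the bus width and rounding up. A minimal Python sketch of that arithmetic, assuming one bus word is transferred per cycle (the function name `transfer_cycles` is hypothetical):

```python
def transfer_cycles(register_bits, bus_width_bits):
    """Cycles needed to move a register over a bus, one bus word per cycle."""
    # Round up: a partially filled last bus word still costs a full cycle.
    return (register_bits + bus_width_bits - 1) // bus_width_bits


data_bus_cycles = transfer_cycles(1024, 8)         # narrow data bus 34
internal_bus_cycles = transfer_cycles(1024, 1024)  # internal bus 33
print(data_bus_cycles, internal_bus_cycles)        # prints "128 1"
```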
In particular with elliptic curve cryptography, the key lengths are significantly shorter and lie in the range of for example 130 bits. If it is assumed that both the first partial calculating unit 30a and the second partial calculating unit 30b are to work as elliptic curve cryptography calculating units, only for example the topmost 130 bit-slices of each are required. This active part of the first partial calculating unit 30a is designated by 36a in FIG. 3a, while the active part of the second partial calculating unit 30b is designated by 36b. The remainder of each of the two partial calculating units, i.e. the bit-slices which lie below the bit-slice i or the bit-slice k, respectively, is not required. When it is assumed, for example, that a complete partial calculating unit comprises 1,024 bits and that only the topmost 130 bits are required, then about 87% of the register locations of each partial calculating unit remain unused. In particular with integrated circuits for chip cards this is especially disadvantageous, as the memory resources on the chip card are in any case very restricted due to tight area limitations.
Thus, it would be conceivable to release the entire remainder of the first partial calculating unit 30a as a register, although the arithmetic devices of its bit-slices are not needed, and to release the entire remainder of the second partial calculating unit 30b as a register, although here, too, the arithmetic devices are not required. In this case a fast copy of a register of the first partial calculating unit into a register of the second partial calculating unit would still be possible via the internal bus connections 33.
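The 87% figure can be checked with a short Python calculation, using the 1,024 bit-slices and 130 active bit-slices assumed in the text (the function name `unused_fraction` is hypothetical):

```python
def unused_fraction(total_slices, active_slices):
    """Share of bit-slices whose register locations are left idle."""
    return (total_slices - active_slices) / total_slices


print(f"{unused_fraction(1024, 130):.1%}")  # prints "87.3%"
```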
On the other hand, cryptographic algorithms which make do with shorter keys, like for example the elliptic curve cryptography (ECC), often have the characteristic that they require a great number of operands.
It would therefore be desirable to divide the non-active area of both the first partial calculating unit 30a and the second partial calculating unit 30b into further units in order to define additional registers. These additional registers are realized by the bit-slice sub-stacks designated by 37a, 37b and 38a, 38b, respectively. From FIG. 3a it may be seen that no wide-band data exchange between the newly defined registers 37a and 38a is possible. Likewise, no wide-band data exchange is possible between, for example, the register 38a of the first partial calculating unit 30a and the register 37b of the second partial calculating unit 30b. Such registers may only communicate with each other via the data bus 34, which is disadvantageous in particular because the data bus 34 is narrow-band. For example, when the register 38a is 130 bits wide and the register 37b is also 130 bits wide, a data exchange from the register 38a to the register 37b via the 8 bit wide data bus 34 would require 17 cycles, which is not tolerable for reasons of time. If the registers 38a and 37b are longer than 130 bits, then the number of cycles for a data exchange between these registers via the data bus 34 increases accordingly.
The further division of the non-active areas of the partial calculating units is therefore ruled out for performance reasons or, if the register division is performed anyway, leads to a substantial reduction of the performance of a processor set up this way. On the other hand, given the tight limits on memory, it is not, or hardly, acceptable to leave a large amount of register memory area unused.