The present invention relates to the calculation of modular multiplication as required, in particular, in cryptographic applications, and specifically to the calculation of a modular multiplication whose operands have more digits than there are bit slices in a long number calculating unit.
Modular multiplication is a central operation in modular exponentiation, which is commonly used in cryptography. For example, as shown in FIG. 2a, a key pair is generated in public key cryptography, i.e. in asymmetric cryptography such as the RSA method. The key pair consists of a public key e and a private key d. The private key is known only to one entity. The public key belongs to this entity but is provided to another entity that wants to send, for example, encrypted data to the entity holding the private key. As shown in FIG. 2a, an unencrypted message M is encrypted to an encrypted message C by a so-called modular exponentiation, in which the message is raised to the power of the public key e and the result is reduced modulo the modulus N, which is also publicly known. For decryption, the same operation is performed with the private key d as the exponent, so that the entity that holds the private key, and that originally distributed the public key to the other entity, recovers the plain text message M.
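The encryption and decryption described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the primes p and q, the helper names encrypt and decrypt, and the tiny parameter sizes are not taken from the text, and real RSA additionally uses padding and far larger moduli.

```python
# Toy RSA parameters, far too small for real use (illustrative assumption).
p, q = 61, 53
N = p * q                    # public modulus
phi = (p - 1) * (q - 1)      # Euler's totient of N
e = 17                       # public key (exponent), coprime to phi
d = pow(e, -1, phi)          # private key (exponent); needs Python 3.8+


def encrypt(M: int) -> int:
    """C = M^e mod N: modular exponentiation with the public key."""
    return pow(M, e, N)


def decrypt(C: int) -> int:
    """M = C^d mod N: the same operation with the private key."""
    return pow(C, d, N)


M = 65
C = encrypt(M)
print(C, decrypt(C))
```

Both directions use the same operation and differ only in the exponent, which is why the cost of modular exponentiation dominates the cost of the method.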
These public key methods may also be used as signature/verification methods. An entity generates a digital signature by encrypting the message M to be signed with its private key to generate the signature S, as also illustrated in FIG. 2a. Verification is done by the verifying entity subjecting the signature to a modular exponentiation with the public key e of the signing entity to obtain a plain text message M, which may be compared to the plain text message M to which the signature is assigned. If the two match, it may be assumed that the signed document is authentic.
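The signature and verification steps can be sketched in the same way. The numeric values and the function names sign and verify are assumptions chosen for illustration. Practical schemes sign a padded hash of the message rather than the message itself.

```python
# Toy key material (illustrative assumption): N = 61 * 53, e * d = 1 mod 3120.
N, e, d = 3233, 17, 2753


def sign(M: int) -> int:
    """S = M^d mod N: exponentiation with the signer's private key."""
    return pow(M, d, N)


def verify(M: int, S: int) -> bool:
    """Recover the message from S with the public key and compare it to M."""
    return pow(S, e, N) == M


M = 123
S = sign(M)
print(verify(M, S))        # signature matches the message
print(verify(M + 1, S))    # signature does not match a different message
```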
As mentioned above, a cryptographic calculation including a modular exponentiation, such as illustrated in FIG. 2b, is split into several modular multiplications. It is usually preferred to calculate a modular exponentiation by applying modular multiplications consecutively. In particular, due to the increased security requirements for the RSA algorithm, there is an interest in executing modular multiplications with a width of 2048 bits, i.e. with key lengths and/or modulus lengths of 2048 bits.
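One common way of splitting a modular exponentiation into consecutive modular multiplications is the binary square-and-multiply method. The following is a minimal sketch of it. The text does not state which exponentiation scheme is used, so the choice of the right-to-left binary variant and the names mod_mul and mod_exp are assumptions.

```python
def mod_mul(a: int, b: int, n: int) -> int:
    """One modular multiplication: the basic operation of the calculating unit."""
    return (a * b) % n


def mod_exp(base: int, exponent: int, n: int) -> int:
    """Right-to-left square-and-multiply using only modular multiplications."""
    result = 1 % n
    base %= n
    while exponent > 0:
        if exponent & 1:                    # current exponent bit is set
            result = mod_mul(result, base, n)
        base = mod_mul(base, base, n)       # square for the next bit
        exponent >>= 1
    return result


print(mod_exp(65, 17, 3233) == pow(65, 17, 3233))
```

For an exponent of k bits, this loop performs k squarings and at most k further multiplications, each of them a modular multiplication over the full modulus width.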
Generally, in a modular multiplication that is part of a cryptographic calculation, the multiplier A, the multiplicand B and the modulus N all represent parameters of the cryptographic calculation, because the final results, such as the plain text message, the encrypted message, the signature, etc., depend on these parameters.
As already mentioned, there is an interest in steadily increasing the key lengths of public key cryptography, because this makes it possible to keep so-called brute force attacks infeasible even with increasingly fast processors. The effort of a brute force attack grows with the key length, so that longer keys require more complex brute force attacks which, with currently available computers, take so much time that a cryptographic algorithm may be considered safe. What is problematic with increasing key lengths, however, is that the key length a crypto co-processor in a chip card or a computer (for example in a TPM module) can handle is limited by the long number calculating unit included in this crypto co-processor. Such a long number calculating unit is shown, for example, in FIG. 4c, which illustrates a so-called bit-slice structure of a long number calculating unit.
In the embodiment shown in FIG. 4c, each bit slice includes an arithmetic unit, which may, for example, be a one-bit full adder that receives a carry from a lower bit slice and outputs a carry to a higher bit slice. Furthermore, at least one register is associated with each bit slice, although it is preferred to associate several registers with it, for example two or, even better, five. In a currently existing crypto co-processor with 1408 bit slices, each bit slice includes five registers, i.e. register Z, register C, register N, register CR0 and register CR4, as indicated in the left subimage in FIG. 4a. In that case, the processor operates in long mode. With this number of bit slices, the processor is well suited to RSA calculations with key lengths of 1024 bits, since a calculating unit with only 1024 bit slices would not be quite sufficient for a 1024-bit key length. The calculating unit with 1408 bit slices can also handle slightly longer key lengths, but there should always be somewhat more bit slices than key bits to be able to compensate for certain overflow or underflow situations.
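The carry chain between bit slices can be modelled in software as a ripple-carry adder. The sketch below is a simplified model, not a description of the actual hardware: the function names and the purely sequential carry propagation are assumptions.

```python
def full_adder(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """One-bit full adder: returns (sum bit, carry out)."""
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out


def bit_slice_add(x: int, y: int, num_slices: int) -> tuple[int, int]:
    """Add x and y on num_slices bit slices; returns (result, final carry)."""
    carry = 0
    result = 0
    for i in range(num_slices):             # slice 0 is the lowest bit slice
        a = (x >> i) & 1
        b = (y >> i) & 1
        s, carry = full_adder(a, b, carry)  # carry moves to the next higher slice
        result |= s << i
    return result, carry


print(bit_slice_add(5, 3, 4))    # fits into four slices
print(bit_slice_add(15, 1, 4))   # the carry leaves the highest slice
```

A final carry of 1 means the true sum no longer fits into the available slices, which is why a unit should have somewhat more bit slices than key bits.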
The calculating unit 40 shown in FIG. 4b may be provided with data and/or flow sequences and/or controlled by a controller 41. Furthermore, there is a register configuration means 42, which may reconfigure the registers of the calculating unit, i.e. the five registers in long mode in this embodiment, into ten registers in short mode. Each long mode register of a certain length thus yields two short registers of half that length, so that two N registers, two C registers, two Z registers, and one each of the registers CR0, CR2, CR4 and CR6 are created. Each bit slice still has an arithmetic unit, for example a one-bit full adder, but in short mode it has twice the number of registers compared to the long mode situation shown in FIG. 4c.
If the crypto co-processor with 1408 bit slices is now to perform calculations with RSA key lengths of, for example, 2048 bits, this is no longer easily possible, because there are not enough bit slices.
It is apparent that, although an increase in key lengths is very desirable from a security point of view, each increase in key lengths means that existing co-processors are no longer readily usable. New, longer calculating units would thus have to be developed again and again, which costs development time and money.
In order to avoid this, methods have been developed with which larger numbers may be processed on smaller calculating units. For example, there are methods for doubling a calculating unit in software. One such method is the calculation of the modular multiplication using the Chinese Remainder Theorem (CRT), as described in section 14.5 on pages 610-613 of “Handbook of Applied Cryptography”, A. Menezes, P. van Oorschot, S. Vanstone, 1996. Generally, a modular exponentiation with a long modulus is split into two modular exponentiations with a short modulus using the Chinese Remainder Theorem, and the two results are then combined. In that way, a calculating unit may, so to speak, be doubled “software-wise”.
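The splitting of one long exponentiation into two short ones can be sketched as follows. The recombination shown here uses Garner's formula, which is a common choice but is an assumption, as are the function name crt_mod_exp and the toy primes. The sketch requires that the prime factors p and q of the modulus are known.

```python
def crt_mod_exp(M: int, d: int, p: int, q: int) -> int:
    """Compute M^d mod (p*q) via two exponentiations with the short moduli p, q."""
    dp = d % (p - 1)             # reduced exponent for the modulus p
    dq = d % (q - 1)             # reduced exponent for the modulus q
    q_inv = pow(q, -1, p)        # q^-1 mod p, needs Python 3.8+
    mp = pow(M % p, dp, p)       # short exponentiation modulo p
    mq = pow(M % q, dq, q)       # short exponentiation modulo q
    h = (q_inv * (mp - mq)) % p  # Garner recombination step
    return mq + h * q


p, q, d = 61, 53, 2753
C = 2790
print(crt_mod_exp(C, d, p, q) == pow(C, d, p * q))
```

Each of the two exponentiations only handles numbers of about half the length of the full modulus, which is what allows a calculating unit of half the width to be used.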
However, this concept only allows doubling, which is inconvenient when doubling the key length is not actually required, for example when the key lengths to be used are only about 50% larger than the architectural calculating unit length, i.e. the number of bit slices. If such 100% doubling algorithms are used when key lengths only 50% larger are to be processed, the doubled calculating unit is utilized only to (100% + 50%)/2 = 75%. In principle, hardware resources are thus wasted.
In addition to the CRT doubling method, there are further calculating unit doubling algorithms, such as the Montgomery multiplication, a multiplication according to Karatsuba-Ofman with subsequent reduction by means of, for example, the Barrett reduction, or the doubling method using the MultModDiv operation, as discussed, for example, in German patent DE 10219158 B4.
Considering, for example, FIG. 4d, a calculating unit for a 1024-bit key length is indicated at 43. Software doubling using, for example, the Chinese Remainder Theorem or one of the further methods mentioned above is useful when 2048 bits are required, as illustrated in block 44 in FIG. 4d. In this case, the whole calculating unit is used, i.e. no unused bit slices remain. However, if a key length of, for example, 1536 bits is sufficient, software doubling using, for example, the Chinese Remainder Theorem (CRT) will result in 2×768 bits being required. The remaining 2×256 bits would remain unused in this case.
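The utilization figures for software doubling can be reproduced with a short calculation. The function name doubling_utilization and its return format are assumptions made for this sketch.

```python
def doubling_utilization(key_bits: int, unit_bits: int) -> tuple[int, int, float]:
    """Return (bits used per half, total unused bits, utilization) for doubling."""
    half_bits = -(-key_bits // 2)        # ceiling of key_bits / 2
    if half_bits > unit_bits:
        raise ValueError("key too long even for the doubled calculating unit")
    unused_bits = 2 * (unit_bits - half_bits)
    utilization = key_bits / (2 * unit_bits)
    return half_bits, unused_bits, utilization


print(doubling_utilization(2048, 1024))   # full use of the doubled unit
print(doubling_utilization(1536, 1024))   # 2 x 768 bits used, 2 x 256 unused
```

For a key that is 50% longer than the unit length, the result is the 75% utilization mentioned above.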
Quite apart from the utilization of the calculating unit, and also when the register lengths of a calculating unit cannot be divided, a calculating unit always has a maximum processing capacity that also covers the intermediate results of a calculation. In other words, the word length that the calculating unit can process is limited. This word length is, for example, the total length of the register memories, i.e. of the register cells associated with the individual bit slices. Alternatively, the maximally processable word length is an effective word length larger than the register length, achieved by various software/hardware concepts. In general, however, this maximally processable word length applies not only to the input operands and the result of the calculation, but also to all intermediate results that are stored in a memory and required again later on the path from the input operands to the result.
If, within a calculation, an intermediate result becomes larger than the maximally processable word length, the processor is slowed down significantly, because auxiliary routines have to be started that take special measures for this intermediate result, such as a further enlargement of the calculating unit by further increasing the number of operand fragments. Alternatively, such a situation may even result in an error and/or in an output informing the user that an overflow has occurred, so that the calculation desired by the user cannot be performed and/or smaller input operands have to be supplied.
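The effect of an oversized intermediate result can be illustrated with a small check on the bit length of a product. This is a hypothetical model only: the constant MAX_WORD_BITS, the function checked_mul and the use of an exception to signal the overflow are assumptions, not a description of a real co-processor.

```python
MAX_WORD_BITS = 1408   # hypothetical maximally processable word length


def checked_mul(a: int, b: int, max_bits: int = MAX_WORD_BITS) -> int:
    """Multiply a and b; report an overflow if the intermediate result is too long."""
    product = a * b
    if product.bit_length() > max_bits:
        raise OverflowError(
            f"intermediate result needs {product.bit_length()} bits, "
            f"only {max_bits} are available"
        )
    return product


small = (1 << 700) - 1
print(checked_mul(small, small).bit_length())   # product still fits

large = (1 << 1024) - 1
try:
    checked_mul(large, large)                   # product is about 2048 bits long
except OverflowError as err:
    print("overflow:", err)
```

The product of two operands can be up to twice as long as the operands themselves, so the limit may be exceeded by an intermediate result even when all input operands fit.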
Even if such an “overflow” does not result in a computing abort or in a calculating error, it still results in a significant slowing down when corresponding service routines have to be started to manage this exceptional situation.
Conventionally, there is thus a lack of a more flexible and faster calculating unit concept.