1. Field of the Invention
The present invention relates to calculating units, in particular to secure and fast calculating units as may be employed in cryptography processors, and more particularly to adder circuits for cryptography processors.
2. Description of Related Art
DE 3631992 C2 discloses a cryptography processor for executing the RSA public key cryptosystem. Here, a modular exponentiation with a base, an exponent, and a module is broken down into a plurality of three operand additions. The three operands include a module operand N, a multiplicand operand C, and an intermediate result operand Z. By correspondingly shifting/weighting the three operands before the addition, a multiplication/reduction accelerated by a multiplication look-ahead algorithm and a reduction look-ahead algorithm may be performed.
FIG. 7 shows a section from the adder unit which so to speak illustrates the heart of the cryptography processor illustrated in DE 3631992 C2. In detail, FIG. 7 shows two consecutive bit slices to calculate the two sum bits i−1 and i, that is from the three input operand bits Ci, Ni, Zi or Ci−1, Ni−1, Zi−1 or from Ci−2, Ni−2, and Zi−2, respectively.
The three operand addition of C, N, Z is broken down into a two-stage operation, regarded on the bit level. For executing the first stage of the operation, a three-bit half adder 80 is provided, wherein each three-bit half adder 80 has a downstream two-bit full adder 81. The three-bit half adder provides two output bits xi, yi, which are fed into the downstream two-bit full adder as shown in FIG. 7. In particular, in each two-bit full adder 81 of a bit slice, the low order bit yi at the output of the three-bit half adder is combined with the high order bit (xi−1) of the three-bit half adder stage which is lower by one order, in order to calculate a sum bit 82 and a carry bit 83. The three operand addition is thus broken down into two sections. In the first section, a sum of the three operand bits is formed at each binary digit. The sum may take the values 0 to 3 (in decimal notation) and may therefore be represented in binary by the two bits x, y. As the sum is formed at each digit, two new numbers may be put together from the two sum bits.
In the second section, the two numbers are added in a conventional way by the two-bit full adder 81. The interconnection such that a two-bit full adder always receives two output bits from two different three-bit half adders as an input leads to a lengthening of the calculating unit by one bit.
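The two-stage scheme described above can be illustrated by a small behavioral model. The following is only a sketch under stated assumptions: the function names are hypothetical, operands are modeled as non-negative Python integers, and x is taken as the high order and y as the low order output bit of the three-bit half adder. It is not a description of the circuit in DE 3631992 C2 itself.

```python
def three_bit_half_adder(c, n, z):
    """First stage: add three operand bits; return (x, y) = (high bit, low bit)."""
    s = c + n + z              # value 0..3
    return s >> 1, s & 1


def two_bit_full_adder(a, b, carry_in):
    """Second stage: conventional full adder; return (sum bit, carry bit)."""
    s = a + b + carry_in
    return s & 1, s >> 1


def three_operand_add(c, n, z, width):
    """Add three width-bit operands using the two stages (hypothetical model)."""
    # Stage 1: per bit slice, compress three bits into the two bits x_i, y_i.
    x = y = 0
    for i in range(width):
        xi, yi = three_bit_half_adder((c >> i) & 1, (n >> i) & 1, (z >> i) & 1)
        x |= xi << i
        y |= yi << i

    # Stage 2: slice i combines y_i with x_(i-1); this needs one extra slice.
    result = carry = 0
    for i in range(width + 1):
        yi = (y >> i) & 1
        x_prev = (x >> (i - 1)) & 1 if i > 0 else 0
        s, carry = two_bit_full_adder(yi, x_prev, carry)
        result |= s << i
    return result | (carry << (width + 1))
```

For example, `three_operand_add(5, 6, 7, 3)` yields 18 in this model. The second loop runs over `width + 1` slices, which corresponds to the lengthening of the calculating unit by one bit mentioned above.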
The three operand adder shown in FIG. 7 is problematic in that neither protection of the input operands C, N, Z nor protection of the “intermediate operands” x, y is provided. This is a problem in particular in the normal case in which all circuits are implemented in CMOS logic, where switching a bit from 0 to 1 or from 1 to 0, respectively, leads to a current pulse which starts when a bit state is switched. As is known, CMOS circuits consume no current in the static state. During switching, however, they consume current. This current consumption can be determined by a power analysis. In this way, it is basically possible to derive information about C, N, Z, for example in order to draw conclusions about the secret key which is used in an RSA operation.
An attacker could, for example, determine by detecting the current profile, whether a switching from 0 to 1 or from 1 to 0 took place. With an unsecured circuit, a switching of a bit would always occur when a current peak may be detected in the current profile. An attacker may thus understand the complete switching performance of a calculating unit by use of the current profile. The attacker would then only need one single bit in a complete sequence to be able to reconstruct whether a switching from a “1” to a “0” took place, or vice versa.
Certain CMOS circuits further have the characteristic that switching from 0 to 1 results in a different current consumption than switching from 1 to 0. In this case, by comparing two different current peaks, an attacker directly sees which bits were processed in the calculating unit.
As a defense measure against such power analysis attacks, it was proposed to use a so-called dual rail technology. In the dual rail technology, in principle each signal path is implemented twice. On the first signal path, for example, a signal x is processed normally. On the second signal path, which is integrated in the same chip, not the signal x itself but its complementary signal is processed. As a result, whenever a transition, for example from 0 to 1, takes place on the signal line, a complementary transition takes place on the other line, i.e. the second “rail”. For each bit transition, both transitions thus always take place, one on each line. Consequently, for circuits in which transitions from 0 to 1 and from 1 to 0 consume a different amount of current, it can no longer be determined whether a transition from 0 to 1 or from 1 to 0 took place. The reason for this is that the current profile for each circuit transition contains a peak which is the superposition of the current consumptions of the two rails. The dual rail technology provides a high security, with the disadvantage, however, that all circuits usually have to be implemented twice and that the power consumption of the complete circuit is also twice as high. The circuit is, however, to a certain extent immune against power analysis attacks.
If only a dual rail technology is used, the current profile may still reveal whether a certain bit changed from 0 to 1 or from 1 to 0, respectively, or remained the same as compared to the preceding clock cycle. In case of a bit transition, a power peak may be seen. No power peak is seen, however, when a bit remained for example at 1 or at 0 from one cycle to the next, i.e. did not change. For defending against attacks which are based on this effect, it was proposed to supplement the dual rail technology by a precharge/predischarge mode. The circuit is operated alternately in a data mode and in a preparation mode (precharge/predischarge). Each data cycle has a preceding preparation cycle in which, in case of precharge, both rails, i.e., for example, x and its complement, are precharged to “1”, in order to then, in the data mode, feed complementary input signals to be processed into both rails. As a result, exactly the same number of transitions always takes place from a data cycle to a preparation cycle or from a preparation cycle to a data cycle. If the preparation mode is implemented as a predischarge mode, then in the preparation mode all input data is “predischarged” to 0 and not initialized to 1 as in precharge. Then again exactly the same number of transitions takes place from a preparation cycle to a data cycle and vice versa.
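The difference between the three signaling schemes can be made concrete by counting line transitions per clock step for one bit. The following is a simplified sketch, assuming that the observable current peak is proportional to the number of toggling lines; the function names and this toggle-count model are assumptions for illustration and ignore the asymmetry between 0-to-1 and 1-to-0 transitions.

```python
def transitions(prev, nxt):
    """Number of lines whose level differs between two consecutive states."""
    return sum(p != q for p, q in zip(prev, nxt))


def single_rail(bits):
    """One line per bit."""
    return [(b,) for b in bits]


def dual_rail(bits):
    """Two lines per bit: the bit and its complement."""
    return [(b, 1 - b) for b in bits]


def dual_rail_precharge(bits):
    """Each data cycle is preceded by a preparation cycle with both rails at 1."""
    states = []
    for b in bits:
        states.append((1, 1))        # preparation cycle (precharge)
        states.append((b, 1 - b))    # data cycle
    return states


def toggle_profile(states):
    """Toggle count for every step of the state sequence (hypothetical power model)."""
    return [transitions(a, b) for a, b in zip(states, states[1:])]
```

For the bit sequence 0, 1, 1, 0 this model gives the profile `[1, 0, 1]` for single rail and `[2, 0, 2]` for dual rail without precharge, so an unchanged bit is still visible as a missing peak. With precharge, every step shows exactly one toggling line, independent of the data.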
As indicated above, a three operand adder is required for the implementation of modular operations, e.g. addition or multiplication, within the scope of cryptographic algorithms such as RSA or elliptic curves. For the reasons discussed above, these operations have to be performed by the adder unit securely against power attacks. As cryptographic calculations are extremely computing-intense, the adder unit has to have a high performance. As long operands have to be processed in cryptography in particular, wherein the operand length is in a range between 100 and 200 bits for elliptic curves and in a range between 1024 and 2048 bits for RSA, the calculating unit itself has a large bit length in order to meet the speed requirements set for the calculating unit.
In order to make the calculating unit shown in FIG. 7, which shows a section of two bit slices of a larger calculating unit for example having 2300 bit slices, more secure against cryptographic attacks, it is first of all necessary to supply the input operands C, N, Z to the three-bit half adders 80 in dual rail technology.
An even higher security is achieved by applying not only a dual rail technology but a dual rail technology including precharge or predischarge, respectively. Here, a data clock always alternates with a so-called preparation clock. In a data clock, for example, the bit Ci is equal to 0 or 1, while the bit supplied on the “second rail” is the complement of the bit Ci. In a preparation clock trailing the data clock, in case of a precharge as a preparation mode, both lines are charged, for example to Vdd, which might for example correspond to the logic state of “1”. Both dual rail lines thus have the same value in the preparation mode. In case of a predischarge, both lines, Ci and its complement, might be brought to the potential of Vss, which typically is ground potential, wherein ground potential in the present example corresponds to a logical “0”. Of course, Vss may also correspond to a logical “1”. Then, Vdd would correspond to a logical “0”.
Logically, then the bits xi and yi or xi−1 and yi−1, respectively, would also have to be supplied in dual rail technology with precharge to the corresponding downstream 2-bit full adders 81, wherein again always one data clock follows one preparation clock. On the output side, then the carry bits C 83 and the sum bits 82 would also be led out of the 2-bit full adder in dual rail technology with precharge/predischarge.
For optimum security it is not sufficient, however, that the input lines into the elements 80 and 81 and the output lines from elements 80 and 81 are implemented in dual rail technology. Instead, it is further important that the circuits 80, 81 themselves are also implemented as dual rail circuits. This may be achieved by providing the 3-bit half adder 80 and each 2-bit full adder 81 in duplicate, so that the first 3-bit half adder 80 for example calculates using the non-inverted operand bits Ci, Ni, and Zi in order to obtain xi and yi. The second 3-bit half adder 80 for the same bit slice would then operate using the inverted versions of the bits Ci, Ni, and Zi, for example to obtain the inverted versions of the output bits xi and yi. Of course, any cross-combinations are possible, so that the first 3-bit half adder is implemented to calculate the inverted output bits from the non-inverted input bits and vice versa.
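That a second, identical adder fed with inverted inputs delivers the inverted outputs can be checked exhaustively on the logic level. The sketch below is a behavioral check only, with hypothetical function names; it says nothing about the transistor-level implementation of elements 80 and 81.

```python
from itertools import product


def three_bit_half_adder(c, n, z):
    """Return (x, y): high and low bit of the sum of three operand bits."""
    s = c + n + z
    return s >> 1, s & 1


def two_bit_full_adder(a, b, carry_in):
    """Return (sum bit, carry bit) of a conventional full adder."""
    s = a + b + carry_in
    return s & 1, s >> 1


def inverted_inputs_give_inverted_outputs(adder):
    """Check over all 8 input combinations that complementing every input
    bit complements every output bit of the given three-input adder."""
    for bits in product((0, 1), repeat=3):
        out = adder(*bits)
        out_inv = adder(*(1 - b for b in bits))
        if out_inv != tuple(1 - o for o in out):
            return False
    return True
```

Both `inverted_inputs_give_inverted_outputs(three_bit_half_adder)` and `inverted_inputs_give_inverted_outputs(two_bit_full_adder)` hold in this model. The reason is that the sum of three inverted bits equals 3 minus the original sum, and 3 minus a two-bit value is its bitwise complement.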
The same would have to be performed for the 2-bit full adder 81 in order to obtain an optimally secure circuit in a bit slice of a long-number calculating unit for an RSA calculation with module lengths of 1024 bits, 2048 bits, etc. Such calculating units are required for performing modular operations, like e.g. addition and multiplication, for example within the scope of cryptographic algorithms like RSA or elliptic curves. A 3 operand addition performed by the 3 operand adder, shown in part in FIG. 7, consists in adding a first addend C to a second addend Z and finally to add or subtract the module N depending on the look-ahead specification. These operations have to be performed by the adder unit secure with respect to power attacks, as it was discussed.
It is obvious that cryptographic calculations are extremely computing-intense. Thus, apart from security, the calculating unit is also required to have a high performance, in that it needs as little calculating time as possible or, in case of limited resources, for example on a chip card, may still perform highly secure operations within an acceptable calculating time.
As the arithmetic of cryptographic operations is a long-number arithmetic, the calculating unit itself, as discussed, has a large bit length in order to reach the required performance at all. Thus, it is assumed that each bit slice corresponds to a digit of a module, in that all digits are “mapped” onto corresponding hardware bit slices. In most cases, only this maximally parallel implementation guarantees sufficient performance.
With regard to the requested maximal parallelism on the one hand and, on the other hand, with regard to the costs for the chip which includes such a calculating unit, and also with regard to other limits applied to the chip area, it is strictly necessary to design the calculating unit to be very space-saving. In principle, a high-performance calculating unit is thus required which is also secure and requires a minimal area.
The German patent DE 10307942 B3 discloses a half adder for adding bits of at least two input operands to obtain at least two output bits. Apart from an input stage, the half adder includes a plurality of switching stages and each switching stage has a downstream output stage. In a data mode, the switching stages are operable to either provide a bit or an inverted bit at an output to the output stage which is downstream from a corresponding switching stage. The output stage then changes an output bit as compared to a preceding preparation mode on the basis of the bit received from the corresponding switching stage and supplements the complementary bit from the preceding preparation mode. This half adder circuit already is space-efficient as it may operate without a complete dual rail logic and is still secure against cryptographic attacks. In particular, the input stage and the output stage are implemented in a complete dual rail circuit technology with precharge or predischarge, while the switching stages for the individual bits are implemented such that they switch a calculating potential through to the output of the switching stage according to a half adder regulation for the output bit depending on the bits of the input operands, wherein the calculating potential at the output represents the output bit or an inverted version of the output bit. This switching stage is not implemented in dual rail technology but operates according to a “one-hot” realization.
Although this circuit already is space-efficient, as the switching stages are not implemented in a complete dual rail logic, and although this circuit is energy-efficient in so far that the switching stage always respectively only calculates the inverted bit or the bit itself, i.e., consumes no current for both bits, it has turned out anyway that there may be problems in the transition from a preparation mode to a data mode.
Thus, the switching stage, depending on the occupation of its inputs, switches two VDD paths through onto the bit or the inverted bit BitQ of its two outputs, so that a logically valid number results. Due to its special one-hot realization, however, the switching stage has internal nodes which are decoupled, i.e., so to speak float, in the preparation mode, i.e., when a precharge operation or a predischarge operation is present on the input side and on the output side. If such an internal node was at a high potential, i.e. charged, in the preceding data mode, then it will not automatically and immediately lose its charge, due to the fact that it is floating in the preparation mode. Thus it may happen that a VDD path is switched e.g. onto the bit line of one of the result bits and that an internal residual charge held by the floating node is switched onto the inverted bit of the same result bit.
By this, a situation results in which the driven bit first has to fight against the residual charge on BitQ. Here, a cross current flows and the circuit becomes slower than necessary.
It further turned out that this effect may be substantially increased by unavoidable coupling capacitances, so that under especially unfavorable conditions not the driven bit but the residual charge so to speak wins and “turns around” the driven bit. As a result, the calculated result may even be wrong.