1. Technical Field
The present invention relates to an apparatus comprising a plurality of arithmetic logic units.
2. Description of Related Art
A number of encryption/decryption algorithms exist which use simple integer arithmetic and logic instructions. These algorithms are typically characterized in that they contain rounds of instructions and often have a high dependency on the result of the previous operation. It is desirable to implement these algorithms in a manner which is both fast and relatively flexible.
An example of such an algorithm is the Multi-2 algorithm as described in ISO 9979/009 and U.S. Pat. No. 4,982,429 (the disclosures of which are hereby incorporated by reference). The Multi-2 algorithm performs enciphering and deciphering through bit-shifting operations.
Encryption/Decryption algorithms like Multi-2 typically use rounds. A round is a series of steps to be carried out on a block of data to be encrypted or decrypted. The round of instructions is repeated multiple times on the block of data before that block can be considered fully encrypted or decrypted.
Each step in the round may contain multiple instructions. Each instruction in the step is generally dependent on the result of the previous instruction in that step. The next step in the round will be dependent on the result of the last instruction in the previous step. In this manner encryption/decryption algorithms exhibit a high level of dependence.
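The dependency pattern described above may be sketched as follows. The round function shown is a hypothetical illustration, not the actual Multi-2 round, but it exhibits the same chained dependence: each operation consumes the previous operation's result, and each round consumes the previous round's output.

```c
#include <stdint.h>

/* Hypothetical round step: each operation depends on the result of the
 * one before it, so the operations cannot be reordered or parallelized. */
static uint32_t round_step(uint32_t x, uint32_t key)
{
    uint32_t t = x + key;           /* depends on x          */
    t = (t << 1) | (t >> 31);       /* depends on previous t */
    t = t ^ x;                      /* depends on previous t */
    return t;
}

/* The block is only fully encrypted after all rounds have run;
 * each round depends on the output of the previous round. */
static uint32_t encrypt_block(uint32_t block, const uint32_t *keys, int rounds)
{
    for (int r = 0; r < rounds; r++)
        block = round_step(block, keys[r]);
    return block;
}
```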
Previously, these types of algorithms have been implemented in fixed hardware or on generic microprocessors.
Fixed hardware is designed for the task of running a particular algorithm. Hardware designed for this single task is generally very small and fast. However, the algorithm is fixed, and therefore fixed hardware can only be used for a single algorithm.
Generic microprocessors can be used to implement these algorithms, although they are much slower than fixed hardware. However, implementing these algorithms on a generic microprocessor is desirable because it allows greater flexibility: various algorithms may be implemented through variations in the software. Generic microprocessors implement these algorithms using an instruction pipeline.
For purposes of discussion, reference is made to the “DLX” architecture for a microprocessor as discussed by Hennessy & Patterson in their book “Computer Architecture—a Quantitative Approach” (the disclosure of which is incorporated by reference).
FIG. 1 shows the DLX pipeline. The pipeline consists of five stages, namely Fetch 1, Decode 2, Execute 3, Memory 4 and Write-back 5. It will be appreciated that FIG. 1 only shows those parts of the pipeline of interest, and a number of parts have been omitted for clarity.
The Fetch stage 1 consists of an output of an Instruction Memory 20 connected to an Instruction Fetch/Instruction Decode (IF/ID) block 30. The IF/ID block 30 is in both the Fetch 1 and Decode 2 stages. On the Decode stage 2 side, IF/ID block 30 is connected to a Registers block 40. The Registers block 40 is connected to an Instruction Decode/Execute (ID/EX) block 50 which is in both the Decode 2 and Execute 3 stages. The Registers block 40 is also connected to a Memory/Write-back (MEM/WB) block 90 which is in both the Memory 4 and Write-back 5 stages.
The Execute stage 3 side of the ID/EX block 50 has two operand outputs R1 and R2 and an operation code output opcode. The Execute stage 3 also contains an ALU 60. The ALU 60 has two operand inputs R1 and R2 from ID/EX block 50 and an operation code input opcode from ID/EX block 50. The output of ALU 60 is input into an Execute/Memory (EX/MEM) block 70 which is in both the Execute 3 and Memory 4 stage of the pipeline.
The Memory stage 4 side of the EX/MEM block 70 is connected to a Data Memory block 80. Data Memory block 80 is further connected to the MEM/WB block 90, which as previously discussed, is connected to the Registers block 40.
As mentioned above, the pipeline of FIG. 1 has five stages. Each stage is capable of operating simultaneously with the other stages. The operation of a typical pipeline as depicted in FIG. 1 is described below with reference to FIG. 2.
FIG. 2 shows the progression of instructions through the five stages of the pipeline depicted in FIG. 1 for each clock cycle. The pipeline begins at clock cycle 0 with four instructions A, B, C and D waiting in the instruction memory.
The Execute stage 3 of pipeline depicted in FIG. 1 consists of a single Arithmetic Logic Unit 60 with two operand inputs R1 and R2 and an opcode input. The Execute stage is only capable of carrying out one operation per clock cycle and instructions A, B, C and D only contain one operation each.
In clock cycle 1, instruction A is fetched from the instruction memory 20 during the Fetch stage 1. In clock cycle 2, A moves to the Decode stage 2 and B is fetched from memory 20. In the Decode stage 2, the operation code is extracted from A and its operands are determined. The operation code determines an operation which is to be carried out on the operands of A. The ALU 60 is typically capable of carrying out operations such as AND, OR, NOT, XOR, addition, subtraction and bit shifting.
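The operation set attributed to ALU 60 may be sketched as a simple two-operand function. The enumeration names and interface below are assumptions for illustration, not the DLX encoding itself:

```c
#include <stdint.h>

/* Sketch of the single-cycle operation set described for ALU 60. */
typedef enum { OP_AND, OP_OR, OP_NOT, OP_XOR, OP_ADD, OP_SUB, OP_SHL } opcode_t;

static uint32_t alu(opcode_t op, uint32_t r1, uint32_t r2)
{
    switch (op) {
    case OP_AND: return r1 & r2;
    case OP_OR:  return r1 | r2;
    case OP_NOT: return ~r1;        /* unary: r2 is ignored */
    case OP_XOR: return r1 ^ r2;
    case OP_ADD: return r1 + r2;
    case OP_SUB: return r1 - r2;
    case OP_SHL: return r1 << r2;   /* r2 assumed < 32      */
    }
    return 0;
}
```

One such call corresponds to one trip through the Execute stage: a single opcode applied to two register operands per clock cycle.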
In the third clock cycle, A moves to the Execute stage 3, B moves to the Decode stage 2 and C is fetched from the instruction memory. In the Decode stage 2, B's operation code is extracted and its operands are determined. In the Execute stage 3, A's opcode and two operands are input into ALU 60. ALU 60 performs an operation on the operands and outputs the result.
In the fourth and fifth clock cycles, A moves to the Memory stage 4 and the Write-back stage 5, respectively. The remaining instructions progress through the pipeline in the same manner as described above until finally the result of instruction D is stored in the Registers block 40 during the Write-back stage 5 in the eighth clock cycle.
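The timing described above can be captured in a small arithmetic model. It assumes an ideal pipeline with no stalls or hazards, so that the instruction with 0-based index i occupies stage s during cycle i + s + 1 and completes its Write-back in cycle i + 5:

```c
/* Toy timing model of the five-stage pipeline of FIG. 1 (Fetch, Decode,
 * Execute, Memory, Write-back), assuming no stalls or hazards. */
enum { STAGES = 5 };

/* Cycle in which the Write-back of the i-th instruction (0-based)
 * completes. */
static int completion_cycle(int instr_index)
{
    return instr_index + STAGES;
}

/* Instructions A, B, C and D are indices 0..3; D therefore completes
 * in cycle 3 + 5 = 8, matching the progression shown in FIG. 2. */
```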
The Multi-2 algorithm was designed to be implemented on a generic microprocessor. The algorithm may consist, for example, of N steps, some of which comprise multiple instructions, so that the algorithm may consist of N×M instructions in total. As discussed with reference to FIG. 2, the Execute stage 3 of the pipeline of FIG. 1 has only one ALU 60 and can only carry out one operation per clock cycle. Because this algorithm was designed for implementation on a generic processor such as that shown in FIG. 1, each instruction in the algorithm carries out only one operation.
FIG. 3 depicts an example of the instruction encoding of the pipeline of FIG. 1. Bits 31 to 26 of the encoded instruction of FIG. 3 contain an operation code OPCODE, bits 25 to 21 contain a destination register Rz, bits 20 to 16 contain a first operand Ra, bits 15 to 11 contain a second operand Rb and bits 10 to 0 are reserved for extensions.
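The field layout of FIG. 3 can be expressed as shift-and-mask extraction helpers. The function names below are illustrative only; the bit positions and widths are those stated for FIG. 3:

```c
#include <stdint.h>

/* Field extraction for the encoding of FIG. 3:
 * bits 31-26 OPCODE, 25-21 Rz, 20-16 Ra, 15-11 Rb, 10-0 reserved. */
static uint32_t opcode_of(uint32_t insn) { return (insn >> 26) & 0x3Fu; }
static uint32_t rz_of(uint32_t insn)     { return (insn >> 21) & 0x1Fu; }
static uint32_t ra_of(uint32_t insn)     { return (insn >> 16) & 0x1Fu; }
static uint32_t rb_of(uint32_t insn)     { return (insn >> 11) & 0x1Fu; }
```

Note that a 6-bit OPCODE field permits at most 64 distinct operations, and each 5-bit register field addresses one of 32 registers.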
Each instruction of the algorithm is fetched from the instruction memory and decoded by Decode stage 2. In Decode stage 2, the operation code and two operands are extracted from the instruction before it is executed. Each operation must move through all the stages of the pipeline of FIG. 1 similar to that of FIG. 2.
The Multi-2 algorithm implemented in this manner may take an average of seven instructions per round. Considering that a typical implementation of the algorithm requires 32 rounds to encrypt an 8-byte block, over 200 instructions are required to pass through the pipeline to encrypt the 8-byte block.
Several suggestions have been proposed in order to speed up the implementation of encryption/decryption algorithms, such as the above, on generic microprocessors.
Previously it has been suggested that the performance of a microprocessor may be improved by increasing the clock speed of the processor. Increasing the clock speed may be disadvantageous in terms of increased power dissipation and the need to interface across heterogeneous clock domain boundaries.
It has also been suggested that performance may be improved by parallelizing the instructions. When instructions are parallelized, multiple pipelines execute multiple instructions simultaneously. However, algorithms with high dependency do not benefit from this because each calculation still depends on the result of the previous operation.
It has also been suggested that the inclusion of additional, customized instructions will increase performance. For example, Multi-2 may be improved with the inclusion of multi-operation instructions such as "Rotate A by 1 bit, add to A, and subtract 1" and "Rotate A by two bits, add to A, and add 1". However, these instructions are specific to a particular algorithm, and any new algorithms will have to be implemented with the traditional single-operation instructions.
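As a sketch, each of the two quoted multi-operation instructions collapses a three-instruction dependent chain into a single operation. The helpers below are illustrative only; a left rotation and a 32-bit register width are assumptions, since the text does not specify either:

```c
#include <stdint.h>

/* "Rotate A by 1 bit, add to A, and subtract 1" as one operation
 * (rotation direction and 32-bit width are assumed). */
static uint32_t rot1_add_sub1(uint32_t a)
{
    uint32_t r = (a << 1) | (a >> 31);  /* rotate A left by 1 bit */
    return r + a - 1;                   /* add to A, subtract 1   */
}

/* "Rotate A by two bits, add to A, and add 1" as one operation. */
static uint32_t rot2_add_add1(uint32_t a)
{
    uint32_t r = (a << 2) | (a >> 30);  /* rotate A left by 2 bits */
    return r + a + 1;                   /* add to A, add 1         */
}
```

On the single-ALU pipeline of FIG. 1, each of these would otherwise occupy three Execute-stage slots in strict sequence, since every sub-operation depends on the previous result.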