1. Field of the Invention
The present invention relates to a method and system for performing permutations of a sequence of bits in a programmable processor.
2. Description of the Related Art
The need for secure information processing has increased with the increasing use of the public internet and wireless communications in e-commerce, e-business and personal use. Typical use of the internet is not secure. Secure information processing typically includes authentication of users and host machines, confidentiality of messages sent over public networks, and assurances that messages, programs and data have not been maliciously changed. Conventional solutions have provided security functions by using different security protocols employing different cryptographic algorithms, such as public key, symmetric key and hash algorithms.
For encrypting large amounts of data symmetric key cryptography algorithms have been used, see Bruce Schneier, “Applied Cryptography”, 2nd Ed., John Wiley & Sons, Inc., 1996. These algorithms use the same secret key to encrypt and decrypt a given message, and encryption and decryption have the same computational complexity. In symmetric key algorithms, the cryptographic techniques of “confusion” and “diffusion” are synergistically employed. “Confusion” obscures the relationship between the plaintext (original message) and the ciphertext (encrypted message), for example, through substitution of arbitrary bits for bits in the plaintext. “Diffusion” spread the redundancy of the plaintext over the ciphertext, for example through permutation of the bits of the plaintext block. Such bit-level permutations have the drawback of being slow when implemented with conventional instructions available in microprocessors and other programmable processors.
Bit-level permutations are particularly difficult for processors, and have been avoided in the design of new cryptography algorithms, where it is desired to have fast software implementations, for example in the Advanced Encryption Standard, as described in NIST, “Announcing Request for Candidate Algorithm Nominations for the Advanced Encryption Standard (AES)”. Since conventional microprocessors are word-oriented, performing bit-level permutations is difficult and tedious. Every bit has to be extracted from the source register, moved to its new location in the destination register, and combined with the bits that have already been moved. This requires 4 instructions per bit (mask generation, AND, SHIFT, OR), and 4n instructions to perform an arbitrary permutation of n bits. Conventional microprocessors, for example Precision Architecture (PA-RISC) have been described to provide more powerful bit-manipulation capabilities using EXTRACT and DEPOSIT instructions, which can essentially perform the four operations required for each bit in 2 instructions (EXTRACT, DEPOSIT), resulting in 2n instructions for any arbitrary permutation of n bits, see Ruby Lee, “Precision Architecture”, IEEE Computer, Vol. 22, No. 1, pp. 78–91, January 1989. Accordingly, an arbitrary 64-bit permutation could take 128 or 256 instructions on this type of conventional microprocessor. Pre-defined permutations with some regular patterns have been implemented in fewer instructions, for example, the permutations in DES, as described in Bruce Schneier, “Applied Cryptography”, 2nd Ed., John Wiley & Sons, Inc., 1996.
Conventional techniques have also used table lookup methods to implement fixed permutations. To achieve a fixed permutation of n input bits with one table lookup, a table with 2n entries is used with each entry being n bits. For a 64-bit permutation, this type of table lookup would use 267 bytes, which is clearly infeasible. Alternatively, the table can be broken up into smaller tables, and several table lookup operations could be used. For example, a 64-bit permutation could be implemented by permuting 8 consecutive bits at a time, then combining these 8 intermediate permutations into a final permutation. This method requires 8 tables, each with 256 entries, each entry being 64 bits. Each entry has zeros in all positions, except the 8 bit positions to which the selected 8 bits in the source are permuted. After the eight table lookups done by 8 LOAD instructions, the results are combined with 7 OR instructions to get the final permutation. In addition, 8 instructions are needed to extract the index for the LOAD instruction, for a total of 23 instructions. The memory requirement is 8*256*8=16 kilobytes for eight tables. Although 23 instructions is less than the 128 or 256 instructions used in the previous method, the actual execution time can be much longer due to cache miss penalties or memory access latencies. For example, if half of the 8 Load instructions miss in the cache, and each cache miss takes 50 cycles to fetch the missing cache line from main memory, the actual execution time is more than 4*50=200 cycles. Accordingly, this method can be longer than the previously described 128 cycles using EXTRACT and DEPOSIT. This method also has the drawback of a memory requirement of 16 kilobytes for the tables.
Permutations are a requirement for fast processing of digital multimedia information, using subword-parallel instructions, more commonly known as multimedia instructions, as described in Ruby Lee, “Accelerating Multimedia with Enhanced Micro-processors”, IEEE Micro, Vol. 15, No. 2, pp.22–32, April 1995, and Ruby Lee, “Subword Parallelism in MAX-2”, IEEE Micro, Vol. 16, No. 4, pp.51–59, August 1996. The MAX-2 general-purpose PERMUTE instructions can do any permutation, with and without repetitions, of the subwords packed in a 64-bit register. However, it is only defined for 16-bit subwords. MIX and MUX instructions have been implemented in the IA-64 architectures, which are extensions to the MIX and PERMUTE instructions of MAX-2, see Intel Corporation, “IA-64 Application Developer's Architecture Guide”, Intel Corporation, May, 1999. The IA-64 uses MUX instruction, which is a fully general permute instruction for 16-bit subwords, with five new permute byte variants. A VPERM instruction has been used in an AltiVec extension to the Power PC™ available from IBM Corporation, Armonk, N.Y., see Motorola Corporation, “‘AltiVec Extensions to PowerPC’ Instruction Set Architecture Specification”, Motorola Corporation, May 1998. The Altivec VPERM instruction extends the general permutation capabilities of MAX-2's PERMUTE instruction to 8-bit subwords selected from two 128-bit source registers, into a single 128-bit destination register. Since there are 32 such subwords from which 16 are selected, this requires 16*lg32=80 bits for specifying the desired permutation. This means that VPERM has to use another 128-bit register to hold the permutation control bits, making it a very expensive instruction with three source registers and one destination register, all 128 bits wide.
It is desirable to provide significantly faster and more economical ways to perform arbitrary permutations of n bits, without any need for table storage, which can be used for encrypting large amounts of data for confidentiality or privacy.