The invention relates generally to pseudo-random number generation, and more particularly to pseudo-random number generators that execute efficiently on processors having instruction-level parallelism, for generating enciphered data streams.
Many existing cipher designs stem from an era when processors exhibited little or no instruction parallelism or concurrency. The route to achieving the ultimate in performance for software-based encryption algorithms on these processors was to reduce to a minimum the total number of operations required to encrypt each symbol. Examples of algorithms that epitomize this strategy are RC4, SEAL, and WAKE.
However, recent microprocessors gain much of their performance improvement over their predecessors by having a greater degree of instruction-level parallelism. Instruction-level parallelism is achieved, in part, through pipelining the execution unit or units, and in part through architectures that incorporate multiple execution units or multiple data paths that operate concurrently. Examples include superscalar, very-long-instruction-word (VLIW), and single-instruction multiple-data (SIMD) processor architectures.
Several recent cryptographic pseudo-random number generators claim to have been designed for efficient software implementation. However, RC4, SEAL, and WAKE, which are three of the fastest known cryptographic pseudo-random number generators, have been analyzed and found to be incapable of efficiently exploiting the instruction-level parallelism extant in the microprocessor for which the analysis was performed and in other processors with similar instruction-level parallelism.
A known technique for introducing additional parallelism into the encoding process is that of interleaving the outputs of several generators that run concurrently. While this technique has obvious application to parallel hardware implementations, and to software implementations on multi-processors, its effectiveness on single processors having substantial instruction-level parallelism has not been demonstrated. Moreover, this method can result in reduced performance on single processors that have few internal registers, as is typical of some legacy processor architectures.
When the outputs of multiple instances of a generator are interleaved, it is necessary to maintain state information for each instance of the generator. To take advantage of a processor's instruction-level parallelism with generators that do not naturally exploit parallelism, the concurrent implementation of the generators must be accomplished by interleaving their implementations on an instruction-by-instruction basis. For this method to be efficient, the processor should have enough internal registers to simultaneously hold the states of all the generators, or else there will be a performance penalty due to having to frequently save and restore generator states from memory. Also, if the generators make use of look-up tables, as is common in fast cryptographic pseudo-random number generators, then multiple generators may dictate the need for multiple look-up tables, which, by increasing the amount of data needed to be held in cache memory, can adversely affect performance.
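By way of illustration only, the fine-grained interleaving of several generator instances described above can be sketched as follows. The step function is a toy 32-bit xorshift chosen for brevity; it is not cryptographic and is not RC4, SEAL, WAKE, or the generator of the invention. The names `toy_step` and `interleave_fine` are hypothetical. The sketch shows only that one state per instance must be kept live at the same time, and that word i of the combined stream comes from instance i mod k.

```python
MASK32 = 0xFFFFFFFF

def toy_step(state):
    """Advance a toy 32-bit xorshift state (placeholder, NOT cryptographic)."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state

def interleave_fine(seeds, words_per_generator):
    """Round-robin interleave of k generator instances, one word at a time."""
    states = list(seeds)      # one live state per instance (assumed nonzero seeds)
    k = len(states)
    stream = []
    for _ in range(words_per_generator):
        # The k steps below are independent of one another; this independence
        # is what a processor with instruction-level parallelism could overlap.
        for g in range(k):
            states[g] = toy_step(states[g])
            stream.append(states[g])
    return stream
```

In a compiled implementation the k states would have to reside in registers at once for the inner steps to overlap without spilling to memory, which is the register-pressure concern noted above.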
On processors having little instruction-level parallelism, there is nothing to be gained by instruction-level interleaving of multiple generators. Therefore, an alternative strategy that reduces the amount of state information in use at any one time to that of just one generator, while maintaining a compatible interleaved data stream, is to interleave the generators on a coarse-grained, or block-by-block, basis. However, the required fine-grained interleaving of the data stream then dictates a less efficient data access pattern which, most likely, will result in lost performance through increased cache misses, whether the interleaving is performed on-the-fly or as a second pass over the data.
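As a minimal sketch of the coarse-grained alternative, each generator instance may be run to completion over a block before the next instance is started, with each output word written directly to its interleaved position. The toy xorshift step is again only a non-cryptographic placeholder, and the names `toy_step` and `fill_block_coarse` are hypothetical. Only one state is live at a time, but every write is separated from the previous one by a stride of k words, which illustrates the less efficient access pattern.

```python
MASK32 = 0xFFFFFFFF

def toy_step(state):
    """Advance a toy 32-bit xorshift state (placeholder, NOT cryptographic)."""
    state ^= (state << 13) & MASK32
    state ^= state >> 17
    state ^= (state << 5) & MASK32
    return state

def fill_block_coarse(states, words_per_generator):
    """Run each instance in turn over one block, writing with stride k.

    Returns the interleaved block and the updated states, so that the
    next block continues the same data stream.
    """
    k = len(states)
    block = [0] * (k * words_per_generator)
    new_states = []
    for g, state in enumerate(states):      # only this one state is live
        for i in range(words_per_generator):
            state = toy_step(state)
            block[i * k + g] = state        # stride-k write into the block
        new_states.append(state)
    return block, new_states
```

Under these assumptions the resulting block is word-for-word the same as one produced by stepping the instances in round-robin order, so a receiver cannot tell which schedule the sender used; only the memory access pattern differs.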
It is an object of the invention to achieve, on processors having sufficient instruction-level parallelism, substantially better efficiency and throughput than previously disclosed cryptographic pseudo-random number generators.
Advantageously, the ability of the invention to exploit instruction-level parallelism does not penalize its ability to also operate efficiently on older processors that have little instruction-level parallelism and those that have few internal registers. In particular, the invention does not use substantially more registers or cache memory than other fast algorithms that are poor at exploiting instruction-level parallelism.
Compared to interleaving multiple traditional generators, the invention further, advantageously, requires less state information to be maintained, which in a software implementation uses fewer registers and/or less cache memory, thus making it efficient on a broader spectrum of processors. Further, by having only one generator, the invention avoids the inconvenience and data overhead of needing to specify initialization parameters for multiple generators. (Note that interleaved generators also suffer a performance penalty from replication of the generator initialization procedure.) Another difficulty with interleaved generators is the need for the outputs of the multiple generators to be decorrelated from one another, which can be non-trivial to guarantee when all are essentially copies of the same state machine. A single generator, according to the invention, advantageously does not suffer these drawbacks.
The invention is itself advantageously amenable to the technique of interleaving, and the amount of instruction-level parallelism that can be exploited is thus further increased.
The invention, advantageously, is efficient across a broad spectrum of processors, including those that have substantial instruction-level parallelism and those that have few internal registers. It is particularly advantageous to applications in which an encryption protocol is to be standardized while leaving a wide range of price/performance options available for the equipment that will implement the protocol. Such choice is especially advantageous where the encryption protocol will persist into a future in which the best price/performance trade-offs are as yet unknown.