The Mersenne Twister is a well-known algorithm for generating pseudo-random number sequences. It is capable of generating high-quality random numbers with a long period at high speed. Among the several variants of the Mersenne Twister, the MT19937 generator is recognized as highly reliable and is thus widely used for various scientific and commercial purposes. The MT19937 generator is based on a recursion that can be represented as x_{i+N} = f(x_i, x_{i+1}, x_{i+M}), wherein N=624 and M=397 as defined by the algorithm; that is, the (i+N)th number is a function of the ith, (i+1)th, and (i+M)th numbers. One common implementation of the MT19937 generator generates 624 pseudo-randomly distributed 32-bit numbers during each iteration. Each succeeding iteration relies on the results computed during the preceding iteration.
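By way of illustration only, the recursion above can be sketched in Python as follows. The function names (seed_state, twist, temper, generate) are hypothetical, and the numeric constants (the twist matrix constant, the tempering masks, and the seeding multiplier) are assumptions taken from the publicly documented MT19937 reference algorithm rather than from the present description. The sketch is a plain sequential rendering, not an optimized or parallel implementation.

```python
N, M = 624, 397            # recursion parameters named in the text
MATRIX_A = 0x9908B0DF      # assumption: constant from the public MT19937 reference
UPPER_MASK = 0x80000000    # most significant bit of a 32-bit word
LOWER_MASK = 0x7FFFFFFF    # remaining 31 bits


def seed_state(seed):
    """Fill the 624-word state from a 32-bit seed (reference-style initialization)."""
    state = [seed & 0xFFFFFFFF]
    for i in range(1, N):
        prev = state[-1]
        state.append((1812433253 * (prev ^ (prev >> 30)) + i) & 0xFFFFFFFF)
    return state


def twist(state):
    """One iteration: x[i+N] = f(x[i], x[i+1], x[i+M]), computed in place."""
    for i in range(N):
        y = (state[i] & UPPER_MASK) | (state[(i + 1) % N] & LOWER_MASK)
        nxt = state[(i + M) % N] ^ (y >> 1)
        if y & 1:
            nxt ^= MATRIX_A
        state[i] = nxt


def temper(y):
    """Output transformation applied to each state word before it is returned."""
    y ^= y >> 11
    y ^= (y << 7) & 0x9D2C5680
    y ^= (y << 15) & 0xEFC60000
    y ^= y >> 18
    return y & 0xFFFFFFFF


def generate(seed, count):
    """Return `count` 32-bit outputs, producing 624 numbers per iteration."""
    state = seed_state(seed)
    out = []
    while len(out) < count:
        twist(state)                        # depends on the previous iteration's state
        out.extend(temper(w) for w in state)
    return out[:count]
```

For example, generate(5489, 1) uses the customary default seed of the reference algorithm and is expected to return [3499211612], the value commonly used to check an MT19937 implementation.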
An MT19937 generator usually demands a large memory space to store its state elements and involves frequent interactions among those state elements. This makes parallel processing challenging because inter-thread communication usually carries heavy overhead. Conventionally, MT19937 programs adopt only a low level of parallelism, at the level of multiple generators, which is an inefficient utilization of high-performance hardware. Several derivative forms of the Mersenne Twister, e.g. MTGP, have been developed for parallel processing on multi-core processing units, such as a graphics processing unit (GPU) and a central processing unit (CPU), but they typically yield results with statistical properties inferior to those of MT19937.
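The interactions among state elements can be made concrete with a small sketch. It assumes the usual in-place update order, in which element i is overwritten using elements i, i+1, and i+M (indices taken modulo N); the helper name reads_new_value is hypothetical. The sketch lists which update steps read only values from the preceding iteration and which steps must wait for values rewritten earlier in the same iteration.

```python
N, M = 624, 397  # recursion parameters named in the text


def reads_new_value(i):
    """Return the source indices of step i that were already rewritten in this iteration.

    Assumption: elements are updated in place in ascending index order, so any
    source index smaller than i holds a value produced earlier in the same pass.
    """
    sources = ((i + 1) % N, (i + M) % N)
    return [s for s in sources if s < i]


# Steps that depend only on the preceding iteration's state.
independent = [i for i in range(N) if not reads_new_value(i)]

print(len(independent))        # number of steps with no intra-iteration dependency
print(reads_new_value(N - 1))  # the last step reads two freshly rewritten elements
```

Under these assumptions only the first N-M = 227 steps are free of intra-iteration dependencies; every later step reads at least one element produced in the same pass, which is where threads sharing one generator would need to exchange data.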
In addition, given the large quantity of numbers generated in each iteration (624) and the 32-bit length of each number, nearly 2.5 KB of memory (624 × 4 bytes = 2,496 bytes) is used to store the state. However, the amount of fast shared memory of a typical GPU running a conventional MT19937 program is only 48 KB per streaming multiprocessor (SM), which can fit only 19 such generators. In contrast, the register file of such an SM has a much larger capacity, e.g. 128 KB or 256 KB, but unfortunately using the register file for this purpose in a conventional MT19937 program requires a large amount of inter-thread communication.
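The capacity figures above follow from simple arithmetic, sketched below. The variable and function names are hypothetical, and 1 KB is taken as 1024 bytes. The register-file counts are an illustrative upper bound only, since they ignore registers needed for anything other than generator state.

```python
STATE_WORDS = 624   # numbers held in one generator's state
WORD_BYTES = 4      # each number is 32 bits long
KB = 1024           # assumption: binary kilobytes

state_bytes = STATE_WORDS * WORD_BYTES   # 2496 bytes, i.e. nearly 2.5 KB


def generators_that_fit(capacity_kb):
    """Whole generator states that fit in a memory of the given size."""
    return (capacity_kb * KB) // state_bytes


shared_fit = generators_that_fit(48)     # shared memory per SM
regfile_fit_128 = generators_that_fit(128)
regfile_fit_256 = generators_that_fit(256)

print(state_bytes, shared_fit, regfile_fit_128, regfile_fit_256)
# prints: 2496 19 52 105
```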