The present invention relates in general to parallel data processing and in particular to processing of loops with internal data dependencies using a parallel processor.
Loops with internal data dependencies (i.e., where one iteration of the loop relies on a result computed during a previous iteration) are included in a variety of data-processing algorithms. An example is the Mersenne Twister, a well-known algorithm for generating sequences of pseudorandom numbers. One common implementation of the Mersenne Twister generates a stream of 32-bit values with a random-seeming distribution. In this implementation, a state array MT[0:623] of 624 32-bit values is initialized from a seed supplied by a user. To generate 624 pseudorandom numbers, the state array MT is first updated using a feedback shift procedure (referred to herein as the “twister phase” of the algorithm) represented by the following pseudocode fragment:
for kk=0 to 622 {
  y=MSB(MT[kk])|LSBS(MT[kk+1]);
  MT[kk]=U(MT[(kk+397)%624],y);
}
y=MSB(MT[623])|LSBS(MT[0]);
MT[623]=U(MT[396],y);  (Eq. 1)
In this pseudocode fragment, “MSB” is a function that extracts the most significant bit of a 32-bit value, “LSBS” is a function that extracts the 31 least significant bits of a 32-bit value, “|” is a bit-field concatenation operator, and “%” is the modulo operator. U is a bit manipulation function defined as:
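$$U(x,y)=\begin{cases}x \wedge (y \mathbin{>>} 1), & \text{even } y\\ x \wedge (y \mathbin{>>} 1) \wedge (2567483615), & \text{odd } y,\end{cases}\qquad\text{(Eq. 2)}$$
where “>>” is a right-shift operator and “∧” (written “^” in the pseudocode fragments herein) is a bitwise XOR (exclusive or) operator.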
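For illustration only, the twister phase of Eq. 1 and the function U of Eq. 2 can be sketched in Python as follows. The names `u`, `twist`, `UPPER_MASK`, `LOWER_MASK`, `MATRIX_A`, `N` and `M` are hypothetical and chosen for this sketch. The shift in `u` is taken to be a right shift by one bit, as in the reference MT19937 code, and the state is assumed to be a list of 624 integers each less than 2^32.

```python
UPPER_MASK = 0x80000000  # selects the most significant bit (MSB)
LOWER_MASK = 0x7FFFFFFF  # selects the 31 least significant bits (LSBS)
MATRIX_A = 2567483615    # constant of Eq. 2 (0x9908B0DF)
N, M = 624, 397          # state length and feedback offset of Eq. 1


def u(x, y):
    """Bit manipulation function U of Eq. 2 (right shift assumed)."""
    result = x ^ (y >> 1)
    if y & 1:  # odd y
        result ^= MATRIX_A
    return result


def twist(mt):
    """Update the 624-entry state list in place, following Eq. 1."""
    for kk in range(N - 1):
        # Concatenate the MSB of MT[kk] with the 31 LSBs of MT[kk+1].
        y = (mt[kk] & UPPER_MASK) | (mt[kk + 1] & LOWER_MASK)
        mt[kk] = u(mt[(kk + M) % N], y)
    # Final element wraps around to MT[0], which was updated above.
    y = (mt[N - 1] & UPPER_MASK) | (mt[0] & LOWER_MASK)
    mt[N - 1] = u(mt[M - 1], y)
    return mt
```

In this sketch the internal data dependency is visible in the index arithmetic: for kk of 227 or greater, (kk+397)%624 equals kk−227, so the iteration reads an element that an earlier iteration of the same loop has already overwritten.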
After updating MT, an array of 624 pseudorandom numbers random[0:623] can be produced in a “generation phase” that uses bit manipulations on each element of MT, referred to herein as “tempering shifts.” For example, the following pseudocode fragment can be used:
for kk=0 to 623 {
  y=MT[kk];
  y=y^(y>>11);
  y=y^((y<<7)&2636928640);
  y=y^((y<<15)&4022730752);
  y=y^(y>>18);
  random[kk]=y;
}  (Eq. 3)
In this pseudocode fragment, “>>” is a right-shift operator, and “&” is a bitwise AND operator. Unlike Eq. 1, Eq. 3 does not modify any of the values in the state array MT.
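By way of a minimal sketch, the generation phase of Eq. 3 can be written in Python as follows. The function names `temper` and `generate` are hypothetical, and the input values are assumed to be less than 2^32. Because Python integers are unbounded, the left shifts may temporarily exceed 32 bits; the 32-bit AND masks of Eq. 3 discard the excess.

```python
def temper(y):
    """Apply the tempering shifts of Eq. 3 to one 32-bit state value."""
    y ^= y >> 11
    y ^= (y << 7) & 2636928640    # mask 0x9D2C5680
    y ^= (y << 15) & 4022730752   # mask 0xEFC60000
    y ^= y >> 18
    return y


def generate(mt):
    """Return the tempered outputs; the state list mt is left unmodified."""
    return [temper(value) for value in mt]
```

Each output depends only on the corresponding state element, so the iterations of this loop, unlike those of Eq. 1, are independent of one another.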
After generating the array of pseudorandom numbers, the twister phase can be performed again and another array of 624 numbers can be generated, allowing the pseudorandom sequence to be extended indefinitely. The Mersenne Twister produces a pseudorandom sequence with an extremely long period ($2^{19937}-1$ in one 32-bit implementation) and computes pseudorandom numbers relatively quickly; hence it has become widely used in a variety of applications.
Conventionally, the Mersenne Twister is executed using a single processing thread. The loop iterations in the twister phase are executed sequentially, followed by the loop iterations in the generation phase. The Mersenne Twister has also been implemented on parallel processing systems, e.g., systems with multiple CPUs. In such implementations, each CPU executes the algorithm described above to generate a stream of pseudorandom numbers, but each CPU starts from a different seed so that the streams from different CPUs are all different.
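The conventional multi-stream arrangement can be illustrated with the following Python sketch, which is not taken from any particular implementation. It relies on the documented fact that Python's `random` module uses the Mersenne Twister as its core generator. The seed values, the function name `worker_stream`, and the use of a thread pool as a stand-in for multiple CPUs are assumptions made for this sketch.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def worker_stream(seed, count):
    """Generate one worker's stream from its own seed."""
    rng = random.Random(seed)  # independent generator state per worker
    return [rng.getrandbits(32) for _ in range(count)]


seeds = [1, 2, 3, 4]  # hypothetical seeds, one per worker
with ThreadPoolExecutor(max_workers=len(seeds)) as pool:
    # Each worker runs the whole algorithm sequentially on its own state.
    streams = list(pool.map(worker_stream, seeds, [624] * len(seeds)))
```

In this arrangement the parallelism lies only across streams: within each worker, the twister-phase loop still executes sequentially.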