Parallelizing code with reductions is essential to achieving high performance in a wide range of numerical applications, especially codes that simulate atomic interactions.
The parallelization of a scalar reduction is well understood. Consider, for example, “for i=0 . . . n do r+=xxx”, where “xxx” is a number computed by an expression that does not use the “r” value. With xxx taken to be a[i], this notation represents the iterative operation r=a[0]+a[1]+a[2]+ . . . +a[n−1], which can also be written as r=0; for (i=0; i<n; i++) r+=a[i] or as r=0; for (i=0; i<n; i++) r=r+a[i]. The “r” value is also referred to herein as the reduction variable, in that it is a variable (i.e., a name for a value) that is the target of the reduction, which in the above example is the sum of the a[i] values.
Such a loop is easily parallelized by computing partial reductions, with one reduction variable per SIMD lane or parallel thread. The final value is assembled at the end by performing a sequential reduction over the partial reductions. Another well-understood pattern is that of regular array reduction. Consider, for example, “for i=0 . . . n do a[i]+=xxx”, which represents the iterative operations a[i]=a[i]+xxx. In such a case, distinct intervals of the “i” loop may be assigned to distinct threads and computed in parallel. In this case, “a[i]” is referred to as the reduction variable.
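The partial-reduction scheme for the scalar case can be sketched in C as follows. This is a minimal illustration only, under stated assumptions: the name partial_sum_reduce and the constant NUM_WORKERS are hypothetical, xxx is taken to be a[i], and the workers are simulated by an ordinary outer loop whose iterations are mutually independent, so that each iteration could be handed to its own thread or SIMD lane.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_WORKERS 4 /* hypothetical number of threads or SIMD lanes */

/* Sum a[0..n-1] using one partial reduction variable per worker. */
double partial_sum_reduce(const double *a, size_t n)
{
    double partial[NUM_WORKERS] = {0.0};

    assert(a != NULL || n == 0);

    /* Each value of w touches only partial[w] and its own interval of i,
       so the iterations of this outer loop are independent and could each
       run on a separate thread or lane. They run in sequence here. */
    for (size_t w = 0; w < NUM_WORKERS; w++) {
        size_t begin = n * w / NUM_WORKERS;
        size_t end = n * (w + 1) / NUM_WORKERS;
        for (size_t i = begin; i < end; i++)
            partial[w] += a[i];
    }

    /* Sequential final reduction over the partial results. */
    double r = 0.0;
    for (size_t w = 0; w < NUM_WORKERS; w++)
        r += partial[w];
    return r;
}
```

Because the additions are regrouped, a floating-point result may differ in its last bits from that of the purely sequential loop; for integer-valued data such as 1 through 10 the two agree exactly.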
The more challenging pattern is that of irregular array reduction. Consider, for example, “for i=0 . . . n do a[b[i]]+=xxx.” In this case, the for loop cannot be directly parallelized, as b[i] and b[i′] may point to the same element, where i and i′ are two instances of the iteration variable i, for example i=5 and i′=7. Unfortunately, this pattern is frequent in many numerical applications. In the above case, “a[ ]” is also referred to as a reduction variable, as the xxx values are being reduced into it. However, unlike in the previous case, the actual element of a[ ] that is being reduced can be identified only at run-time, as the values of b[i] are typically not known until run-time. Such a reduction variable is also referred to herein as a reduction variable whose address is determinable only at run-time. Other patterns have the same irregular characteristics, for example “for i=0 . . . n do *a[i]+=xxx”, where a[i=0 . . . n] is an array of pointers, and where the value pointed to by each pointer a[i=0 . . . n] is incremented by the xxx expression. Note also that while only loops with a single statement have been described, real applications typically have several statements, including conditional statements. Thus, while the same principle applies to more complicated loops, for purposes of discussion, the examples described herein focus on such simpler single-statement loops.
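The difficulty can be made concrete with a short C sketch. The function names irregular_reduce and intervals_conflict are hypothetical, and xxx is modeled, as an assumption, by a precomputed array x[ ]. The first function is the sequential loop; the second checks, by brute force, whether splitting the “i” loop at a given point would let two intervals update the same element of a[ ], which is the situation that prevents direct parallelization.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential irregular reduction: a[b[i]] += x[i] for i = 0 .. n-1.
   The caller must ensure every b[i] is a valid index into a. */
void irregular_reduce(double *a, const size_t *b, const double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[b[i]] += x[i];
}

/* Returns 1 if some iteration in [0, mid) and some iteration in [mid, n)
   target the same element of a, and 0 otherwise. The answer depends on
   the contents of b, which are only known at run-time. */
int intervals_conflict(const size_t *b, size_t n, size_t mid)
{
    assert(mid <= n);
    for (size_t i = 0; i < mid; i++)
        for (size_t j = mid; j < n; j++)
            if (b[i] == b[j])
                return 1;
    return 0;
}
```

With b = {0, 2, 1, 2} and a split at mid = 2, iterations 1 and 3 both update a[2], so two threads running the two halves without coordination could race on that element.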
To address this, one approach is to privatize the entire “a” array, keeping one private copy per thread, and then to assign a distinct interval of the “i” loop to each thread. This approach significantly increases the memory footprint, in proportion to the size of the reduction array multiplied by the number of concurrent threads. In addition, a final reduction must then be performed over all private copies to generate the final “a[ ]” values.
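The privatization approach can be sketched in C as follows. This is an illustrative sketch rather than a definitive implementation: the function name privatized_reduce and its parameters are hypothetical, xxx is modeled by an assumed array x[ ], and the workers are simulated by an outer loop whose iterations are independent of one another. The allocation of workers × m elements makes the memory cost visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Irregular reduction a[b[i]] += x[i] with one private copy of a per worker.
   m is the length of a, n the trip count of the i loop.
   Returns 0 on success, -1 if the private copies cannot be allocated. */
int privatized_reduce(double *a, size_t m, const size_t *b, const double *x,
                      size_t n, size_t workers)
{
    assert(workers > 0 && m > 0);

    /* Extra footprint: workers * m elements on top of the original array. */
    double *priv = malloc(workers * m * sizeof *priv);
    if (priv == NULL)
        return -1;
    for (size_t k = 0; k < workers * m; k++)
        priv[k] = 0.0;

    /* Each worker reduces its own interval of i into its own copy, so no
       two workers ever write the same memory location. The iterations of
       this outer loop could therefore run on separate threads. */
    for (size_t w = 0; w < workers; w++) {
        double *mine = priv + w * m;
        size_t begin = n * w / workers;
        size_t end = n * (w + 1) / workers;
        for (size_t i = begin; i < end; i++)
            mine[b[i]] += x[i];
    }

    /* Final reduction of all private copies into a. */
    for (size_t j = 0; j < m; j++)
        for (size_t w = 0; w < workers; w++)
            a[j] += priv[w * m + j];

    free(priv);
    return 0;
}
```

The final loop visits every element of every private copy regardless of how few elements were actually updated, which is the second cost noted above.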
Another approach uses software or hardware support to parallelize the computations under the assumption that no conflict will occur, i.e., that no two processors will attempt to update the same element of a[ ] concurrently. For such a scheme, Transactional Memory is ideal, as the software or hardware implementation will undo the computation when a conflict occurs. While the hardware approach is in principle faster, it requires significant hardware modifications to the architecture that may or may not be present on the target machine. The software approach is generally too slow to be a competitive solution for such patterns. Furthermore, both approaches rely on the assumption that conflicts are infrequent, which is highly dependent on the program and its input.
It would be highly desirable to provide a system and method for parallelizing irregular reductions that does not require any custom hardware (other than parallel threads or cores) and that exhibits good parallel speedup while keeping the memory footprint of the original application.