Technical Field
The present invention relates generally to information processing and, in particular, to compiling a parallel loop with a complex access pattern for writing an array for a Graphics Processing Unit (GPU) and a Central Processing Unit (CPU).
Description of the Related Art
For high performance, an Application Programming Interface (API) is provided for data transfer between a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU) by which only contiguous memory regions are transferred therebetween. Examples of such an API include cudaMemcpy in CUDA®, the per-page memory coherency mechanism in NVLink®, and the per-cache-line cache coherency mechanism in NVLink2.
For a parallel loop that is executed by multiple threads with write operations to an array, it is not easy for a compiler to generate code for the GPU and CPU that can be executed in parallel when the regions of the array to be written by a thread are not contiguous. For example, one difficulty is that it is not known how to correctly generate parallel code in the case that a part of the array is written by other threads that do not execute the parallel loop. As another example, even in the case that all of the array elements are written by a parallel loop, a result may be wrong if multiple threads each transfer the whole array from the GPU to the CPU. This is because such a transfer may overwrite array elements that were not updated by the GPU but were updated by the CPU.
The preceding can be illustrated with respect to the following sample pseudocode program:
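import java.util.stream.IntStream;

public class Test extends Thread {
    int X[] = new int[1000];
    int id;

    Test(int id) { this.id = id; }

    void test(int a[]) {
        if (id >= 0) {
            // Threads t0 and t1: parallel loop writing every third element, offset by id
            IntStream.rangeClosed(0, 100).parallel().forEach(i -> { a[3 * i + id] += i; });
        } else {
            // Thread t2: writes one element without executing the parallel loop
            a[2] = 2;
        }
    }

    public void run() { test(X); }

    public static void main(String[] a) {
        Test t0 = new Test(0);
        Test t1 = new Test(1);
        Test t2 = new Test(-1);
        t0.start(); t1.start(); t2.start();
        ...
    }
}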
As can be determined relative to the preceding pseudocode, a problem exists in how to correctly generate parallel code in the case that a part of an array is written by other threads that do not execute the parallel loop (t2 in the sample pseudocode program).
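The hazard posed by a thread such as t2 can be sketched in plain Java, with no real GPU involved. In this hypothetical simulation, the array deviceCopy stands in for GPU memory, and the class and method names (StaleCopyBack, wholeArrayCopyBack, selectiveCopyBack) are illustrative assumptions rather than part of any actual API. The first method copies the whole array back to the host, as a contiguous-region transfer would, and the second copies back only the elements that the loop wrote.

```java
import java.util.Arrays;

class StaleCopyBack {
    // Hypothetical simulation: whole-array copy back from "device" to host.
    static int[] wholeArrayCopyBack() {
        int[] host = new int[10];
        int[] deviceCopy = Arrays.copyOf(host, host.length); // host-to-device transfer

        // "GPU" side: the loop writes elements 0, 3, 6, 9 of the device copy.
        for (int i = 0; i <= 3; i++) {
            deviceCopy[3 * i] += i;
        }
        // CPU side: a thread that does not execute the loop writes host[2].
        host[2] = 2;

        // Contiguous transfer of the whole array: host[2] is overwritten
        // with the stale value held by the device copy.
        System.arraycopy(deviceCopy, 0, host, 0, host.length);
        return host;
    }

    // Hypothetical alternative: copy back only the elements the loop wrote.
    static int[] selectiveCopyBack() {
        int[] host = new int[10];
        int[] deviceCopy = Arrays.copyOf(host, host.length);
        for (int i = 0; i <= 3; i++) {
            deviceCopy[3 * i] += i;
        }
        host[2] = 2;
        for (int i = 0; i <= 3; i++) {
            host[3 * i] = deviceCopy[3 * i]; // element-wise, leaves host[2] intact
        }
        return host;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(wholeArrayCopyBack()));
        System.out.println(Arrays.toString(selectiveCopyBack()));
    }
}
```

In this sketch the whole-array transfer loses the CPU-side write to element 2, while the element-wise copy preserves it. The element-wise copy is shown only to make the contrast visible; it is not presented as the transfer scheme of the present invention.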
As can also be determined relative to the preceding pseudocode, another problem exists in how to generate code for the GPU and CPU for a parallel loop, which is executed by multiple threads with write operations to an array, in the case that each thread writes data into non-contiguous array elements (t0 and t1 in the sample pseudocode program).
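The access pattern of t0 and t1 in the sample program, a[3*i + id] for i from 0 to 100, can be examined with a short Java sketch. The class StridedFootprint and its methods are hypothetical helpers introduced only for illustration. The sketch lists the indices each thread writes and counts how many of the other thread's writes fall inside the smallest contiguous range that covers a thread's own writes.

```java
import java.util.stream.IntStream;

class StridedFootprint {
    // Indices written by the sample loop a[3*i + id] for i in 0..100.
    static int[] writtenIndices(int id) {
        return IntStream.rangeClosed(0, 100).map(i -> 3 * i + id).toArray();
    }

    // Counts the indices in `other` that lie inside the smallest
    // contiguous range [own[0], own[last]] covering `own`.
    static long foreignWritesInsideSpan(int[] own, int[] other) {
        int lo = own[0];
        int hi = own[own.length - 1];
        return IntStream.of(other).filter(k -> k >= lo && k <= hi).count();
    }

    public static void main(String[] args) {
        int[] t0 = writtenIndices(0);
        int[] t1 = writtenIndices(1);
        System.out.println(foreignWritesInsideSpan(t0, t1));
        System.out.println(foreignWritesInsideSpan(t1, t0));
    }
}
```

Under these assumptions, a contiguous transfer covering everything one thread wrote necessarily spans most of the elements written by the other thread, which is why a transfer interface limited to contiguous regions makes this pattern difficult to handle.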