1. Technical Field
This disclosure relates generally to software for data prefetching in a data processing system and more specifically to shared data prefetching and coalescing for program loops written in partitioned global address space languages in the data processing system.
2. Description of the Related Art
Partitioned Global Address Space (PGAS) programming languages offer a high-productivity programming model for parallel programming which is attractive to application developers. PGAS languages, such as Unified Parallel C (UPC) and Co-array Fortran, combine the simplicity of shared-memory programming with the efficiency of the message-passing paradigm exemplified by the Message Passing Interface (MPI). Both languages are increasingly attractive alternatives to previously established parallel programming models due to their conceptual simplicity and performance potential at a reduced level of program complexity.
Most of the execution time of typical PGAS application programs is consumed in performing data transfers to and from a distributed shared address space. Shared data is typically distributed across a large number of cluster nodes; therefore, accessing shared data typically involves network communication between nodes. The need to transfer data between different cluster nodes often becomes a performance bottleneck for this type of application program.
In one example, an existing optimization solution attempts to reduce the number of data transfers flowing across the communication network by coalescing shared accesses to elements of the same shared array when a compiler can prove that the shared accesses are executed by the same thread and map to shared storage associated with a remote thread. The existing approach requires the compiler to ensure that both of these conditions hold. Existing static analysis techniques focus on the UPC work sharing loop construct (upc_forall) and, for each shared array access in the parallel loop, attempt to determine whether the array element referenced by the executing thread resides in the portion of the shared memory space allocated with affinity to a particular thread. When established, this relationship between an accessing thread and a shared memory storage location of an array element can be used by a compiler to optimize the communication requirements of the program.
Two possible optimizations driven by the result of the analysis are privatization and coalescing of shared memory accesses. The privatization optimization targets shared accesses that have proven affinity with the executing thread (shared accesses whose associated storage is physically located on the cluster node where the executing thread runs). The coalescing optimization targets shared accesses that have proven affinity with the same remote thread (a thread that runs on a different cluster node from the node where the executing thread runs). Static analysis may be able to coalesce data when a physical data mapping is available, for example, when the number of threads and the number of nodes are known at compile time.
The existing static locality analysis techniques address upc_forall loops and are typically of no use for other commonly used loop constructs such as for loops and do/while loops. Furthermore, the existing locality analysis techniques may not have sufficient information, at compile time, to successfully analyze all shared accesses in upc_forall loops. A UPC program typically makes extensive use of loop constructs other than the upc_forall work sharing construct and consequently presents substantial optimization opportunities that are not addressed by existing technology.