A current focus for improving processor performance is to provide multiple processor cores on a die to increase processor throughput. Central processing units in higher-performing computing devices, such as workstations and servers, often include several processor cores on a single die. Many of these devices also include one or more graphics processing units, each of which can include hundreds of processor cores on a single die. Graphics processing units, in addition to providing computations for computer graphics, are often configured to perform computations in applications previously performed by the central processing unit, a technique referred to as general purpose computing on graphics processing units, or GPGPU. In one example, GPGPU computing uses central processing units and graphics processing units together in a heterogeneous co-processing computing model. The sequential or relatively lightly parallel parts of the application run on the cores in the central processing units, and the computationally intensive, often massively parallel parts of the application are accelerated by the many cores in the graphics processing units. Parallel computer applications having many concurrent threads executed in GPGPU computing can realize a performance boost of ten to one hundred times over the applications executed on multiple core central processing units. Additionally, GPGPU systems typically are less expensive and use less power per core than multiple core central processing units.
Parallel computer applications having concurrent threads and executed on multiple processors present great promise for increased performance but also present great challenges to developers. The process of developing parallel applications is challenging in that many common tools, techniques, programming languages, and frameworks, and even the developers themselves, are adapted to create sequential programs.