Computer systems often include one or more general purpose processors (e.g., central processing units (CPUs)) and one or more specialized data parallel compute nodes (e.g., graphics processing units (GPUs) or single instruction, multiple data (SIMD) execution units in CPUs). General purpose processors generally perform general purpose processing on computer systems, and data parallel compute nodes generally perform data parallel processing (e.g., graphics processing) on computer systems. General purpose processors often have the ability to implement data parallel algorithms but do so without the optimized hardware resources found in data parallel compute nodes. As a result, general purpose processors may be far less efficient in executing data parallel algorithms than data parallel compute nodes.
Data parallel compute nodes have traditionally played a supporting role to general purpose processors in executing programs on computer systems. As the role of hardware optimized for data parallel algorithms increases due to enhancements in data parallel compute node processing capabilities, it would be desirable to enhance the ability of programmers to program data parallel compute nodes and make the programming of data parallel compute nodes easier.
Data parallel compute nodes generally execute designated data parallel algorithms known as kernels or shaders, for example. In order to execute multiple data parallel algorithms, significant computing overhead is typically expended to launch each data parallel algorithm. Moreover, when a data parallel compute node has a different memory hierarchy than a corresponding host compute node, additional overhead may be expended copying intermediate results of different data parallel algorithms to and from the host compute node.