Computer systems often include one or more general purpose processors (e.g., central processing units (CPUs)) and one or more specialized data parallel compute nodes (e.g., graphics processing units (GPUs) or single instruction, multiple data (SIMD) execution units in CPUs). General purpose processors generally perform general purpose processing on computer systems, and data parallel compute nodes generally perform data parallel processing (e.g., graphics processing) on computer systems. General purpose processors often have the ability to implement data parallel algorithms but do so without the optimized hardware resources found in data parallel compute nodes. As a result, general purpose processors may be far less efficient in executing data parallel algorithms than data parallel compute nodes.
Data parallel compute nodes have traditionally played a supporting role to general purpose processors in executing programs on computer systems. As the role of hardware optimized for data parallel algorithms increases due to enhancements in data parallel compute node processing capabilities, it would be desirable to enhance the ability of programmers to program data parallel compute nodes and make the programming of data parallel compute nodes easier.
A common technique in computational linear algebra is a tile or block decomposition algorithm where the computational space is partitioned into sub-spaces and an algorithm is recursively implemented by processing each tile or block as if it was a point. Such a decomposition, however, involves a detailed tracking of indices and the relative geometry of the tiles and blocks. As a result, the process of creating the indices and geometry may be error prone and difficult to implement.