Graphical processing units (GPUs) provide high computation capabilities at lower prices than comparable central processing units (CPUs). For example, one particular GPU can compute one trillion floating point operations in a single second (i.e., one teraflop). GPUs may be provided in a variety of devices (e.g., desktop computers) and/or systems (e.g., a high performance computing center) to provide improved numerical performance.
A GPU may include a number of characteristics. For example, a GPU may include many vector processing elements (e.g., cores) operating in parallel, where each vector core addresses a separate on-device memory. There is high memory bandwidth between the on-device memories and the vector cores, and memory latency is relatively large (e.g., four-hundred clock cycles). A GPU may provide zero overhead thread scheduling (e.g., which enables algorithms with high thread counts); however, the GPU may include limited support for communications between threads. A relatively low memory bandwidth is provided between the GPU's device memory and host memory. A GPU also provides limited support for general-purpose programming constructs (e.g., code executing on the GPU cannot allocate memory itself, this must be accomplished by a host CPU).
These characteristics mean that programming for the GPU is not straightforward and highly parallel algorithms need to be created for the GPU. A typical high-level program will be hosted on a CPU that invokes computational kernels on the GPU in a sequence to achieve a result. Because of the relatively low bandwidth available to transfer data to and from the GPU's own memory, efficient programs may transfer data only when necessary. Furthermore, in such high-level programs, GPU-executable programming code is not compiled prior to execution, but rather is compiled during execution (e.g., when such code is needed by the CPU).