The mix of computational elements that make up a computer is becoming increasingly heterogeneous. Computers today already couple a conventional processor (e.g., a central processing unit (CPU)) with a graphics processor (GPU), and there is growing interest in using the GPU for more than graphics processing because of its exceptional computational abilities on particular problems. In this way, a computer with a CPU and a GPU is heterogeneous because it offers a specialized computational element (the GPU) for computational tasks that suit its architecture, and a truly general-purpose computational element (the CPU) for all other tasks (including, if needed, the computational tasks that are well suited to the GPU). The GPU is an example of a hardware accelerator. In addition to GPUs, other forms of hardware accelerators are gaining wider consideration, and there are already examples of accelerators in the form of field programmable gate arrays (FPGAs) and fixed-function accelerators for cryptography, XML parsing, regular expression matching, physics engines, and so on.
Programming technologies exist for CPUs, GPUs, FPGAs, and various accelerators in isolation. For example, programming languages for a GPU include OpenMP, CUDA, and OpenCL, all of which can be viewed as extensions of the C programming language. A GPU-specific compiler takes as input a program written in one of these languages and preprocesses the program to separate the GPU-specific code (hereinafter referred to as device code) from the remaining program code (hereinafter referred to as the host code). The device code is typically recognized by the presence of explicit device-specific language extensions, compiler directives (e.g., pragmas), or syntax (e.g., kernel launch with <<< . . . >>> in CUDA). The device code is further translated and compiled into device-specific machine code (hereinafter referred to as an artifact). The host code is modified as part of the compilation process to invoke the device artifact when the program executes. The device artifact may either be embedded into the host machine code, or it may exist in a repository and be identified via a unique identifier that is part of the invocation process.
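The relationship between host code, a device artifact, and a repository lookup by unique identifier can be illustrated with a minimal sketch. It is a toy model in Python, not the behavior of any real GPU compiler: the names ARTIFACT_REPOSITORY, register_artifact, scale_by_two, host_program, and the identifier "scale_by_two.v1" are all hypothetical, and the "device artifact" is an ordinary function standing in for device-specific machine code.

```python
from typing import Callable, Dict, List

Kernel = Callable[[List[float]], List[float]]

# Hypothetical repository mapping a unique identifier to a device artifact.
ARTIFACT_REPOSITORY: Dict[str, Kernel] = {}


def register_artifact(artifact_id: str, kernel: Kernel) -> None:
    """Stand-in for the compiler storing a compiled device artifact."""
    ARTIFACT_REPOSITORY[artifact_id] = kernel


def scale_by_two(values: List[float]) -> List[float]:
    """Plays the role of device code; here it simply runs on the CPU."""
    return [2.0 * v for v in values]


# "Compilation" step: the device code is stored under a unique identifier.
register_artifact("scale_by_two.v1", scale_by_two)


def host_program(values: List[float]) -> List[float]:
    """Plays the role of host code after the compiler has modified it.

    Instead of containing the device code, it carries only the identifier
    and uses it to find and invoke the artifact at run time.
    """
    kernel = ARTIFACT_REPOSITORY["scale_by_two.v1"]
    return kernel(values)
```

In this sketch the alternative described above, embedding the artifact in the host machine code, would correspond to host_program calling scale_by_two directly rather than going through the repository.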
Programming languages and solutions for heterogeneous computers that include an FPGA are comparable to GPU programming solutions, although FPGAs do not yet enjoy the benefits of a widely accepted C dialect. There are several extensions to the C language offered by the various FPGA-technology vendors, all of whom generally compile code written in their C dialect in a manner very similar to that followed by the compilers for GPUs: the compiler partitions the program into device (FPGA) code and host code, each is separately compiled, and the host code is modified to invoke the device artifact.
Regardless of the heterogeneous mix of processing elements in a computer, the programming process to date is generally similar and shares the following characteristics. First, the disparate languages or dialects in which different architectures must be programmed make it hard for a single programmer or programming team to work equally well on all aspects of a project. Second, relatively little attention has been paid to co-execution, the problem of orchestrating a program execution using multiple distinct computational elements that work seamlessly together. Co-execution requires partitioning a program into tasks that can map to the computational elements, mapping or scheduling the tasks onto the computational elements, and handling the communication between computational elements, which in turn requires serializing data and preparing it for transmission, routing data between processors, and receiving and deserializing data. Given the complexities associated with orchestrating the execution of a program on a heterogeneous computer, a very early static decision must be made on what will execute where, a decision that is hard and costly to revisit as a project evolves. This is exacerbated by the fact that some of the accelerators, for example, FPGAs, are difficult to program well and place a heavy engineering burden on programmers.
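The co-execution steps listed above (partitioning into tasks, mapping tasks onto computational elements, serializing, routing, and deserializing data) can be sketched as follows. This is a simplified illustration under stated assumptions, not a real runtime: the elements named "cpu" and "gpu" are simulated by ordinary Python functions, JSON is an arbitrary choice of wire format, and TASKS, PLACEMENT, make_element, and co_execute are hypothetical names. The fixed PLACEMENT table stands in for the early static decision about what executes where.

```python
import json
from typing import Any, Callable, Dict, List

# Partitioning: the program is expressed as named tasks (hypothetical examples).
TASKS: Dict[str, Callable[[Any], Any]] = {
    "square": lambda xs: [x * x for x in xs],
    "total": lambda xs: sum(xs),
}

# Mapping: a static table fixes which element runs each task.
PLACEMENT: Dict[str, str] = {"square": "gpu", "total": "cpu"}


def make_element(name: str) -> Callable[[str, bytes], bytes]:
    """Build a simulated computational element that only exchanges bytes."""

    def run(task_name: str, payload: bytes) -> bytes:
        data = json.loads(payload.decode("utf-8"))    # receive and deserialize
        result = TASKS[task_name](data)               # execute the task
        return json.dumps(result).encode("utf-8")     # serialize the result

    return run


def co_execute(pipeline: List[str], data: Any) -> Any:
    """Run the tasks in order, routing serialized data between elements."""
    elements = {name: make_element(name) for name in set(PLACEMENT.values())}
    payload = json.dumps(data).encode("utf-8")        # serialize the input
    for task_name in pipeline:
        element = elements[PLACEMENT[task_name]]      # route to mapped element
        payload = element(task_name, payload)
    return json.loads(payload.decode("utf-8"))        # deserialize final result
```

For example, co_execute(["square", "total"], [1, 2, 3]) squares the values on the simulated "gpu" element and sums them on the simulated "cpu" element. Moving a task to a different element means editing PLACEMENT by hand, which in a real system would also mean rewriting the task in that element's language.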