1. Technical Field
The present principles generally relate to management of computations on a computing platform, and more particularly, to management of computations and data transfers in a hybrid computing system including a parallel accelerator.
2. Description of the Related Art
Domain-specific parallel processors, such as Graphics Processing Units (GPUs) and Network Processors, provide very high levels of computational capability for specific application domains, or for computations whose characteristics are well suited to their architecture. Domain-specific parallel processors are often added to computing platforms that include general-purpose host processors, where they are used as accelerators onto which specific computations are offloaded.
For example, graphics processors, which have traditionally been used only for graphics processing, have emerged as a promising means of accelerating a wide range of highly parallel, computation-intensive applications. From an architectural perspective, GPUs have evolved from specialized application-specific circuits into relatively general-purpose architectures (called GPGPUs) that can be programmed to execute arbitrary computations.
Although GPU architectures are evolving to be increasingly general and programmable, they are specifically optimized for the computational characteristics of graphics processing. Therefore, they are primarily suited to application domains that share similar computational characteristics. Such domains are, for example, highly data-parallel and computation-intensive, and require high memory bandwidth with minimal control flow and synchronization. For other workloads, mainstream multicore microprocessors retain their advantage.
Because of the great promise of GPGPUs, many applications have been parallelized on GPGPUs. Some examples include computational fluid dynamics, molecular simulations, biomedical image processing, securities modeling in finance, seismic data analysis for oil and gas exploration, image and video processing, and computer vision.
However, many challenges remain to be addressed before the potential of GPGPUs can be truly harnessed on a broader scale. Despite advances in GPU programming due to frameworks such as the Compute Unified Device Architecture (CUDA), BrookGPU from Stanford, and the Stream Software Development Kit (SDK), writing high-performance GPU programs remains a task that requires familiarity with the GPU architecture. The performance of a GPU program is impacted significantly and in complex ways by factors such as how a computation is organized into threads and groups of threads, register and on-chip shared memory usage, off-chip memory access characteristics, synchronization among threads on the GPU and with the host, and data transfer between the GPU memory and host memory. Due to these challenges, GPUs remain inaccessible to domain experts who are accustomed to programming in very high-level languages. Relying on GPU programming experts to port each application individually is not scalable in terms of programming effort.
Efforts to develop more general programming frameworks that address the aforementioned challenges include (i) libraries and runtimes that implement data-parallel programming models, (ii) compilers and autotuners for GPUs, and (iii) stream programming frameworks.
One such example is “Accelerator,” which uses a shader operation graph representation of programs to map them to GPUs. However, Accelerator assumes that GPUs are not general-purpose and generates code in a shader language. In addition, because the overhead of a GPU call is high, Accelerator attempts to merge operations aggressively. As another example, CUDA automatically manages the creation and execution of threads and low-level synchronization with the GPU. Other prior work includes general-purpose parallel programming and streaming frameworks, such as RapidMind and PeakStream, which provide higher-level languages and APIs that programmers can use to target heterogeneous computing platforms including GPUs. However, prior work fails to address the problems of executing computations that do not fit within GPU memory and of managing Central Processing Unit (CPU)/GPU memory transfers efficiently and in a scalable manner.
As stated above, domain-specific parallel processors are used with general-purpose host processors as accelerators onto which specific computations are offloaded. Offloading a computation onto an accelerator requires synchronization and data transfer between the host processor, or its associated host memory system, and the accelerator, or its associated accelerator memory. Unfortunately, programming accelerators is significantly more complex than programming general-purpose processors. In addition to coping with the relatively low level of abstraction supported by accelerator software interfaces such as compilers and runtime libraries, achieving good performance involves careful consideration of the synchronization and data transfer overheads involved in using accelerators.
Most application domains that utilize parallel accelerators process very large data sets. Unfortunately, the memory sizes of the accelerators are themselves often constrained, making it necessary to efficiently manage the limited accelerator memory and the data transfers between the host memory and the accelerator memory. Current programming frameworks for parallel accelerators such as GPUs do not provide a means to address this problem. Specifically, current frameworks fail to address executing computations on data sets that are too large to fit within the accelerator memory. Furthermore, prior art frameworks do not provide a means for minimizing data transfers between a CPU and a parallel processor or for managing data transfers in a scalable manner. As a result, programs that are written for a specific application and a specific accelerator are not easily scalable to larger data sizes or to new accelerator platforms with more memory.
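By way of illustration only, the burden that this places on the programmer may be sketched as follows. The sketch is not taken from any existing framework: the accelerator is simulated in plain Python, and the names SimulatedAccelerator, copy_in, run, copy_out, and offload_in_chunks are hypothetical. It assumes a simple element-wise computation, for which the data set can be split into independent pieces that each fit within the accelerator memory.

```python
from typing import Callable, List


class SimulatedAccelerator:
    """Hypothetical stand-in for an accelerator whose memory holds `capacity` elements."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.memory: List[float] = []
        # Total elements moved between host and accelerator, in both directions.
        self.elements_transferred = 0

    def copy_in(self, chunk: List[float]) -> None:
        """Host-to-accelerator transfer; fails if the chunk exceeds accelerator memory."""
        if len(chunk) > self.capacity:
            raise MemoryError("chunk does not fit in accelerator memory")
        self.memory = list(chunk)
        self.elements_transferred += len(chunk)

    def run(self, kernel: Callable[[float], float]) -> None:
        """Stands in for a data-parallel kernel applied to every resident element."""
        self.memory = [kernel(x) for x in self.memory]

    def copy_out(self) -> List[float]:
        """Accelerator-to-host transfer of the resident data."""
        self.elements_transferred += len(self.memory)
        return list(self.memory)


def offload_in_chunks(
    data: List[float],
    kernel: Callable[[float], float],
    device: SimulatedAccelerator,
) -> List[float]:
    """Manually split `data` into pieces that fit on the device and process each in turn."""
    result: List[float] = []
    for start in range(0, len(data), device.capacity):
        device.copy_in(data[start:start + device.capacity])
        device.run(kernel)
        result.extend(device.copy_out())
    return result


if __name__ == "__main__":
    device = SimulatedAccelerator(capacity=4)   # memory smaller than the data set
    data = [float(i) for i in range(10)]
    doubled = offload_in_chunks(data, lambda x: 2.0 * x, device)
    print(doubled)
    print(device.elements_transferred)
```

In this sketch the splitting logic is tied to the capacity of one particular device and to one particular computation, and every element crosses the host/accelerator boundary twice for each operation. A computation composed of several such operations would repeat those transfers unless the programmer restructured the code by hand, which is the kind of manual effort the preceding paragraph describes.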