1. Technical Field
The present invention relates to pipelining, and more specifically to automatic generation of software pipelines for heterogeneous parallel systems.
2. Description of the Related Art
Less than a decade into the era of mainstream parallel computing, heterogeneity has emerged as an important characteristic of parallel computing platforms. Heterogeneous parallel computing platforms combine processing units that have different instruction set architectures or different micro-architectures, such as multi-core processors, general-purpose graphics processing units (GPGPUs), and other many-core accelerators. Each type of processing unit in a heterogeneous platform is optimized to serve a different part of an application or workload (e.g., latency-sensitive vs. throughput-sensitive sections, coarse-grained vs. fine-grained parallel sections, etc.). Several efforts have argued for the fundamental benefits of heterogeneity in parallel computing, and have demonstrated speedups on heterogeneous platforms for a wide range of application domains.
There have been efforts to address pipelining in the context of homogeneous parallel platforms. Previous research has shown that it is important to consider all available sources of parallelism (data, task, and pipeline parallelism) and to exploit the right combination of them in order to obtain the best performance on a given parallel platform. Application programming interfaces (APIs) for easy specification of pipeline patterns have been provided in parallel programming frameworks such as, for example, Intel Threading Building Blocks (TBB) and the Microsoft Task Parallel Library (TPL). The computations encapsulated in pipeline stages are issued as tasks to a lower-level runtime, which schedules the tasks onto the cores of a multi-core platform. These frameworks require the programmer to manually identify the individual stages of the pipeline and program them using the provided API. However, identifying the performance-optimal partition of a program region into pipeline stages is far from trivial, and currently demands extensive programming effort and careful attention to memory usage. The stages need to be balanced, since the throughput of the pipeline is determined by the slowest stage. Moreover, it is desirable to minimize the memory consumed by the queues that store the data communicated across stages, which precludes the creation of arbitrarily fine-grained stages.
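The balancing constraint can be illustrated with a small sketch. It assumes a linear pipeline whose stages are contiguous groups of tasks, with hypothetical per-task costs, and uses the simple steady-state model in which one item completes per slowest-stage interval. The helper names (`stage_times`, `throughput`, `best_partition`) are illustrative only, and the exhaustive search is chosen for clarity rather than efficiency; it is not the partitioning method of the present invention.

```python
from itertools import combinations


def stage_times(task_costs, cuts):
    """Sum task costs into contiguous stages; each cut is the index where a new stage begins."""
    bounds = [0, *cuts, len(task_costs)]
    return [sum(task_costs[a:b]) for a, b in zip(bounds, bounds[1:])]


def throughput(times):
    # Steady-state model: the slowest stage bounds the rate at which items leave the pipeline.
    return 1.0 / max(times)


def best_partition(task_costs, num_stages):
    """Try every contiguous split into num_stages stages; keep the one with the smallest bottleneck."""
    best = None
    for cuts in combinations(range(1, len(task_costs)), num_stages - 1):
        times = stage_times(task_costs, list(cuts))
        if best is None or max(times) < max(best[1]):
            best = (cuts, times)
    return best


# Hypothetical per-item task costs (ms) for a six-task program region split into three stages.
costs = [4, 2, 3, 5, 1, 3]
cuts, times = best_partition(costs, 3)
print(cuts, times, throughput(times))  # (2, 4) [6, 8, 4] 0.125
```

With these assumed costs, the balanced split yields stage times of 6, 8, and 4 ms, so throughput is limited to one item per 8 ms. An unbalanced split such as cuts at (1, 2) gives stage times of 4, 2, and 12 ms and only one item per 12 ms. Each additional stage also adds an inter-stage queue, which is why stages cannot simply be made arbitrarily small.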
In the context of heterogeneous platforms, the problem is significantly more challenging, since the execution time of a stage is highly sensitive to how the tasks in that stage are mapped and scheduled onto the various processing units of the heterogeneous platform. Heterogeneous parallel platforms, which are composed of processing units that have different instruction set architectures or micro-architectures, are increasingly prevalent. Previous software programming frameworks for heterogeneous systems exploit data parallelism, where the same computation is performed in parallel on different data elements, and task parallelism, where tasks without interdependencies are executed in parallel. Pipeline parallelism is a different form of parallelism that, when exploited, can significantly improve program performance. However, previous programming frameworks for heterogeneous parallel platforms cannot exploit pipeline parallelism. Moreover, they leave programmers with the significant challenges of tuning accelerator code for performance, partitioning and scheduling applications across the different processing units of a heterogeneous platform, and managing data transfers between their (often distinct) memory hierarchies.
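The sensitivity of stage execution time to the mapping can be shown with a second sketch. It assumes three hypothetical stages (`decode`, `transform`, `encode`) with made-up per-item costs on a CPU and a GPU. It also assumes a simplified model in which stages placed on the same processing unit share it sequentially, so the steady-state interval is the load of the busiest unit. Data-transfer costs between memory hierarchies are deliberately ignored here, although in practice they can dominate.

```python
from itertools import product

# Hypothetical per-item execution times (ms) of each stage on each processing unit.
STAGE_COSTS = {
    "decode": {"cpu": 2.0, "gpu": 6.0},
    "transform": {"cpu": 9.0, "gpu": 1.5},
    "encode": {"cpu": 3.0, "gpu": 4.0},
}


def interval(mapping, costs):
    """Steady-state time per item: total load on the busiest unit (transfers ignored)."""
    load = {}
    for stage, unit in mapping.items():
        load[unit] = load.get(unit, 0.0) + costs[stage][unit]
    return max(load.values())


def best_mapping(costs):
    """Exhaustively assign each stage to a unit and keep the assignment with the shortest interval."""
    stages = list(costs)
    units = sorted({unit for per_unit in costs.values() for unit in per_unit})
    best = None
    for choice in product(units, repeat=len(stages)):
        mapping = dict(zip(stages, choice))
        t = interval(mapping, costs)
        if best is None or t < best[1]:
            best = (mapping, t)
    return best


mapping, t = best_mapping(STAGE_COSTS)
print(mapping, t)  # {'decode': 'cpu', 'transform': 'gpu', 'encode': 'cpu'} 5.0
```

Under these assumed numbers, running every stage on the CPU gives an interval of 14 ms and running every stage on the GPU gives 11.5 ms. Placing only `transform` on the GPU gives 5 ms. The same stage partition therefore performs very differently depending on the mapping.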
Therefore, there is a need to automatically generate, starting from an annotated, non-pipelined specification, a pipelined program that is optimized for a heterogeneous parallel platform, in which the execution of the program is automatically pipelined at a coarse granularity across processing units (e.g., multi-core processors or graphics processing units (GPUs)).