Interest in parallel computer systems has increased rapidly in recent years. Several physical problems, in particular that of power density, prevent further increases in clock frequency, which makes parallel execution the most viable path to further significant growth in performance.
Parallel computing, in which a multitude of processors execute concurrently, is one solution attracting interest and research. The use of parallel computer systems is particularly attractive if performance is put in relation to power consumption and related metrics, such as heat dissipation. However, it is hard to develop software that efficiently utilizes parallel computer systems; development cost and lead time present obstacles to progress in this direction.
Multiprocessing computer systems, including multiple-instruction stream multiple-data, MIMD, architectures, utilize several CPUs that operate in parallel, such that computational tasks may be distributed over the CPUs. Computer clusters, multi-core or many-core processors and processors with support for hardware multi-threading, including hyper-threading, are examples of multiprocessing computer systems or, alternatively, building blocks of such systems.
In contrast, a synchronous parallel computer, such as a single-instruction stream multiple-data, SIMD, architecture, may comprise a single CPU which decodes a single instruction stream and multiple processing elements, each of which consists at least of an ALU and memory. In this case the parallelism is achieved by performing a single operation on multiple instances of data. Processor arrays, vector computers and parallel stream processors, including graphics processors, GPUs, are examples of this class of architectures. The border between the two classes of parallel architectures is not clear-cut: a GPU, for instance, may comprise several CPUs which decode independent instruction streams and provide hardware support for multi-threading, thus forming an MIMD architecture. Further, each CPU of the graphics processor may generally comprise multiple processing elements, thereby forming a SIMD architecture.
However, developing software that efficiently utilizes parallel computer systems is costly and time-consuming. One solution to this problem is to write programs that are independent of the target architecture at hand and to transform the program into a form that exposes parallelism in a manner appropriate for the particular target architecture. Such program transformation is known as parallelization in the context of MIMD architectures and vectorization in the context of SIMD architectures. Parallelization and vectorization have been studied extensively in the area of high-performance computing. A fundamental part of these tasks has typically been dependence analysis, a task whose complexity depends on the programming language being analyzed. In particular, it is well known that programming languages with pointers, such as the C programming language, make the analysis of data dependence a very complicated matter. Analysis of the dependence caused by array references may also be complex and is generally addressed by heuristic methods and approximation.
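The role of dependence analysis can be illustrated with a minimal sketch. The function names and data below are hypothetical and serve only as an illustration. Python lists passed by reference stand in for C pointers, and a reversed iteration order stands in for the arbitrary order in which a parallel machine might execute the iterations. A loop is safe to parallelize or vectorize only if its result does not depend on that order.

```python
def independent_loop(a, b, order):
    """c[i] = a[i] + b[i]: no iteration reads a value written by another."""
    c = [0] * len(a)
    for i in order:
        c[i] = a[i] + b[i]
    return c


def add_neighbour(dst, src, order):
    """dst[i] = src[i - 1] + 1 for every i in `order` (all i >= 1).

    If dst and src refer to the same list (aliasing), iteration i reads
    the value written by iteration i - 1: a loop-carried dependence.
    """
    for i in order:
        dst[i] = src[i - 1] + 1
    return dst


n = 5
forward = list(range(1, n))
backward = list(reversed(forward))

# Independent iterations: any execution order gives the same result.
a, b = [1, 2, 3, 4, 5], [10, 20, 30, 40, 50]
same_result = independent_loop(a, b, range(n)) == independent_loop(a, b, reversed(range(n)))

# Distinct source and destination: still order-independent.
src = [10, 20, 30, 40, 50]
no_alias_fwd = add_neighbour([0] * n, src, forward)
no_alias_bwd = add_neighbour([0] * n, src, backward)

# Aliased source and destination: the result depends on the order.
alias_fwd = add_neighbour([0] * n, None or (x := [0] * n), forward) if False else None
x = [0] * n
alias_fwd = add_neighbour(x, x, forward)
y = [0] * n
alias_bwd = add_neighbour(y, y, backward)

print(same_result, no_alias_fwd == no_alias_bwd, alias_fwd, alias_bwd)
```

Whether `add_neighbour` may be parallelized thus depends on whether its two arguments alias, which is a property of the call site rather than of the loop body. This is the kind of question that pointers make hard to answer at compile time.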
As computer software has traditionally been written for serial computation, sequential computer programs are not laid out for easy parallelization. Dataflow programming is therefore being investigated as a way to specify massively parallel algorithms. Although dataflow programs are easier to parallelize, they still need to be mapped onto the different processing units of the system. The mapping may be done statically at compile time or dynamically at run-time. Static scheduling, of the entire program or part thereof, is beneficial in certain situations; in particular, the run-time overhead typically associated with dynamic scheduling may be avoided.
However, mapping a dataflow program onto a number of processors, each performing a specific subtask, is not straightforward since, for example, synchronization between the different subtasks must be achieved. Furthermore, a program may typically comprise parts that are parallelizable as well as parts that are non-parallelizable, i.e. sequential.
It has been suggested to perform parallelization by finding looped schedules in synchronous dataflow programs. A looped schedule may be seen as a serialization of the actor firings in the form of a loop nest, which means that traditional parallelization techniques are applicable, see for example S. S. Bhattacharyya and E. A. Lee, “Scheduling Synchronous Dataflow Graphs for Efficient Looping”, J. VLSI Signal Processing, 6, pp. 271-288, Kluwer Academic Publishers, 1993. However, there are generally a large number of options for both the loop nest and the serialization of the actor firings, and the choices made affect the properties of the resulting parallel program, such as the CPU utilization, latency, synchronization overhead, storage requirements etc. Furthermore, the formation of a looped schedule forces premature decisions, which are likely to lead to suboptimal solutions.
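As an illustration of how the choice of serialization affects storage requirements, the following minimal sketch considers a small synchronous dataflow graph. The graph, the token rates and all function names are hypothetical and are not taken from the cited paper. The sketch solves the balance equations for the number of firings of each actor per schedule period, forms a flat looped schedule, and compares its peak buffer occupancy with that of an interleaved serialization of the same firings. It assumes a connected, acyclic graph.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

# Hypothetical graph: (producer, consumer, tokens produced, tokens consumed).
ACTORS = ["A", "B", "C"]  # listed in topological order
EDGES = [("A", "B", 2, 3), ("B", "C", 1, 2)]


def repetition_vector(actors, edges):
    """Smallest positive integers q with q[src] * prod == q[dst] * cons."""
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:  # propagate relative firing rates along the edges
        changed = False
        for src, dst, prod, cons in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
                changed = True
    for src, dst, prod, cons in edges:
        if rate[src] * prod != rate[dst] * cons:
            raise ValueError("inconsistent token rates")
    # Scale by the least common multiple of the denominators.
    scale = reduce(lambda x, y: x * y // gcd(x, y),
                   (r.denominator for r in rate.values()), 1)
    return {a: int(rate[a] * scale) for a in actors}


def expand(looped_schedule):
    """Turn [(count, actor), ...] into a flat list of actor firings."""
    return [actor for count, actor in looped_schedule for _ in range(count)]


def peak_tokens(firings, edges):
    """Simulate the firings and record the peak token count on each edge."""
    tokens = {(s, d): 0 for s, d, _, _ in edges}
    peak = dict(tokens)
    for actor in firings:
        for src, dst, prod, cons in edges:
            if dst == actor:
                if tokens[(src, dst)] < cons:
                    raise ValueError(f"{actor} fired without enough input")
                tokens[(src, dst)] -= cons
        for src, dst, prod, cons in edges:
            if src == actor:
                tokens[(src, dst)] += prod
                peak[(src, dst)] = max(peak[(src, dst)], tokens[(src, dst)])
    return peak


q = repetition_vector(ACTORS, EDGES)
flat = [(q[a], a) for a in ACTORS]        # one loop per actor
flat_peak = peak_tokens(expand(flat), EDGES)
interleaved_peak = peak_tokens(list("AABABC"), EDGES)
print(q, flat, flat_peak, interleaved_peak)
```

Both serializations fire every actor the same number of times, yet under these assumed rates the flat looped schedule needs room for six tokens on the edge from A to B, whereas the interleaved order needs room for only four.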