The initial specification of an application to be implemented on an embedded platform is often written as sequential code, for instance sequential C or C++ code. A typical example of such an application is the MPEG-4 reference code. For a high-performance, low-power implementation on an embedded system or platform, the code must be analyzed, transformed and parallelized before it is compiled or, more ideally, while it is compiled. The target system or platform is typically heterogeneous and can contain programmable as well as custom processor cores. Parallelized here means that the code is broken up or partitioned into at least two parts, which can be executed in real time, at least partially, at substantially the same time.
Parallelization is an important step that can have a serious impact on performance and power. Ideally, processor load must be balanced and operations must be assigned to the processor that executes them most efficiently. In addition, inter-processor communication must be taken into account. Today, parallelization is typically done manually by an experienced designer. As it is a time-consuming and error-prone task, it is typically done only once in the design cycle: time-to-market pressure leaves no time for exploration.
A particular form of parallelization is so-called program splicing, which determines parallel processes in sequential code that each contribute separately to the result (or an intermediate result) of a program. Said parallel processes do not communicate information between each other.
Embedded applications such as multimedia applications are often streaming applications with a relatively complex control flow, including data dependent conditions. To distribute such an application over processors, coarse grain (also denoted task level) parallelization techniques must be used. Fine grain techniques such as instruction level parallelism or fine grain data parallelism (e.g. SIMD) only apply to a single processor and cannot handle complex data dependent control flow. Parallel executable codes are denoted task parallel or coarse grain parallel if they are each intended to run on a separate processor.
A streaming application is often parallelized manually by splitting it into pipeline stages. Pipelining can be considered a subset of parallelization. Each pipeline stage is then assigned to a processor that can efficiently handle the operations required in that stage (see FIG. 2). This approach differs from the approach taken by existing parallelizing compilers. These compilers target symmetric multiprocessors and exploit data parallelism by executing different iterations of a loop in parallel on different processors, which again is only applicable when these iterations do not have to communicate with each other (see FIG. 2). They are not able to efficiently exploit specialized processor platforms or systems.
Other methods for performing task level pipelining (sometimes also called coarse grain or functional pipelining) to parallelize embedded applications are discussed below. There are however important differences in what these programs can do and in the restrictions they have on the sequential input description of the application to be parallelized.
The FP-MAP tool [I. Karkowski, H. Corporaal, “Design of Heterogeneous Multi-processor Embedded Systems: Applying Functional Pipelining”, PACT'97, San Francisco, USA, November 1997], [I. Karkowski, H. Corporaal, “Overcoming the Limitations of the Traditional Loop Parallelization”, Proceedings of the HPCN'97, April 1997], [I. Karkowski, and H. Corporaal, “FP-Map - An Approach to the Functional Pipelining of Embedded Programs”, 4th International Conf. on High-Performance Computing, pp. 415-420, Bangalore, December 1997.], [I. Karkowski, H. Corporaal, “Design Space Exploration Algorithm for Heterogeneous Multi Processor Embedded System Design”, 35th DAC Anniversary, pp. 82-87, San Francisco, Calif., USA, June 1998] explicitly targets embedded platforms and uses task level pipelining (also called functional pipelining). It automatically determines task boundaries based on execution times and branch probability estimates, and hence does not allow user selection. FP-MAP performs no interprocedural analysis and has no code generation. FP-MAP accepts arbitrary C code (i.e. it is not limited to affine nested loop programs). It automatically distributes the statements of a given loop nest over a given number of processors, minimizing the initiation interval of the pipeline. The loop nest to be parallelized must be tightly nested: a nested loop must be the only statement in the body of the surrounding loop. Function calls in the loop body are assumed to be atomic and are never distributed over processors. The data flow analysis used in FP-MAP is dynamic: data dependences are registered during execution of the C code for a given set of inputs. This means that FP-MAP can analyze data flow for arbitrary control flow, including data dependent conditions, and for arbitrary index expressions, but the analysis is only correct if the selected set of inputs triggers all relevant execution paths.
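By way of illustration only, the principle of such a dynamic data flow analysis can be sketched in C as follows. The sketch is not taken from FP-MAP; the shadow array `last_writer`, the helpers `trace_write`, `trace_read` and `trace_has_dep`, and the statement labels S0 to S2 are hypothetical names introduced here. Each memory cell remembers the statement that last wrote it, and each read registers a dependence from that statement to the reading statement.

```c
#include <assert.h>
#include <string.h>

#define MAX_STMTS 8
#define MEM_SIZE  16
#define NO_WRITER (-1)

/* Shadow memory: statement that last wrote each cell (hypothetical layout). */
static int last_writer[MEM_SIZE];
/* dep[p][c] != 0 means statement c read a value written by statement p. */
static int dep[MAX_STMTS][MAX_STMTS];

void trace_reset(void)
{
    for (int i = 0; i < MEM_SIZE; i++)
        last_writer[i] = NO_WRITER;
    memset(dep, 0, sizeof dep);
}

void trace_write(int stmt, int addr)
{
    assert(stmt >= 0 && stmt < MAX_STMTS && addr >= 0 && addr < MEM_SIZE);
    last_writer[addr] = stmt;
}

void trace_read(int stmt, int addr)
{
    assert(stmt >= 0 && stmt < MAX_STMTS && addr >= 0 && addr < MEM_SIZE);
    if (last_writer[addr] != NO_WRITER)
        dep[last_writer[addr]][stmt] = 1;
}

int trace_has_dep(int producer, int consumer)
{
    return dep[producer][consumer];
}

/* Instrumented loop with a data dependent condition.
 * S1 only executes for negative inputs, so the dependences S0->S1 and
 * S1->S2 are only registered if the chosen inputs contain one. */
int run_instrumented(const int *input, int n)
{
    int buf[MEM_SIZE] = {0};
    int sum = 0;
    assert(n >= 0 && n <= MEM_SIZE);
    for (int i = 0; i < n; i++) {
        /* S0 */ buf[i] = input[i] * 2;
        trace_write(0, i);
        if (input[i] < 0) {
            /* S1 */ trace_read(1, i);
            buf[i] = -buf[i];
            trace_write(1, i);
        }
        /* S2 */ trace_read(2, i);
        sum += buf[i];
    }
    return sum;
}
```

Running `run_instrumented` on inputs that are all non-negative registers only the dependence from S0 to S2; the dependences through S1 stay invisible until an input set containing a negative value is used, which illustrates why the result depends on the selected inputs.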
The distribution algorithm relies on estimates of execution times and branch execution probabilities obtained through profiling. Execution times are assumed to be independent of the processor, and the cost and feasibility of the resulting communication on the target platform are not taken into account. FP-MAP does not automatically insert communication channels or generate code; the result is an assignment of statements to processors.
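The objective of such a distribution can be illustrated with a minimal C sketch. It is not the FP-MAP algorithm, whose details are not given here. The sketch assumes that statements keep their original order, that each pipeline stage is a contiguous run of statements, that the initiation interval equals the execution time of the slowest stage, and, as stated above, that execution times are processor independent and communication cost is ignored. The names `fits` and `min_initiation_interval` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the statements, taken in order, can be packed into at most
 * max_stages contiguous stages whose summed time never exceeds limit. */
static int fits(const unsigned *t, size_t n, size_t max_stages, unsigned limit)
{
    size_t stages = 1;
    unsigned load = 0;
    for (size_t i = 0; i < n; i++) {
        if (t[i] > limit)
            return 0;               /* a single statement is already too slow */
        if (load + t[i] > limit) {  /* open a new stage on the next processor */
            stages++;
            load = 0;
        }
        load += t[i];
    }
    return stages <= max_stages;
}

/* Smallest achievable time of the slowest stage when n statements with
 * profiled times t[] are distributed over `stages` processors.
 * Binary search over the candidate interval; overflow of the summed
 * times is not handled in this sketch. */
unsigned min_initiation_interval(const unsigned *t, size_t n, size_t stages)
{
    unsigned lo = 0, hi = 0;
    assert(stages >= 1);
    for (size_t i = 0; i < n; i++) {
        if (t[i] > lo)
            lo = t[i];              /* lower bound: slowest single statement */
        hi += t[i];                 /* upper bound: everything in one stage */
    }
    while (lo < hi) {
        unsigned mid = lo + (hi - lo) / 2;
        if (fits(t, n, stages, mid))
            hi = mid;
        else
            lo = mid + 1;
    }
    return lo;
}
```

For profiled times 4, 2, 3, 5, 1, 3 and three processors, the best contiguous split under these assumptions is {4, 2}, {3, 5}, {1, 3}, giving an interval of 8 time units.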
The Compaan tool [Bart Kienhuis, Edwin Rijpkema, Ed Deprettere, “Compaan: deriving process networks from Matlab for embedded signal processing architectures”, Proceedings of the eighth international workshop on Hardware/software codesign, p. 13-17, May 2000, San Diego, Calif., United States], [Todor Stefanov, Claudiu Zissulescu, Alexandru Turjan, Bart Kienhuis, Ed Deprettere, “Compaan: deriving process networks from Matlab for embedded signal processing architectures”, DATE 2004, February 2004, Paris, France], [Alexandru Turjan, Bart Kienhuis, Ed Deprettere, “Translating affine nested-loop programs to Process Networks”, Proceedings of the International Conference on Compiler, Architecture, and Synthesis for Embedded Systems (CASES), 2004, Washington USA] automatically derives a Kahn process network from a sequential application. Kahn process networks are closely related to task level pipelining and are a natural match for streaming applications. However, Compaan starts from Matlab, not C, and can only handle affine nested loop programs. For Compaan, the application to be parallelized must first be rewritten as an affine nested loop program (ANLP). A nested loop program is a program consisting of for-loops, if-then-else statements, and function calls. A function call represents an (unspecified) operation; its inputs and outputs can be scalars and (indexed) arrays. A nested loop program is said to be affine if all index expressions, loop bounds and arguments of comparison operators in conditions are linear combinations of the surrounding loop iterators. Compaan can also handle some occurrences of non-linear operators such as div, mod, ceil and floor; these operators can be eliminated by transforming the nested loop program. Compaan automatically transforms an affine nested loop program into an equivalent process network. Each function call becomes a process, and processes communicate through FIFOs.
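For illustration, a small program of the kind described above may look as follows when written in C rather than Matlab. The operations `produce`, `combine` and `pass`, the size `N` and the function name `anlp_example` are hypothetical and chosen only for this sketch. All loop bounds, index expressions and the condition are linear in the iterators `i` and `j` with constant offsets, which is the usual reading of affine.

```c
#include <assert.h>

#define N 4

/* Hypothetical operations; in a nested loop program a function call
 * stands for an unspecified operation on scalars or array elements. */
static int produce(int i)        { return 10 * i; }
static int combine(int a, int b) { return a + b; }
static int pass(int a)           { return a; }

/* Consists only of for-loops, an if-then-else and function calls. */
void anlp_example(int out[N][N])
{
    int a[N + 1];
    assert(out);

    for (int i = 0; i <= N; i++)
        a[i] = produce(i);

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            if (i + j < N)                       /* affine condition */
                out[i][j] = combine(a[i], a[i + j + 1]);
            else
                out[i][j] = pass(a[i + j - N]);  /* affine index */
        }
    }
}

/* Not affine, and hence outside this class of programs:
 *   a[i * j]          product of two iterators
 *   a[idx[i]]         index read from another array
 *   if (a[i] > 0)     data dependent condition
 */
```

Under the transformation described above, each of the three calls would become a process, with FIFOs carrying the elements of `a` from `produce` to `combine` and `pass`.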
If productions and consumptions of the data to be communicated match one-to-one, a plain scalar FIFO is inserted. Otherwise, a special FIFO is used that can reorder and/or duplicate tokens. Exact data flow analysis techniques [Paul Feautrier, “Dataflow Analysis of Scalar and Array References”, International Journal of Parallel Programming 20(1):23-53, 1991.], [D. E. Maydan, S. P. Amarasinghe, and M. S. Lam, “Data-Dependence and Data-Flow Analysis of Arrays”, Proceedings of the 5th Workshop of Languages and Compilers for Parallel Computing, pp 434-448, August 1992.], [William Pugh, “The Omega Test: A Fast and Practical Integer Programming Algorithm for Dependence Analysis”, Communications of the ACM 35(8):102-114, 1992.] are used to determine where FIFOs need to be inserted and what kind of FIFO is needed.
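The one-to-one case can be sketched in C as follows. This is a hand-written illustration, not output of Compaan. It assumes a bounded ring buffer of hypothetical capacity `FIFO_CAP` and a simple sequential scheduler that alternates between the two processes, whereas a Kahn process network is defined with unbounded FIFOs and blocking reads. The operations `f1` and `f2` and all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAP 4

/* Plain scalar FIFO implemented as a ring buffer. */
typedef struct {
    int    buf[FIFO_CAP];
    size_t head;   /* index of the oldest token */
    size_t count;  /* number of tokens stored   */
} fifo_t;

static int fifo_full(const fifo_t *f)  { return f->count == FIFO_CAP; }
static int fifo_empty(const fifo_t *f) { return f->count == 0; }

static void fifo_put(fifo_t *f, int v)
{
    assert(!fifo_full(f));
    f->buf[(f->head + f->count) % FIFO_CAP] = v;
    f->count++;
}

static int fifo_get(fifo_t *f)
{
    assert(!fifo_empty(f));
    int v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return v;
}

/* Hypothetical operations standing in for two function calls. */
static int f1(int i) { return i * i; }
static int f2(int x) { return x + 1; }

/* Sequential reference: t[i] is produced once and consumed once. */
void run_sequential(int n, int *t, int *out)
{
    for (int i = 0; i < n; i++)
        t[i] = f1(i);
    for (int i = 0; i < n; i++)
        out[i] = f2(t[i]);
}

/* Process network version: the array t is replaced by a FIFO channel. */
void run_network(int n, int *out)
{
    fifo_t ch = {{0}, 0, 0};
    int produced = 0, consumed = 0;
    while (consumed < n) {
        /* Process P1 fires while it has work and the channel has room. */
        while (produced < n && !fifo_full(&ch))
            fifo_put(&ch, f1(produced++));
        /* Process P2 fires while tokens are available. */
        while (!fifo_empty(&ch))
            out[consumed++] = f2(fifo_get(&ch));
    }
}
```

Because every token is written once and read once in the same order, the plain FIFO suffices and both versions produce the same output. If the consumer needed `t[i]` in a different order, or more than once, the plain FIFO would no longer be sufficient, which corresponds to the reordering or duplicating FIFO mentioned above.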