To execute data-handling instructions (programs) written in a high-level language on a particular data-handling architecture, conventional compilers translate the instructions of the high-level language into instructions that are better adapted to the architecture used. Compilers that specifically support highly parallel architectures are referred to as parallelizing compilers.
Conventional parallelizing compilers normally use special constructs such as semaphores and/or other methods for synchronization, and these methods are typically technology-specific. Conventional methods are not suitable for combining functionally specified architectures, with their particular dynamic response, and imperatively specified algorithms. The methods used therefore yield satisfactory results only in special cases.
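By way of illustration only, the kind of semaphore-based synchronization referred to above may be sketched as follows. The function name run_producer_consumer and the one-slot buffer are assumptions made for this example and are not taken from any particular conventional compiler.

```python
import threading

def run_producer_consumer(items):
    """Pass items between two threads through a one-slot buffer (illustrative)."""
    items = list(items)
    slot = [None]                       # shared one-element buffer
    empty = threading.Semaphore(1)      # counts free slots
    full = threading.Semaphore(0)       # counts filled slots
    received = []

    def producer():
        for item in items:
            empty.acquire()             # wait until the slot is free
            slot[0] = item
            full.release()              # signal that data is available

    def consumer():
        for _ in items:
            full.acquire()              # wait until data is available
            received.append(slot[0])
            empty.release()             # signal that the slot is free again

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

In this sketch the two semaphores enforce the order of the accesses to the shared slot; without them the consumer could read a value before it has been written, or the producer could overwrite a value that has not yet been read.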
Compilers for reconfigurable architectures, in particular for reconfigurable processors, generally use macros created specifically for the intended reconfigurable hardware; the macros are mostly written in hardware description languages such as Verilog, VHDL, or SystemC. These macros are then called (instantiated) from the program flow of an ordinary high-level language (e.g., C, C++).
There are conventional compilers for parallel computers which map program parts onto multiple processors on a coarse-grained structure, mostly based on complete functions or threads.
In addition, there are conventional vectorizing compilers which convert extensive linear data processing, such as computations of large expressions, into a vectorized form and thus permit computation on superscalar processors and vector processors (e.g., Pentium, Cray).
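The conversion of a linear computation into a vectorized form may be illustrated by the following minimal sketch. It assumes a vector width of four lanes, and the function vector_op is merely a stand-in for a single vector instruction; all names are hypothetical and the sketch does not reproduce the method of any specific vectorizing compiler.

```python
VECTOR_WIDTH = 4  # assumed number of lanes processed per vector operation

def scalar_saxpy(a, x, y):
    """Original form: one element is computed per loop iteration."""
    out = []
    for i in range(len(x)):
        out.append(a * x[i] + y[i])
    return out

def vector_op(a, xs, ys):
    """Stand-in for one vector instruction acting on all lanes at once."""
    return [a * xv + yv for xv, yv in zip(xs, ys)]

def strip_mined_saxpy(a, x, y, width=VECTOR_WIDTH):
    """Vectorized form: the loop advances in strips of `width` elements."""
    out = []
    for start in range(0, len(x), width):
        # The final strip may be shorter than `width` (remainder handling).
        out.extend(vector_op(a, x[start:start + width], y[start:start + width]))
    return out
```

Both functions compute the same result; the second merely groups the iterations so that each group could be issued to a vector or pipelined arithmetic unit as a whole.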
A method for automatic mapping of functionally or imperatively formulated computation procedures onto different target technologies is described here, in particular onto ASICs, reconfigurable modules (FPGAs, DPGAs, VPUs, chess array, kress array, Chameleon, etc.; hereinafter combined under the term VPU), sequential processors (CISC-/RISC CPUs, DSPs, etc.; hereinafter summarized by the term CPU) and parallel computer systems (SMP, MPP, etc.). In this connection, reference is made in particular to the following patents and patent applications by the present applicant: P 44 16 881.0-53, DE 197 81 412.3, DE 197 81 483.2, DE 196 54 846.2-53, DE 196 54 593.5-53, DE 197 04 044.6-53, DE 198 80 129.7, DE 198 61 088.2-53, DE 199 80 312.9, PCT/DE 00/01869, DE 100 36 627.9-33, DE 100 28 397.7, DE 101 10 530.4, DE 101 11 014.6, PCT/EP 00/10516, EP 01 102 674.7, PACT13, PACT17, PACT18, PACT22, PACT24, PACT25, PACT26US, PACT02, PACT04, PACT08, PACT10, PACT15, PACT18(a), PACT27, PACT19. Each of these is hereby fully incorporated herein by reference for disclosure purposes.
VPUs are basically composed of a multidimensional, homogeneous or inhomogeneous, flat or hierarchical array (PA) of cells (PAEs) which are capable of executing any functions, in particular logic functions and/or arithmetic functions and/or memory functions and/or network functions. PAEs are typically assigned a loading unit (CT) which determines the function of the PAEs by configuration and optionally reconfiguration.
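Purely as an illustration of this structure, a two-dimensional array of cells whose function is determined by a loading unit may be modeled as follows. The class names PAE and LoadingUnit, their methods, and the restriction to two-input arithmetic functions are assumptions made for this sketch and do not describe an actual VPU implementation.

```python
import operator

class PAE:
    """Hypothetical cell holding one configured two-input function."""
    def __init__(self):
        self.function = None            # unconfigured until the CT loads it

    def configure(self, function):
        self.function = function

    def execute(self, a, b):
        if self.function is None:
            raise RuntimeError("PAE is not configured")
        return self.function(a, b)

class LoadingUnit:
    """Hypothetical CT that writes configurations into a 2D array of PAEs."""
    def __init__(self, rows, cols):
        self.array = [[PAE() for _ in range(cols)] for _ in range(rows)]

    def load(self, configuration):
        # configuration maps (row, col) positions to the function to install
        for (row, col), function in configuration.items():
            self.array[row][col].configure(function)

ct = LoadingUnit(2, 2)
ct.load({(0, 0): operator.add, (0, 1): operator.mul})   # configuration
first = ct.array[0][0].execute(3, 4)                     # cell acts as adder
ct.load({(0, 0): operator.sub})                          # reconfiguration
second = ct.array[0][0].execute(3, 4)                    # same cell, new function
```

The same cell thus yields different results before and after the second call to load, which corresponds to the reconfiguration by the loading unit mentioned above.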
This method is based on an abstract parallel machine model which, in addition to the finite automaton, also integrates imperative problem specifications and permits an efficient algorithmic derivation of an implementation on different technologies.
The following compiler classes are conventional:
Classical compilers, which often generate stack machine code and are suitable for very simple processors, which are essentially designed as normal sequencers (see N. Wirth, Compilerbau [Compiler Design], Teubner Verlag).
Vectorizing compilers construct mostly linear code which is intended for special vector computers or for highly pipelined processors. These compilers were originally available for vector computers such as the Cray. Modern processors such as Pentium processors require similar methods because of their long pipeline structure. Since the individual computation steps are performed as vectorized (pipelined) steps, the code is much more efficient. However, conditional jumps cause problems for the pipeline. A jump prediction which assumes a jump destination is therefore appropriate. If this assumption is incorrect, however, the entire processing pipeline must be flushed. In other words, every jump is problematic for these compilers, and there is actually no parallel processing. Jump prediction and similar mechanisms require considerable additional hardware complexity.
There are hardly any coarse-grained parallel compilers in the true sense; parallelism is typically marked and managed by the programmer or the operating system, e.g., it is usually implemented at the thread level in MPP computer systems such as various IBM architectures, ASCI Red, etc. A thread is a largely independent program block or even a separate program. Threads are therefore easily parallelized at a coarse-grained level. Synchronization and data consistency must be ensured by the programmer or the operating system. This is complex to program and consumes a significant portion of the computing power of a parallel computer. In addition, because of this coarse parallelization, only a fraction of the parallelism that is actually possible is in fact usable.
Fine-grained parallel compilers (e.g., VLIW) attempt to map the parallelism in a fine-grained form onto VLIW arithmetic units that are capable of executing a plurality of computation operations in one clock cycle but may share a common register set. This limited register set is a significant problem because it must provide the data for all the computation operations. In addition, data dependencies and inconsistent read/write operations (LOAD/STORE) make parallelization difficult. Reconfigurable processors have a large number of independent arithmetic units, typically arranged in an array. These are typically interconnected by buses instead of by a common register set. Therefore, vector arithmetic units are easily constructed, while it is also possible to perform simple parallel operations. In contrast with traditional register concepts, data dependencies are resolved by the bus connections.
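The effect of data dependencies on fine-grained parallelization may be illustrated by the following minimal sketch, which groups operations into steps such that every operation in a step depends only on results of earlier steps. The function name schedule_by_dependencies and the dictionary representation are assumptions made for this example; the sketch further assumes that the dependencies are free of cycles and is not the scheduling method of any particular compiler.

```python
def schedule_by_dependencies(operations):
    """Group operations into steps that could execute in parallel.

    `operations` maps each result name to a tuple of operand names.
    Operands that are not themselves results count as external inputs.
    Assumes the dependency graph contains no cycles.
    """
    level = {}

    def level_of(name):
        if name not in operations:
            return -1                   # external input, available before step 0
        if name not in level:
            deps = operations[name]
            level[name] = 1 + max((level_of(d) for d in deps), default=-1)
        return level[name]

    for name in operations:
        level_of(name)

    steps = [[] for _ in range(max(level.values(), default=-1) + 1)]
    for name, lv in level.items():
        steps[lv].append(name)
    return [sorted(step) for step in steps]

example = {
    "t1": ("a", "b"),     # t1 = a op b
    "t2": ("c", "d"),     # t2 = c op d, independent of t1
    "t3": ("t1", "t2"),   # needs both earlier results
    "t4": ("t3", "a"),    # needs t3
}
plan = schedule_by_dependencies(example)
```

In this example only t1 and t2 may be computed in the same step; t3 and t4 each have to wait for a preceding result, so the dependency chain limits the parallelism that can be extracted regardless of how many arithmetic units are available.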