The present invention relates to processors and compiling methods for processors. In particular, but not exclusively, the present invention relates to processors having distributed functional units and register files and to compiling methods for such processors.
In high-performance computing, a high rate of instruction execution is usually required of the target machine (e.g. microprocessor). Execution time is often dominated by loop structures within the application program. To permit a high rate of instruction execution a processor may include a plurality of individual execution units, with each individual unit being capable of executing one or more instructions in parallel with the execution of instructions by the other execution units. The instructions to be executed in parallel may be combined together in a very long instruction word (VLIW) packet.
A compiling method for such a VLIW processor schedules an instruction by assigning it to be executed in a particular functional unit in a given cycle. The goal of an efficient schedule is to minimise the total execution time of a program by scheduling instructions in such a way as to maximise the use of the available hardware resources. This must be accomplished without violating data dependences among instructions, i.e. the program semantics. A data dependence from instruction A to instruction B means that A produces a value that must be used by B. A is referred to as a predecessor instruction or producer instruction and B is referred to as a successor instruction or consumer instruction.
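By way of illustration only, the two conditions described above (no two instructions in the same functional unit in the same cycle, and no consumer issued before its producer's value is available) can be checked with a minimal sketch such as the following. The names `is_valid_schedule`, `deps` and `latency`, and the per-instruction latency model, are assumptions introduced for this example and are not taken from any particular compiler.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: instruction name -> (functional unit, issue cycle).
Schedule = Dict[str, Tuple[int, int]]


def is_valid_schedule(
    schedule: Schedule,
    deps: List[Tuple[str, str]],
    latency: Dict[str, int],
) -> bool:
    """Return True if the schedule respects resources and data dependences.

    deps holds (producer, consumer) pairs; latency[p] is the assumed number
    of cycles before the value produced by p can be read.
    """
    # Resource check: a functional unit executes at most one instruction per cycle.
    used_slots = set()
    for slot in schedule.values():
        if slot in used_slots:
            return False
        used_slots.add(slot)

    # Dependence check: the consumer must issue after the producer's result is ready.
    for producer, consumer in deps:
        producer_cycle = schedule[producer][1]
        consumer_cycle = schedule[consumer][1]
        if consumer_cycle < producer_cycle + latency[producer]:
            return False
    return True
```

For example, with a single dependence from A to B and a latency of one cycle, placing A and B in different units in the same cycle is rejected, whereas placing B one cycle later is accepted.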
In a VLIW or other parallel processor the plurality of individual execution units can be used to provide a so-called software pipeline made up of a plurality of individual stages. The concept is similar to a hardware pipeline in that successive loop iterations start before previous ones have completed. However, each software pipeline stage has no fixed physical correspondence to particular execution units.
Instead, when a loop structure in an application program is compiled, the machine instructions which make up an individual iteration of the loop are scheduled for execution by the different execution units in accordance with a software pipeline schedule. This schedule is divided into successive stages, and the instructions are scheduled in such a way as to permit a plurality of iterations to be carried out in an overlapping manner by the different execution units, with a selected loop initiation interval (II) between the initiation of successive iterations. Thus, when a first stage of an iteration i terminates and that iteration enters a second stage, execution of the next iteration i+1 is initiated in a first stage of the iteration i+1. Instructions in the first stage of iteration i+1 are therefore executed in parallel with instructions in the second stage of iteration i, taking advantage of the available hardware parallelism.
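The overlap of iterations can be made concrete with a short sketch. It assumes, consistently with the description above, that each stage lasts exactly II cycles and that iteration i starts at cycle i × II; the function names `stages_in_flight` and `total_cycles` are hypothetical and introduced only for this illustration.

```python
from typing import List, Tuple


def stages_in_flight(
    cycle: int, ii: int, num_stages: int, num_iterations: int
) -> List[Tuple[int, int]]:
    """Return (iteration, stage) pairs that are executing at the given cycle.

    Assumes iteration i is initiated at cycle i * ii and that every stage
    occupies exactly ii cycles.
    """
    active = []
    for iteration in range(num_iterations):
        offset = cycle - iteration * ii  # cycles since this iteration started
        if 0 <= offset < num_stages * ii:
            active.append((iteration, offset // ii))
    return active


def total_cycles(ii: int, num_stages: int, num_iterations: int) -> int:
    """Cycles needed to run all iterations under the same assumptions."""
    return (num_iterations + num_stages - 1) * ii
```

With II = 2 and three stages, `stages_in_flight(2, 2, 3, 4)` reports iteration 0 in its second stage (stage index 1) alongside iteration 1 in its first stage (stage index 0), which is the overlap described above.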
Compiling methods for processors using software pipelining have to deal with machine resource constraints and data dependences among operations in the loop body. Generating optimal schedules of loops with arbitrary data dependence graphs is known to be an NP-complete (nondeterministic polynomial-time complete) problem. Because execution time is often dominated by loop structures, solutions to this scheduling problem are strongly desired.
In view of the complexity of the scheduling problem, most compiling methods capable of implementing software pipelining that are of practical use must rely on heuristics (learning by experimenting or "trial and error") to produce optimised solutions. One class of such compiling methods, referred to as modulo scheduling methods, targets innermost loops. A basic schedule of one single iteration is generated, which is issued at fixed intervals (the initiation interval (II)). The basic schedule is structured in order to preserve data dependences among operations, even if the initiation interval is much smaller than the basic schedule length. During the steady state a new iteration starts and another one finishes every II cycles. Further details of one modulo scheduling method are given in our co-pending United Kingdom patent application publication no. GB-A-2355094, the entire content of which is incorporated herein by reference.
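Because the basic schedule is reissued every II cycles, an operation placed at cycle c occupies its functional unit at every cycle congruent to c modulo II. Modulo schedulers commonly track this with a modulo reservation table. The sketch below illustrates that general bookkeeping technique only; it is not asserted to be the method of GB-A-2355094, and the class and method names are hypothetical.

```python
from typing import List, Optional


class ModuloReservationTable:
    """Tracks functional-unit usage modulo the initiation interval (II)."""

    def __init__(self, ii: int, num_units: int) -> None:
        self.ii = ii
        # One row per cycle of the steady-state kernel, one column per unit.
        self.table: List[List[Optional[str]]] = [
            [None] * num_units for _ in range(ii)
        ]

    def try_place(self, op: str, cycle: int, unit: int) -> bool:
        """Reserve the unit for op at the given cycle if the modulo slot is free."""
        row = cycle % self.ii
        if self.table[row][unit] is not None:
            return False  # another operation already uses this unit in this slot
        self.table[row][unit] = op
        return True
```

With II = 2 and a single unit, an operation placed at cycle 0 blocks cycle 2 as well (both map to row 0), whereas cycle 3 maps to row 1 and remains available.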
Most of the known modulo scheduling methods are intended to compile for a target processor which has a number of functional units connected to a single register file and in which the functional units have unconstrained connectivity, i.e. all of the data stored in the register file is accessible by each of the functional units. However, some processors adopt a so-called clustered architecture in which a register file is partitioned into separate register files around which small groups of functional units are clustered. The present applicant's Opus architecture is one example of such a clustered architecture. The partitioning of register files enables the clustered architecture to support scalable instruction-level parallelism by implementing processor cores with differing numbers of clusters. In the Opus architecture each cluster contains two functional units (execution units), both connected to local register files. Data communication among clusters is only possible between adjacent clusters, which creates a bi-directional ring of clusters.
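The adjacency restriction of a bi-directional ring can be expressed compactly. The following sketch assumes clusters are numbered 0 to N-1 around the ring; that numbering and the function names are assumptions made for illustration and are not part of the Opus architecture definition.

```python
from typing import Set


def neighbours(cluster: int, num_clusters: int) -> Set[int]:
    """Clusters adjacent to the given cluster on a bi-directional ring."""
    return {(cluster - 1) % num_clusters, (cluster + 1) % num_clusters}


def can_communicate_directly(a: int, b: int, num_clusters: int) -> bool:
    """True if a and b are the same cluster or adjacent clusters on the ring."""
    return a == b or b in neighbours(a, num_clusters)
```

In a ring of four clusters, cluster 0 is adjacent to clusters 1 and 3 but not to cluster 2.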
A preferred compiling method for such a clustered processor is a distributed modulo scheduling (DMS) method which is capable of dealing with distributed functional units and register files. A key feature of DMS methods is the ability to perform in a single phase both scheduling of operations and data distribution among clusters and register files. Further information on DMS methods can be found, for example, in "Distributed modulo scheduling", M Fernandes, J Llosa, N Topham, 5th International Symposium on High Performance Computer Architecture, Orlando, USA, January 1999 and "A clustered VLIW architecture based on queue register files", M Fernandes, PhD thesis, Edinburgh University, January 1999.
The input to a DMS method is a data dependence graph (DDG) which represents the data dependences among the VLIW instructions of the innermost loop body and surrounding code. The loop body and surrounding code together form a VLIW section. Operations are scheduled based on the availability of machine resources and on data dependences with direct predecessors in the data dependence graph. A communication constraint occurs when the output operand produced by a predecessor instruction (producer instruction) cannot be read directly by a successor instruction (consumer instruction) being scheduled. One cause of such a communication constraint is the target processor architecture, for example the clustered organisation with its distributed register files.
Although the data distribution mechanisms embodied in DMS methods always try to avoid communication constraints, this is not always possible. When a communication constraint arises, there is a gap between the producer and consumer instructions concerned. In that case, the method must schedule one or more additional "move" instructions to bring the required data to a register file where it can be read by the consumer instruction being scheduled. These one or more move instructions are referred to as an availability chain.
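As a rough illustration of how an availability chain might be derived on a bi-directional ring, the sketch below lists the moves needed to carry a value from the producer's cluster towards the consumer's cluster along the shorter direction. It assumes clusters numbered 0 to N-1, assumes that a consumer can read a value held in its own or an adjacent cluster (so the chain stops one hop short of the consumer), and uses a hypothetical function name; it is not the chain construction of any specific DMS method.

```python
from typing import List, Tuple


def availability_chain(
    producer: int, consumer: int, num_clusters: int
) -> List[Tuple[int, int]]:
    """Return the (source, destination) cluster pair of each move instruction."""
    forward = (consumer - producer) % num_clusters   # hops going "up" the ring
    backward = (producer - consumer) % num_clusters  # hops going "down" the ring
    step = 1 if forward <= backward else -1
    distance = min(forward, backward)

    moves = []
    current = producer
    # Assumption: adjacent clusters communicate directly, so a gap of
    # `distance` hops needs distance - 1 moves (none if already adjacent).
    for _ in range(max(0, distance - 1)):
        nxt = (current + step) % num_clusters
        moves.append((current, nxt))
        current = nxt
    return moves
```

In a ring of eight clusters, a producer in cluster 0 and a consumer in cluster 3 would, under these assumptions, require two moves (0 to 1, then 1 to 2), each of which occupies an issue slot that could otherwise execute an instruction of the original program.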
Although an availability chain can overcome a communication constraint, it may also compromise the overall efficiency of the resulting schedule, as inevitably hardware resources are used to execute the move instructions, rather than to execute instructions from the original application program.
Accordingly, it is desirable to provide compiling methods capable of dealing efficiently with communication constraints that can arise in relation to processors with distributed functional units and/or register files.