Microprocessors are general-purpose processors that provide high instruction throughputs in order to execute software running thereon, and can have a wide range of processing requirements depending on the particular software applications involved. Many different types of processors are known, of which microprocessors are but one example. For example, Digital Signal Processors (DSPs) are widely used, in particular for specific applications, such as mobile processing applications. DSPs are typically configured to optimize the performance of the applications concerned and to achieve this they employ more specialized execution units and instruction sets. Particularly, but not exclusively, in applications such as mobile telecommunications, it is desirable to provide ever-increasing DSP performance while keeping power consumption as low as possible.
VLIW processors, capable of executing multiple instructions per cycle, are designed to exploit instruction-level parallelism (ILP). In order to take advantage of existing ILP, these processors have both a large number of registers and a large number of functional units. Clock cycle constraints make it impossible to have a unified architecture with full connectivity between the register file and all functional units. Hence, these architectures usually have split register files where the register file is split into two or more register files, each of which is connected to a set of functional units. These register files in conjunction with their functional units are generally referred to as “clusters”.
Compilers for processors with VLIW architectures generally use software pipelining to obtain good performance from loops. These architectures are typically used for image processing, and other mathematically intensive DSP applications. On average, approximately 90% of the execution time of these applications is spent executing loops. Hence, a lot of optimization effort is aimed at improving loop performance.
On a typical multi-cluster system, instructions are usually explicitly assigned to clusters by a compiler implementing one or more cluster assignment algorithms. The goal of these cluster assignment algorithms is to assign instructions to clusters such that ILP is maximized and cross-cluster communication is minimized. There are various existing cluster assignment algorithms, the classical one being the Bottom-Up Greedy algorithm (BUG). These algorithms typically run either before or in parallel with scheduling and register allocation.
Cluster assignment algorithms generally operate on a data dependence graph (DDG) which represents the flow of data between instructions in the body of a loop. Each node of the graph represents one instruction. Each directed edge represents the flow of data from one instruction to the next. The source node defines the data used at the sink node. Data can be either register values or memory values. The graph can contain both forward and backward edges. Forward edges represent intra-iteration dependencies. Backward edges represent inter-iteration dependencies, where values that are defined during one iteration are then used during a subsequent iteration.
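The graph described above can be sketched as a small data structure. This is only an illustrative sketch: the names `Edge`, `DDG`, `kind` and `distance` are assumptions introduced here, and the iteration distance (0 for a forward edge, greater than 0 for a backward edge) is one common way of telling the two edge types apart, not necessarily the representation used by any particular compiler.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    src: str            # node that defines the value
    dst: str            # node that uses the value
    kind: str = "reg"   # "reg" for register values, "mem" for memory values
    distance: int = 0   # iterations crossed: 0 = forward, >0 = backward


@dataclass
class DDG:
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_edge(self, src, dst, kind="reg", distance=0):
        # Nodes are created implicitly the first time they are mentioned.
        self.nodes.update((src, dst))
        self.edges.append(Edge(src, dst, kind, distance))

    def forward_edges(self):
        # Intra-iteration dependencies.
        return [e for e in self.edges if e.distance == 0]

    def backward_edges(self):
        # Inter-iteration dependencies.
        return [e for e in self.edges if e.distance > 0]


# Hypothetical loop body: acc += a[i] * b[i]
g = DDG()
g.add_edge("load_a", "mul")
g.add_edge("load_b", "mul")
g.add_edge("mul", "add")
g.add_edge("add", "add", distance=1)  # acc from the previous iteration
```

In this example the accumulator update feeds itself in the next iteration, so the graph has three forward edges and one backward edge.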
The edges represent dependence partitioning constraints. If the source of a register edge is assigned to a cluster different than the sink of that edge, then data must be moved between clusters. The edges also represent scheduling constraints. The source node must be scheduled a certain number of cycles before the sink node, known as the minimum latency requirement.
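Both kinds of constraint can be checked mechanically. The sketch below is illustrative only; the function names and the tuple layout `(src, dst, kind, latency)` are assumptions made for this example. It assumes that a value needs at most one move per remote cluster that consumes it, and it checks latencies on forward edges only.

```python
def cross_cluster_moves(edges, cluster_of):
    """Count values that must be moved between clusters.

    Only register edges are considered. A value consumed several times on
    the same remote cluster is counted once (an assumption of this sketch).
    """
    moves = set()
    for src, dst, kind, _latency in edges:
        if kind == "reg" and cluster_of[src] != cluster_of[dst]:
            moves.add((src, cluster_of[dst]))
    return len(moves)


def latency_violations(edges, cycle_of):
    """Return forward edges whose sink is scheduled too soon after its source."""
    return [
        (src, dst)
        for src, dst, _kind, latency in edges
        if cycle_of[dst] - cycle_of[src] < latency
    ]


# Hypothetical edges: (source, sink, kind, minimum latency in cycles)
edges = [
    ("ld", "mul", "reg", 4),
    ("mul", "st", "reg", 2),
    ("st", "ld2", "mem", 1),
]
cluster_of = {"ld": 0, "mul": 1, "st": 1, "ld2": 0}
cycle_of = {"ld": 0, "mul": 4, "st": 5, "ld2": 6}

num_moves = cross_cluster_moves(edges, cluster_of)
violations = latency_violations(edges, cycle_of)
```

Here only the value produced by `ld` crosses clusters through a register edge, and the `mul` to `st` edge is scheduled one cycle too tightly.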
Each node in a DDG has an associated e_cycle, l_cycle and slack range. These are computed as follows. Ignoring back-edges and assuming infinite resources, the earliest cycle on which a node may be scheduled is known as its e_cycle. The latest cycle on which a node may be scheduled and still generate the shortest possible schedule is known as the l_cycle. The slack range for a node is defined as l_cycle - e_cycle.
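These definitions amount to a forward and a backward longest-path pass over the forward edges. The following is a minimal sketch under stated assumptions: the function name `slack_ranges` and the edge layout `(src, dst, latency)` are hypothetical, back-edges are assumed to have been removed already, and the shortest possible schedule is taken to end at the largest e_cycle.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def slack_ranges(nodes, forward_edges):
    """Return {node: (e_cycle, l_cycle, slack)} for an acyclic graph."""
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for src, dst, latency in forward_edges:
        preds[dst].append((src, latency))
        succs[src].append((dst, latency))

    # Predecessors come before their successors in this order.
    order = list(
        TopologicalSorter({n: [p for p, _ in preds[n]] for n in nodes}).static_order()
    )

    # Forward pass: earliest cycle allowed by the latencies of all predecessors.
    e_cycle = {}
    for n in order:
        e_cycle[n] = max((e_cycle[p] + lat for p, lat in preds[n]), default=0)

    # Backward pass: latest cycle that does not delay the final node.
    last = max(e_cycle.values(), default=0)
    l_cycle = {}
    for n in reversed(order):
        l_cycle[n] = min((l_cycle[s] - lat for s, lat in succs[n]), default=last)

    return {n: (e_cycle[n], l_cycle[n], l_cycle[n] - e_cycle[n]) for n in nodes}


# Hypothetical loop body with a critical path ld -> mul -> add
nodes = ["load_a", "load_b", "mul", "add", "inc", "cmp"]
forward_edges = [
    ("load_a", "mul", 4),
    ("load_b", "mul", 4),
    ("mul", "add", 2),
    ("inc", "cmp", 1),
]
result = slack_ranges(nodes, forward_edges)
```

Nodes on the critical path have a slack of zero, while `inc` and `cmp` can each slide by five cycles without lengthening the schedule.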
Prior art cluster assignment algorithms generally work quite well but there are sometimes performance inefficiencies when they are applied to unrolled loops. Loops are usually unrolled so that the unroll factor (number of copies of the loop body) is a multiple of the number of clusters. In essence, there should be a natural mapping of instructions to clusters. However, in some cases, the resulting cluster assignment using these prior art algorithms does not adhere to this natural mapping and an unnecessarily high amount of cross-cluster communication results.
Cluster assignment is usually performed before scheduling and register allocation. The goal of cluster assignment for unrolled loops is to generate a partition with maximum flexibility and minimum resource requirements, so that a minimum of extra constraints are imposed on the scheduler and register allocator. This is generally achieved when:
    Functional unit usage is balanced
    Opportunities for parallelism are maximized
    Cross-cluster transfers are minimized
    Register copies are minimized
    New instructions (e.g., cross-cluster moves) are minimized
Note that minimizing recurrence constraints does not appear on the list. The reason is that unrolled loops are typically not recurrence-bound. Thus, pushing out recurrence bounds is not a primary concern. Hence, general cluster assignment algorithms, which prioritize nodes involved in recurrences, are not tailored for unrolled loops.
It is not always possible to assign instructions evenly across all functional units. However, for loops that are unrolled by a multiple of the number of clusters, it may be possible to achieve a nearly even balance across functional units of a given class across clusters. Even when a functional unit class is not a limited resource, there is more scheduling flexibility and more balanced register usage when the load is balanced evenly across all functional units, not just the bottlenecked ones. This allows maximum flexibility to schedule around dependence constraints and other resources that are in short supply. The lack of consideration of functional unit classes is a serious limitation of some prior art cluster-assignment algorithms.
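The difference between balancing total load and balancing load per functional unit class can be illustrated with a simple metric. This is a sketch only: the function name `unit_class_imbalance`, the class labels `"mem"` and `"mpy"`, and the choice of max-minus-min as the imbalance measure are all assumptions made for this example.

```python
from collections import Counter


def unit_class_imbalance(unit_class_of, cluster_of, num_clusters):
    """For each functional unit class, return max load minus min load across clusters."""
    load = Counter((unit_class_of[n], cluster_of[n]) for n in cluster_of)
    classes = {unit_class_of[n] for n in cluster_of}
    result = {}
    for c in classes:
        per_cluster = [load[(c, k)] for k in range(num_clusters)]
        result[c] = max(per_cluster) - min(per_cluster)
    return result


# Hypothetical loop unrolled twice: one load and one multiply per copy.
unit_class_of = {"ld_0": "mem", "mul_0": "mpy", "ld_1": "mem", "mul_1": "mpy"}

# One loop body copy per cluster.
by_copy = {"ld_0": 0, "mul_0": 0, "ld_1": 1, "mul_1": 1}
# Same instruction count per cluster, but each class lands on one cluster.
by_class = {"ld_0": 0, "ld_1": 0, "mul_0": 1, "mul_1": 1}

balanced = unit_class_imbalance(unit_class_of, by_copy, 2)
skewed = unit_class_imbalance(unit_class_of, by_class, 2)
```

Both assignments place two instructions on each cluster, yet only the first spreads each functional unit class evenly; the second leaves one cluster's memory units and the other cluster's multipliers idle.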
In a DDG, if there is no dependence between two nodes, the corresponding instructions may be executed in parallel. Traditional scheduling algorithms only exploit intra-iteration parallelism. Software pipelining, the preferred approach for scheduling loops on VLIW processors, exploits both intra- and inter-iteration parallelism. When dependence constraints permit, software pipelining schedulers exploit the ILP that is available across loop iterations.
Most cluster assignment algorithms put together quick-and-dirty straight-line schedules to determine which nodes are candidates for parallelization. If the instructions are in parallel in the trial schedule, the cluster assignment algorithm tries to assign them to different clusters. Otherwise, the algorithm assumes that there is no benefit to scheduling the instructions in parallel. This approach is very limiting. First, if the loop is going to be software-pipelined, the trial straight-line schedule may be very different from the final software-pipelined schedule. Second, the introduction of cross-cluster communication can shift the set of instructions that might best be scheduled in parallel.
It should be straightforward to find parallelism in unrolled loops. In theory, if a cluster assignment algorithm can identify the loop body copy to which an instruction belongs, it can simply map loop body copies to different clusters. In practice, this entails marking instructions when loops are unrolled and maintaining these markings across intervening optimizations. Marking instructions from unrolled loops has two drawbacks. First and foremost, marking does not handle manually unrolled loops. Second, it entails significant bookkeeping since all intervening optimizations must maintain these markings.
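The marking approach can be sketched as follows. This is a hypothetical illustration, not a description of any particular compiler: the function names, the `(name, copy)` marking format and the round-robin mapping `copy % num_clusters` are assumptions. It also presumes the markings survive intact until cluster assignment, which is exactly the bookkeeping burden noted above.

```python
def unroll_with_marks(body, factor):
    """Unroll a loop body, tagging each instruction with its copy index."""
    return [(f"{op}_{copy}", copy) for copy in range(factor) for op in body]


def assign_by_copy(marked, num_clusters):
    """Map each loop body copy to a cluster, round-robin."""
    return {name: copy % num_clusters for name, copy in marked}


# Hypothetical three-instruction body unrolled by 4 for a 2-cluster machine.
marked = unroll_with_marks(["ld", "mul", "st"], 4)
assignment = assign_by_copy(marked, 2)

# Number of instructions assigned to each cluster.
per_cluster = [
    sum(1 for c in assignment.values() if c == k) for k in range(2)
]
```

Because the unroll factor is a multiple of the number of clusters, each cluster receives the same number of copies, and every instruction stays on the same cluster as the rest of its copy.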
Most cluster assignment algorithms balance resources locally “on-the-fly”, based on the assignments to nearest neighbors in the data dependence graph. Using this approach, functional unit usage may be balanced but cross-cluster transfers may be unnecessarily high. BUG, for example, which balances resources locally using a depth-first approach, can yield a checkerboard pattern when partitioning graphs from unrolled loops.