Continuing advances in technology combined with dropping production costs have led to a proliferation of electronic devices that incorporate or use advanced digital circuits, including desktop computers, laptop computers, hand-held devices such as Personal Digital Assistants (PDAs), hand-held computers, cellular telephones, printers, digital cameras, facsimile machines and other electronic devices. These digital circuits execute the applications or algorithms required to provide the functionality of the electronic device. It is desirable for these digital circuits to have high performance with minimal cost. The cost of a circuit is typically measured in terms of its silicon area and is often estimated from the number of components (e.g., functional units, registers, wires, etc.) in the circuit. The performance of a circuit can be expressed as a combination of several metrics: throughput (i.e., number of tasks executed per clock cycle), latency (i.e., number of clock cycles to complete a single task), and clock speed.
The process of mapping an application or algorithm to digital circuit hardware involves several steps. One of these steps is that of scheduling, i.e., assigning activities to occur at specific points in time. Since the performance of many applications is dominated by the performance of loop nests that may be present in the application code or algorithm, the step of loop iteration scheduling is of particular importance. A loop is an iteration of an expression or expressions for a range of values. A loop nest is a set of loops, each one successively nested within another. Alternatively, a nested loop refers to a program in a high level language such as C, Java, Pascal, etc. that has an “n-deep” loop nest, where n is an integer. In other words, for a 2-deep nested loop, a first loop is nested within a second loop.
Loop iteration scheduling is the assignment of start times for each iteration of the loop nest to specific clock cycles. This step is performed with the objective that the resulting hardware must execute the loop nest at the desired performance or that the resulting hardware must execute the loop nest with maximal performance. Additionally, it might be desirable to minimize the cost of the resulting hardware. The performance of a loop nest is determined by its throughput, i.e., the number of loop iterations started per unit time. Throughput is expressed as the reciprocal of II×T, where the Initiation Interval (II) is defined as the number of clock cycles between successive starts of loop iterations, and T is the clock period.
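The throughput relationship above can be illustrated with a minimal Python sketch. The function name loop_throughput and the sample values (II of 3 clock cycles, T of 2.0 nanoseconds) are assumptions chosen only for illustration.

```python
def loop_throughput(ii: int, clock_period_ns: float) -> float:
    """Return loop iterations started per nanosecond, i.e. 1 / (II x T).

    ii              -- Initiation Interval in clock cycles (hypothetical name)
    clock_period_ns -- clock period T in nanoseconds (hypothetical name)
    """
    if ii <= 0 or clock_period_ns <= 0:
        raise ValueError("II and T must both be positive")
    return 1.0 / (ii * clock_period_ns)


# Illustrative values only: II = 3 cycles, T = 2.0 ns.
rate = loop_throughput(3, 2.0)
print(f"{rate:.4f} iterations per ns")  # prints "0.1667 iterations per ns"
```

With these values a new loop iteration starts every 6 nanoseconds; lowering either II or T raises the throughput.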
To meet these objectives in loop iteration scheduling, typically, a set of candidate iteration schedules is generated and then evaluated for validity. Additionally, the candidates may also be evaluated for cost, if needed. Validity of a candidate loop iteration schedule implies that it is possible to satisfy all data dependencies and timing constraints when the loop nest is scheduled according to the candidate loop iteration schedule with a given II and at a given T. Recurrence cycles are caused when there is a data flow dependence from a program operation to itself in a succeeding loop iteration. For data dependencies to be satisfied around recurrence cycles in loops, the following set of inequalities must be satisfied:

Delay(C1)<=II×Distance(C1)
Delay(C2)<=II×Distance(C2)
. . .
Delay(CN)<=II×Distance(CN)

where there are “N” recurrence cycles C1, C2, . . . , CN in the dependence graph; Delay(Ci) is the total latency around the recurrence cycle Ci; and Distance(Ci) is the sum of the omegas of each dependence edge along the recurrence cycle Ci. The latency around a recurrence cycle is the number of clock periods it takes for the dependencies to travel around the recurrence cycle, and the omega of a dependence edge is the loop iteration separation, as given by the candidate iteration schedule, between the producer and the consumer operations in that data flow.
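The validity test expressed by these inequalities can be sketched in Python as follows. This is a minimal illustration, assuming each recurrence cycle is already summarized as a (delay, distance) pair in clock cycles; the function name schedule_is_valid and that data layout are hypothetical.

```python
from typing import Iterable, Tuple


def schedule_is_valid(cycles: Iterable[Tuple[int, int]], ii: int) -> bool:
    """Check Delay(Ci) <= II x Distance(Ci) for every recurrence cycle Ci.

    cycles -- (delay, distance) pairs, one per recurrence cycle (assumed layout)
    ii     -- Initiation Interval in clock cycles
    """
    return all(delay <= ii * distance for delay, distance in cycles)


# Hypothetical cycles: delays of 3 and 6 cycles with distances of 1 and 2.
print(schedule_is_valid([(3, 1), (6, 2)], ii=3))  # True: 3 <= 3 and 6 <= 6
print(schedule_is_valid([(4, 1)], ii=3))          # False: 4 > 3
```

A single violated inequality is enough to reject the candidate schedule, which is why the check uses all().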
In this context, it is desirable that the total latency around each recurrence cycle be small, so that a candidate loop iteration schedule is validated for the given II and T. Prior approaches use operation latencies expressed as integer multiples of clock cycles. The recurrence cycle latencies computed by these approaches are conservative, thereby leading to pessimistically rejecting some candidate iteration schedules as invalid. This is illustrated by the example in FIGS. 1A–1C.
FIG. 1A is a code fragment representing a nested loop. Code fragment 100 includes an outer loop 101, an inner loop 102, and a statement 103. FIG. 1B is a loop dependence graph corresponding to the nested loop code fragment of FIG. 1A. In FIG. 1B, inter-iteration dependence edges are annotated with iteration distance vectors derived from the source code for the loop expressed in sequential semantics. The dependence graph of FIG. 1B has two recurrence cycles: recurrence cycle C1 consisting of operation 104->edge 106->operation 105->edge 107; and recurrence cycle C2 consisting of operation 104->edge 106->operation 105->edge 108.
As an example, the required performance may dictate an II equal to 3, and T equal to 2.0 nanoseconds. For a candidate iteration scheduling vector λ equal to [100 1]^T, the corresponding iteration scheduling wavefront is shown in the iteration space of FIG. 1C. The omega of an edge annotated with iteration distance vector d is the dot product λ^T·d. Using this relationship, the omega of edge 107 in the iteration schedule given by λ is [100 1]·[0 1]^T=1. Similarly, the omega of edge 108 in the iteration schedule given by λ is [100 1]·[1 −99]^T=100−99=1. The omegas of all other edges are 0 because their iteration distance vectors d equal [0 0]^T. The distance associated with recurrence cycle C1, i.e., Distance(C1), is given by the sum of the omegas along its edges (i.e., edges 106 and 107 in FIG. 1B); therefore it is 0+1=1. Similarly, the distance associated with recurrence cycle C2, i.e., Distance(C2), is the sum of the omegas of edges 106 and 108, which is 0+1=1. Prior approaches use operation latencies expressed as integer multiples of clock cycles. For example, the latency for the multiplication operation (*) may be 3 clock cycles and the latency for the addition operation (+) may be 1 clock cycle. For the recurrence cycle C1, the delay is 3+1=4 clock cycles. Applying the inequality Delay(C1)<=II×Distance(C1) gives 4−3×1<=0, i.e., 1<=0, which is not satisfied. Similarly, for the recurrence cycle C2, the delay is 3+1=4 clock cycles. Applying the inequality Delay(C2)<=II×Distance(C2) gives 4−3×1<=0, which is also not satisfied. Therefore, when operation latencies are expressed as integer multiples of clock cycles, the candidate iteration scheduling vector λ equal to [100 1]^T is found not to be valid.
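The arithmetic of this example can be reproduced with a short Python sketch. The edge numbers, distance vectors, latencies, and II follow the text; the names omega, edges, cycles, and valid, and the dictionary layout, are illustrative assumptions rather than part of the described approach.

```python
def omega(lam, d):
    """Dot product of scheduling vector lam with iteration distance vector d."""
    return sum(l * x for l, x in zip(lam, d))


lam = (100, 1)  # candidate iteration scheduling vector from the example
II = 3          # Initiation Interval required by the example

# Iteration distance vectors per dependence edge, as described for FIG. 1B.
edges = {106: (0, 0), 107: (0, 1), 108: (1, -99)}
omegas = {edge: omega(lam, d) for edge, d in edges.items()}

# Each recurrence cycle listed by the dependence edges it traverses.
cycles = {"C1": (106, 107), "C2": (106, 108)}

# Integer latencies in clock cycles: multiplication 3, addition 1.
delay = 3 + 1

valid = {}
for name, cycle_edges in cycles.items():
    distance = sum(omegas[e] for e in cycle_edges)
    valid[name] = delay <= II * distance

print(omegas)  # {106: 0, 107: 1, 108: 1}
print(valid)   # {'C1': False, 'C2': False}
```

Both cycles fail the test because a delay of 4 clock cycles exceeds II×Distance of 3, matching the conclusion that this candidate is rejected when latencies are rounded to whole clock cycles.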