The performance of data processing systems is typically measured by the system's throughput as well as by the required size of the program and data memory. Those skilled in the art have recognized that certain types of processing systems, such as those specialized for digital signal processing (DSP) applications, can better utilize their computational resources if they are programmed using a technique called synchronous data flow (SDF) programming. One background reference to such techniques is a paper authored by E. A. Lee and D. G. Messerschmitt titled "Static Scheduling of Synchronous Data Flow Programs for Digital Signal Processing," IEEE Transactions on Computers, Vol. C-36, pp. 24-35 (1987).
Using this technique, the system is represented by blocks or nodes, each of which represents a particular system function or actor. Depending upon the complexity of an actor, it could be itself represented by a number of sub-blocks. Each block or node has a segment of program code associated with it which when executed implements the function in the system. Each block can execute (i.e. fire) at any time provided its requisite input data is available.
Thus, a block is a function that is invoked when there is sufficient input data available with which to perform a computation. Blocks that have no inputs can be invoked at any time. Each time a block is invoked it consumes a fixed number of data samples and will produce a fixed number of data samples. A block that has no inputs consumes zero data samples. A block is synchronous if the number of input samples it consumes and the number of output samples produced can be specified a priori each time the block is invoked. A synchronous block A can be represented as shown in FIG. 1a, including a number associated with each input or output to specify the number of input data samples consumed and the number of output data samples produced for each output, each time the block is invoked. A synchronous data flow (SDF) graph is a network of synchronous blocks, as illustrated in FIG. 1b. The arcs between blocks indicate the flow of data between blocks, and can also represent buffer memory in which such data must usually be stored until the block which consumes the data is actually invoked.
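The firing rule described above can be sketched in a few lines of Python. This is only an illustration: the `Block` class, its field names, and the sample rates used below are assumptions made for the example and are not taken from FIG. 1a or FIG. 1b.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """A synchronous block with fixed sample counts per invocation (hypothetical names)."""
    name: str
    consumes: dict = field(default_factory=dict)  # input arc -> samples consumed per firing
    produces: dict = field(default_factory=dict)  # output arc -> samples produced per firing

    def can_fire(self, buffers):
        # A block with no inputs consumes zero samples, so it can always fire.
        return all(buffers.get(arc, 0) >= n for arc, n in self.consumes.items())

    def fire(self, buffers):
        if not self.can_fire(buffers):
            raise RuntimeError(f"{self.name}: insufficient input data")
        for arc, n in self.consumes.items():
            buffers[arc] -= n
        for arc, n in self.produces.items():
            buffers[arc] = buffers.get(arc, 0) + n


# Assumed rates, for illustration only: A produces 2 samples, B consumes 3.
source = Block("A", produces={"a": 2})
sink = Block("B", consumes={"a": 3})
buffers = {"a": 0}

source.fire(buffers)
fired_early = sink.can_fire(buffers)  # only 2 samples buffered, B needs 3
source.fire(buffers)
sink.fire(buffers)                    # 4 samples buffered, B consumes 3
```

With these assumed rates, B cannot fire after a single invocation of A, and one sample remains on the arc after B fires.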
FIG. 1c illustrates an SDF graph which indicates the existence of delays between the nodes. Delays used in the signal processing context indicate that there is an offset between the input and output block. The unit delay on the arc between Block A and Block B means that the n-th sample consumed by B is the (n-1)-th sample produced by A; the first sample consumed by B is therefore not produced by the source Block A, but is rather part of the initial state of the arc buffer. Thus, Block B can be invoked once before Block A is ever invoked, and the delay thereby affects the way the system starts up.
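The start-up effect of a unit delay can be illustrated by treating the delay as one sample already present in the arc buffer. The sketch below assumes, purely for illustration, that A produces one sample and B consumes one sample per invocation; the function name `order_is_valid` is hypothetical.

```python
def order_is_valid(initial_samples, order):
    """Replay a firing order on a single arc A -> B.

    Assumes one sample is produced per firing of A and one sample is
    consumed per firing of B (illustrative rates only).
    """
    samples = initial_samples
    for block in order:
        if block == "A":
            samples += 1
        elif block == "B":
            if samples < 1:
                return False  # B has no input data available
            samples -= 1
    return True


# With a unit delay, the arc starts with one sample, so B may fire first.
with_delay = order_is_valid(1, ["B", "A", "B"])
# Without the delay, the same order fails at the first firing of B.
without_delay = order_is_valid(0, ["B", "A", "B"])
```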
From the SDF graph representing a system, an execution order for the actors can be derived. From this schedule, code can be generated for the program from a library of code segments which corresponds to every potential type of actor or block in a system. The code segments are compiled such that the actors or functional blocks are invoked in an order which is consistent with the SDF graph for the system.
In FIG. 2a, an SDF graph is shown which represents a system having three functions or Actors, X, Y and Z. Arcs a and b illustrate the direction of the flow of data between the actors. In FIG. 2b, an execution schedule is defined which is consistent with the graph. X must be invoked first, because Y must have at least one sample of data from X before it may be invoked, and Z must have at least one data sample from Y before it may be invoked. Thus, a program according to the schedule of FIG. 2b would run the software segment for X, then four iterations of Y and finally twelve iterations of Z. At the end of the schedule, all samples of data produced have also been consumed.
There are other schedules which may be derived from the SDF graph in FIG. 2a which are better in terms of the data memory required to buffer the data between actors. FIG. 2c is a table which describes the use of buffer memory represented by the arcs a and b for each block invocation of the schedule in FIG. 2b. It can be seen from the table in FIG. 2c that the total buffered data reaches a maximum of twelve samples at the fourth invocation of Y. This schedule is the maximum buffer length schedule for the system.
A second possible schedule which can be derived from the SDF graph of FIG. 2a is shown in FIG. 3a. A table describing the use of buffer memory for this schedule is illustrated in FIG. 3b. It can be seen from this table that the data memory requirement for the schedule of FIG. 3a is half of that for FIG. 2b. Another advantage of the schedule in FIG. 3a is the improved latency characteristics of the system. Block Z begins producing data samples sooner in the schedule of FIG. 3a because Block or Actor Z is invoked sooner in that schedule.
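The buffer accounting behind such tables can be sketched as a simple replay of a schedule. The figures are not reproduced here, so the rates below are assumptions chosen to agree with the text (X invoked once, Y four times, Z twelve times, with a peak of twelve buffered samples): X produces 4 samples on arc a, Y consumes 1 from a and produces 3 on b, and Z consumes 1 from b. The interleaved ordering is likewise only one ordering consistent with a halved requirement, not necessarily the schedule of FIG. 3a.

```python
# Assumed rates (not taken from the figures), chosen to agree with the text.
RATES = {
    "X": {"in": {}, "out": {"a": 4}},
    "Y": {"in": {"a": 1}, "out": {"b": 3}},
    "Z": {"in": {"b": 1}, "out": {}},
}


def peak_buffered(schedule):
    """Replay a flat schedule; return (peak total samples buffered, final buffers)."""
    buffers = {"a": 0, "b": 0}
    peak = 0
    for actor in schedule:
        for arc, n in RATES[actor]["in"].items():
            if buffers[arc] < n:
                raise ValueError(f"{actor} invoked without enough data on arc {arc}")
            buffers[arc] -= n
        for arc, n in RATES[actor]["out"].items():
            buffers[arc] += n
        # Total is sampled after each invocation, as in a per-invocation table.
        peak = max(peak, sum(buffers.values()))
    return peak, buffers


flat = ["X"] + ["Y"] * 4 + ["Z"] * 12               # X, four Y, twelve Z
interleaved = ["X"] + (["Y"] + ["Z"] * 3) * 4       # assumed ordering: Z runs after each Y

flat_peak, flat_final = peak_buffered(flat)
inter_peak, inter_final = peak_buffered(interleaved)
```

Under these assumptions the flat schedule peaks at twelve buffered samples and the interleaved ordering at six, and both leave the arcs empty at the end of the schedule.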
Schedules derived from SDF graphs can also be optimized to minimize the amount of program code necessary to implement the system, thereby reducing the amount of memory necessary to store the system program. This can be accomplished by creating loop constructs (e.g. "do-while") wherever there are iterative invocations of an actor. Thus, for each grouping of repetitive invocations of an actor, only one copy of the code segment associated with that actor is required, plus the small amount of code overhead for setting up and testing the loop. The schedule of FIG. 2b can be written as X(4Y)(12Z), which is referred to as a looped schedule and for which each parenthesized subschedule is known as a schedule loop. Sample target code for this looped schedule is illustrated in FIG. 4 using the "C" programming language structure. FIG. 4 illustrates the compaction obtained in looping and the comparatively minor overhead necessary to implement the looping.
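The compaction can be made concrete by representing a looped schedule as nested data and counting how many actor code segments appear in the program text. This is a minimal sketch, not the code of FIG. 4: the nested-list representation and the helper names `expand` and `code_copies` are assumptions introduced for illustration.

```python
def expand(schedule):
    """Flatten a looped schedule into the sequence of actor invocations.

    Each item is either an actor name or a (count, body) pair that
    stands for a schedule loop.
    """
    flat = []
    for item in schedule:
        if isinstance(item, tuple):
            count, body = item
            flat.extend(expand(body) * count)
        else:
            flat.append(item)
    return flat


def code_copies(schedule):
    """Count the actor code segments that must appear in the program text."""
    total = 0
    for item in schedule:
        if isinstance(item, tuple):
            total += code_copies(item[1])  # a loop body is stored only once
        else:
            total += 1
    return total


looped = ["X", (4, ["Y"]), (12, ["Z"])]  # X(4Y)(12Z)
flat = expand(looped)

inline_copies = code_copies(flat)    # every invocation written out inline
looped_copies = code_copies(looped)  # one copy per actor, plus loop overhead
```

For X(4Y)(12Z), the inline form needs seventeen code segments while the looped form needs three, at the cost of the loop set-up and test code.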
It can be seen from the previous discussion that where it is important to minimize the amount of memory necessary to buffer data and to store a system's program code, there is a need to optimize schedules derived from SDF graphs. The more iterative the nature of the system, the more opportunity there is for looping in the target program code. Systems designed specifically for DSP applications are particularly amenable to looping. Further, such systems are typically implemented as integrated circuits, which makes the motivation for minimizing program memory size very strong.
A method for deriving looped schedules from SDF graphs was proposed by Shuvra S. Bhattacharyya in a Master of Science research project paper, submitted on May 9, 1991 to the Department of Electrical Engineering and Computer Sciences at the University of California at Berkeley. The paper is entitled "Scheduling Synchronous Data Flow Graphs for Efficient Iteration."
The method disclosed by Bhattacharyya is a technique for hierarchically clustering the actors of an SDF graph to expose opportunities for looping. This method forms clusters or supernodes with two actors at a time. A cluster is a group of connected actors which the scheduler considers to be an indivisible unit to be invoked without interruption. The method is entitled "Pairwise Grouping of Adjacent Nodes" (nodes are equivalent to actors or blocks as defined in this document).
An example of the method disclosed by Bhattacharyya can be illustrated with reference to FIGS. 5-8b. FIG. 5 presents a multirate SDF graph which presents opportunities for looping in its scheduling. FIG. 6 illustrates an acyclic precedence graph (APEG) which represents the SDF graph of FIG. 5. FIG. 7 illustrates the hierarchy of clusters created by clustering two nodes at a time and FIG. 8b illustrates the looped schedule which can be derived from the hierarchical decomposition of FIG. 7. The table of FIG. 8a illustrates how the schedule of FIG. 8b can be derived from FIGS. 6 and 7. Because each of the clusters of FIG. 7 spans all invocations of the nodes which it subsumes, each cluster also corresponds to hierarchical clusters in the original SDF graph of FIG. 5. This correspondence is illustrated in FIG. 9.
Had the first and second invocations of D been consolidated with the first and second invocations of cluster4, however, the resulting APEG subgraph would not translate to the SDF graph of FIG. 5. This consolidation is shown in FIG. 10a, and the resulting schedule in FIG. 10b. The code size is increased substantially because the schedule loop that includes Actor D and cluster4 cannot encompass all of the invocations of cluster4. Such a schedule loop can span all invocations of cluster4 only if the ratio of invocations of D to invocations of cluster4 within that schedule loop is equal to the ratio of the total number of invocations of D to the total number of invocations of cluster4.
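The ratio condition can be expressed as a one-line check. The invocation counts used below are hypothetical, since the counts of FIGS. 5 through 10b are not reproduced here, and the function name `loop_spans_all` is likewise an assumption.

```python
from fractions import Fraction


def loop_spans_all(loop_d, loop_cluster, total_d, total_cluster):
    """True if a loop with the given per-iteration counts can cover every invocation.

    The per-loop ratio of D to cluster invocations must equal the ratio
    of their total invocation counts.
    """
    return Fraction(loop_d, loop_cluster) == Fraction(total_d, total_cluster)


# Hypothetical totals: D is invoked 2 times, the cluster 4 times.
one_to_one = loop_spans_all(1, 1, total_d=2, total_cluster=4)  # pairs D with one cluster firing
one_to_two = loop_spans_all(1, 2, total_d=2, total_cluster=4)  # pairs D with two cluster firings
```

With these hypothetical totals, a one-to-one pairing leaves cluster invocations outside the loop, whereas a one-to-two pairing covers them all.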
The method therefore must perform three steps to form an appropriate cluster from adjacent nodes (i.e. actors). First, it must select a node or hierarchical super node as the base node for the cluster, based on which node is most likely to be involved in the deepest level of a nested loop. Second, it must choose among the possible adjacent nodes with which to combine the base node. Third, it must verify that the candidate cluster will not result in a deadlocked graph. A deadlocked graph is one in which the cluster must produce data for another actor or node, yet itself requires data from that actor or node before it can fire. This condition is referred to as a directed delay-free loop.
The base node is the node with the highest frequency of invocation among those not already selected as a base node. The adjacent node is selected by choosing the candidate which matches up with the fewest base node invocations within a single invocation of the resulting cluster. Verifying that a cluster does not result in a deadlocked schedule requires that, for every cluster created, a reachability matrix be calculated for the resulting APEG which includes the new cluster. If the reachability matrix contains no nonzero diagonal elements, then no cycle was introduced through the combination of the two nodes. If any of the diagonal elements are nonzero, then the schedule will result in deadlock and the two nodes cannot be combined. FIG. 11a illustrates an APEG, and a reachability matrix for that APEG is shown in FIG. 11b.
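The diagonal test can be sketched as follows. The source does not state how the reachability matrix is computed; this example assumes Warshall's transitive-closure algorithm over an adjacency matrix, and the two small graphs are invented for illustration rather than taken from FIG. 11a.

```python
def reachability(adj):
    """Transitive closure of a directed graph (Warshall's algorithm, assumed here).

    reach[i][j] is True when node j can be reached from node i along a
    path of at least one arc.
    """
    n = len(adj)
    reach = [[bool(x) for x in row] for row in adj]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


def introduces_cycle(adj):
    """A nonzero diagonal entry means some node can reach itself, i.e. a cycle."""
    reach = reachability(adj)
    return any(reach[i][i] for i in range(len(adj)))


# Invented three-node graphs: a chain 0 -> 1 -> 2, and the same chain closed by 2 -> 0.
acyclic = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]
cyclic = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]

ok_to_cluster = not introduces_cycle(acyclic)
must_reject = introduces_cycle(cyclic)
```

The triple loop makes the cost of this check cubic in the number of nodes, and the check is repeated for every candidate cluster.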
The above-described prior art method is extremely complex and resource intensive. The creation of the APEG graphs for complex systems with high sample rates would require a large amount of data memory and computation time. Further, the reachability matrices, which must be calculated each time two nodes are clustered, demand additional memory space and computation. Still further, there are certain types of systems which this method may not handle optimally. Finally, this method does not appear to be easily adapted to optimize other performance concerns, such as latency or speed of execution.