The invention relates to methods for designing essentially digital devices, and focuses on memory related design issues, more in particular with respect to power consumption of said digital devices.
An essentially digital device comprises at least a memory organization (a number of memories with their sizes and an interconnection pattern) and registers. Such a memory organization is determined during the design process of said digital device. The operation of an essentially digital system can essentially be described as a set of data access operations or instructions on data structures or variables that are stored in said memories.
In [L. Stok, Data path synthesis, integration, the VLSI journal, Vol.1.18, pp.1-71, June 1994.] register allocation techniques, starting from a fully scheduled flow graph (thus ordered data access operations or instructions are used as input), are presented. Said allocation techniques are scalar oriented. Many of these techniques construct a scalar conflict or compatibility graph and solve the problem using graph coloring or clique partitioning. This conflict graph is fully determined by the schedule, which is fixed beforehand. This means that no effort is spent to come up with an optimal conflict graph, and thus the potential optimization by reconsidering the schedule is not exploited. Moreover, only register allocation is addressed and not memories.
In the less explored domain of memory allocation and assignment for hardware systems, the current techniques start from a given schedule [L. Ramachandran, D. Gajski, V. Chaiyakul, An algorithm for array variable clustering, Proceedings European Design and Test Conference, pp.262-266, Paris, March 1994.],[P. Lippens, J. van Meerbergen, W. Verhaegh, A. van der Werf, Allocation of Multiport Memories for Hierarchical Data Streams, Proceedings IEEE International Conference on Computer-Aided Design, pp.728-735, Santa Clara, November 1993.],[O. Sentieys, D. Chillet, J. P. Diguet, J. Philippe, Memory module selection for high-level synthesis, Proceedings IEEE workshop on VLSI signal processing, Monterey Calif., Oct. 1996.] or first perform a bandwidth estimation step [F. Balasa, F. Catthoor, H. DeMan, Dataflow-driven memory allocation for multi-dimensional processing systems, Proceedings IEEE International Conference on Computer Aided Design, San Jose Calif., November 1994.] which is a kind of crude ordering that does not really optimize the conflict graph either. These techniques have to operate on groups of signals instead of on scalars to keep the complexity acceptable.
In the parallel compiler domain [M. Al-Mouhamed, S. Seiden, A Heuristic Storage for Minimizing Access Time of Arbitrary Data Patterns, IEEE Transactions on Parallel and Distributed Systems, Vol.8, No.4, pp.441-447, Apr. 1997.] proposes a technique to partition arrays into groups of data that have to be assigned to different memories such that they can be accessed simultaneously for an SIMD architecture. They combine the constraints of a number of given access patterns into a single linear address transformation that calculates for every data element the memory in which it should be stored to minimize the total access time. This technique avoids the allocation of multi-port memories for storing data with self-conflicts, by explicitly splitting arrays into smaller arrays that can be assigned to single-port memories. However, said method does not exploit all optimization opportunities, for instance by rescheduling data access instructions.
In the scheduling domain, the techniques optimizing for the number of resources given the cycle budget mostly operate on the scalar level. Many of these techniques try to reduce the memory related cost by estimating the required number of registers for a given schedule. Only a few of them try to reduce the required memory bandwidth, which they do by minimizing the number of simultaneous data accesses. They do not take into account which data is being accessed simultaneously. Also, no real effort is spent to optimize the data access conflict graphs such that subsequent register/memory allocation tasks can do a better job.
[S. Pinter, Register Allocation with Instruction Scheduling: a New Approach, ACM SIGPLAN Notices, Vol.28, pp.248-25, June 1993.] optimizes a conflict graph in the context of scalar register allocation by removing weighted edges in a coloring problem prior to scheduling. However, the conflicts in their initial conflict graph are determined by the sequential ordering of the input code. Also this idea was not applied to groups of scalars.
The Improved Force Directed Scheduling (IFDS) [W. Verhaegh, P. Lippens, E. Aarts, J. Korst, J. van Meerbergen, A. van der Werf, Improved Force-Directed Scheduling in High-Throughput Digital Signal Processing, IEEE Transactions on CAD and Systems, Vol.14, No.8, August 1995.] shows a method wherein scheduling intervals are gradually reduced until the desired result is obtained. The cost function used to determine which scheduling interval has to be reduced at each iteration only takes into account the number of parallel data accesses, in order to reduce the required memory bandwidth. (I)FDS does not take into account which data is being accessed. Balancing the number of simultaneous data accesses is a local optimization which can be very bad globally. In IFDS all data is treated equally, although in practice some simultaneous data accesses are more expensive in terms of memory cost than others. Also, the required number of memories cannot be estimated accurately by looking locally only, as is done in IFDS, because all conflicts have to be considered for this.
In a first aspect of the invention a method and a design system for determining an optimized memory organization of an essentially digital device are presented. The design system may be a suitable computer such as a workstation for carrying out the method. The design system is adapted to carry out each of the method steps. Said method and system exploit a representation, comprising at least data access instructions on groups of scalar signals, of the functionality of said digital device, which is under construction. As the method and system focus on data transfer and storage, it is sufficient to have a control flow graph representation, although the method is not limited to such a representation. For said data access instructions the scheduling intervals are optimized, meaning modified, in order to optimize a certain optimization criterion, with the restriction that the execution of said functionality with said digital device is within a predetermined cycle budget or timing. The method and design system according to the present invention provide sufficient memory bandwidth (parallel memory ports) such that the application can be scheduled within the cycle budget during further digital device design steps. The method and design system according to the present invention solve a Storage-Bandwidth Optimization (SBO) problem. The method and system determine for which data parallel access capabilities should be provided such that the cycle budget can still be met with minimum bandwidth requirements on the memory architecture. These requirements are expressed as conflicts in a conflict graph. Access conflicts may be described as single-cycle or intra-cycle conflicts, because accesses that do not occur in the same cycle are not necessarily considered to be in conflict in accordance with the present invention, i.e. the lifetime of the data is not considered in a first approximation.
In said evaluation criterion the conflict cost between basic groups and self-conflicts of basic groups can be weighted separately. The task of SBO is to come up with an optimized conflict graph, allowing the memory allocation and assignment tasks to come up with a cheaper memory architecture with fewer memories and ports. In the method and design system, optimized scheduling intervals are determined by optimizing an extended conflict graph with respect to an evaluation criterion being related to the memory cost of said digital device. Finally, a selection of an optimized memory organization satisfying at least the constraints depicted by said optimized extended conflict graph, is performed.
In a second aspect of the invention said extended conflict graph is an undirected hyper-graph, comprising nodes representing said basic groups; binary edges representing data access conflicts between the two basic groups connected by said binary edge; hyper edges representing data access conflicts between at least three basic groups connected by said hyper edge; and self-edges representing data access conflicts of said basic group connected to itself by said self-edge. Each of said edges is associated with a triplet of numbers, the first number of said triplet defining the amount of simultaneous data accesses to said basic groups of said edges due to read instructions, the second number of said triplet defining the amount of simultaneous data accesses to said basic groups of said edges due to write instructions and the third number of said triplet defining the amount of simultaneous data accesses to said basic groups of said edges due to either read or write instructions, said triplet being characteristic for an at least partial scheduling of said data access instructions of said functional representation, wherein a partial scheduling comprises scheduling intervals. For every conflict, the maximum number of reads (R), the maximum number of writes (W), and the maximum total number of data accesses, i.e., reads or writes (RW), that can occur simultaneously must be known. This information is shown next to the conflict edges in the form R/W/RW.
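As an illustration of the R/W/RW triplets, the following minimal Python sketch derives an extended conflict graph from a fully fixed schedule. It is not the implementation of the invention: the function name build_ecg, the dictionary-based schedule format, the representation of a self-edge as a one-element set, and the choice to record an edge for every subset of the basic groups accessed in the same cycle are all assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def build_ecg(schedule):
    """Build a toy extended conflict graph (hypothetical helper).

    schedule: dict mapping cycle -> list of (basic_group, kind) accesses,
              where kind is "R" (read) or "W" (write).
    Returns a dict mapping frozenset(basic groups) -> (R, W, RW).
    A one-element frozenset stands for a self-edge, a two-element one for a
    binary edge, and larger ones for hyper edges (assumed encoding).
    """
    ecg = {}
    for accesses in schedule.values():
        reads = Counter(g for g, kind in accesses if kind == "R")
        writes = Counter(g for g, kind in accesses if kind == "W")
        groups = sorted(set(reads) | set(writes))

        # Self-edges: a basic group accessed more than once in this cycle.
        edges = [frozenset([g]) for g in groups if reads[g] + writes[g] > 1]
        # Binary and hyper edges: every subset of two or more distinct groups.
        for size in range(2, len(groups) + 1):
            edges.extend(frozenset(c) for c in combinations(groups, size))

        for edge in edges:
            r = sum(reads[g] for g in edge)
            w = sum(writes[g] for g in edge)
            old = ecg.get(edge, (0, 0, 0))
            # Keep the maxima over all cycles, each tracked separately.
            ecg[edge] = (max(old[0], r), max(old[1], w), max(old[2], r + w))
    return ecg


schedule = {
    0: [("A", "R"), ("B", "R")],
    1: [("A", "R"), ("B", "W")],
    2: [("C", "R"), ("C", "W")],
}
ecg = build_ecg(schedule)
```

In this example the edge between A and B receives the triplet 2/1/2: two reads coincide in cycle 0 and a read coincides with a write in cycle 1, but no cycle contains more than two accesses in total. This is why RW is stored separately rather than computed as R plus W.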
In a third aspect of the invention the optimization or evaluation criterion which is optimized with the method or the system according to the present invention takes into account which data is accessed in parallel and enables separate weighting of each of the basic group conflicts and each basic group self-conflict. Said evaluation criterion comprises an estimate of the chromatic number of the conflict graph, being defined as said extended conflict graph without self-edges and hyper-edges. Further, said evaluation criterion comprises the total amount of data accesses of each self-edge separately and a pair-wise basic group conflict cost, also for each basic group conflict separately. Said pair-wise basic group conflict costs take into account the sizes of said basic groups, the total amount of data accesses to said basic groups, and the bit width and word size of said basic groups.
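The structure of such a criterion can be sketched as follows. This is only an illustrative assumption: the text does not specify how the chromatic number is estimated or how the terms are combined, so the sketch substitutes a largest-degree-first greedy coloring (which yields an upper bound) and a plain weighted sum. The names greedy_chromatic_estimate, evaluation_cost, pair_cost and the weights w_chrom, w_self and w_pair are hypothetical.

```python
def greedy_chromatic_estimate(nodes, conflicts):
    """Upper bound on the chromatic number via greedy coloring (assumed method)."""
    adj = {n: set() for n in nodes}
    for a, b in conflicts:
        adj[a].add(b)
        adj[b].add(a)
    color = {}
    # Color high-degree nodes first; ties are broken by name for determinism.
    for n in sorted(nodes, key=lambda n: (-len(adj[n]), n)):
        used = {color[m] for m in adj[n] if m in color}
        color[n] = next(c for c in range(len(nodes)) if c not in used)
    return max(color.values(), default=-1) + 1


def evaluation_cost(nodes, conflicts, self_accesses, pair_cost,
                    w_chrom=1.0, w_self=1.0, w_pair=1.0):
    """Weighted sum of the three terms named in the text (hypothetical form).

    conflicts:     list of (group, group) binary conflicts, no self-edges.
    self_accesses: dict group -> total data accesses on its self-edge.
    pair_cost:     callable (group, group) -> cost; in practice it would
                   depend on sizes, access counts and bit widths.
    """
    chrom = greedy_chromatic_estimate(nodes, conflicts)
    self_term = sum(self_accesses.values())
    pair_term = sum(pair_cost(a, b) for a, b in conflicts)
    return w_chrom * chrom + w_self * self_term + w_pair * pair_term


nodes = ["A", "B", "C"]
path = [("A", "B"), ("B", "C")]
triangle = path + [("A", "C")]
cost = evaluation_cost(nodes, path, {"C": 10}, lambda a, b: 1.0)
```

For the path A-B-C two colors suffice, whereas the triangle needs three, which under this sketch corresponds to at least two and three single-port memories respectively.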
In a fourth aspect of the invention an optimized memory organization is selected which satisfies at least the constraints depicted by said optimized extended conflict graph, comprising assigning basic groups being in conflict either to different memories or assigning basic groups being in conflict to a multi-port memory having at least a number, defined by said third number of the triplet, of ports, at least a number, defined by said first number of the triplet, of said ports, having read capability, and at least a number, defined by said second number of said triplet, of said ports, having write capability. The Extended Conflict Graph (ECG) represents the constraints that have to be satisfied by the subsequent memory allocation and assignment tasks to be sure that the cycle budget can still be met later on during detailed scheduling. When two basic groups are in conflict, this conflict has to be resolved during memory allocation/assignment. This can be done in two ways: either the basic groups are assigned to two different memories, or they are assigned to a multi-port memory. In the latter case, the R/W/RW numbers associated with the conflict determine the number and type of ports that are minimally required on the multi-port memory to which these two basic groups are assigned: the memory must have at least RW ports, of which at least R must provide read capability and at least W must provide write capability. When more than two conflicting basic groups that are connected by a hyper edge in the ECG are assigned to a single memory, the R/W/RW numbers of the hyper edge determine the number and type of ports that are minimally required on the multi-port memory to which they are assigned.
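The port rule above can be checked mechanically for a candidate assignment. The sketch below is an assumption-laden illustration rather than the allocation method itself: the function required_ports, the encoding of the ECG as a dictionary from sets of basic groups to (R, W, RW) triplets, and the baseline of one read/write port for a memory without internal conflicts are all hypothetical.

```python
def required_ports(ecg, assignment):
    """Minimum port requirements per memory for a given assignment (sketch).

    ecg:        dict frozenset(basic groups) -> (R, W, RW); a one-element
                set encodes a self-edge (assumed encoding).
    assignment: dict basic group -> memory name.
    Returns dict memory -> (min read-capable ports, min write-capable ports,
    min total ports).
    """
    # Assumed baseline: every memory has at least one read/write port.
    need = {mem: (1, 1, 1) for mem in set(assignment.values())}
    for edge, (r, w, rw) in ecg.items():
        mems = {assignment[g] for g in edge}
        # Only an edge whose groups all share one memory constrains its ports;
        # a conflict spread over several memories is resolved by the split.
        if len(mems) == 1:
            mem = mems.pop()
            old = need[mem]
            need[mem] = (max(old[0], r), max(old[1], w), max(old[2], rw))
    return need


ecg = {frozenset("AB"): (2, 1, 2), frozenset("C"): (1, 1, 2)}
split = required_ports(ecg, {"A": "m0", "B": "m1", "C": "m1"})
merged = required_ports(ecg, {"A": "m0", "B": "m0", "C": "m1"})
```

Placing A and B in different memories leaves m0 as a single-port memory, while placing them together forces m0 to offer two ports, both read-capable and at least one write-capable. In both assignments the self-conflict of C requires a two-port memory.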
In a fifth aspect of the invention a method and a design system for solving said optimization problem are presented. Said method or system involves an iterative procedure, starting from an initial scheduling of said data access instructions. An initial value of the optimization or evaluation criterion is determined. In said evaluation criterion the probability of having a basic group conflict is taken into account. Note that a conflict graph is only defined for a given schedule. As here only probabilities of conflicts are known, a particular approach for determining a chromatic number in such a situation is needed and is thus presented in the invention. In the method or design system a plurality of possible scheduling interval reductions is determined. The effect of each of said reductions on the evaluation criterion is determined, and the best reduction (having the largest impact on the criterion) is selected. Said selected reduction is then executed. The set of possible scheduling intervals is then modified. Said procedure is repeated until no further reductions in the evaluation criterion can be found.
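The shape of this iterative procedure can be illustrated with a deliberately simplified Python sketch. Several assumptions are made that are not taken from the text: each access is placed uniformly and independently within its scheduling interval, the criterion is reduced to a weighted sum of pair-wise conflict probabilities (the chromatic number term is omitted), data dependencies are ignored, and a candidate reduction removes one cycle from either end of an interval. All function and parameter names are hypothetical.

```python
from itertools import combinations


def conflict_probability(a, b):
    """Probability that two accesses land in the same cycle (uniform model)."""
    (lo_a, hi_a), (lo_b, hi_b) = a, b
    overlap = max(0, min(hi_a, hi_b) - max(lo_a, lo_b) + 1)
    return overlap / ((hi_a - lo_a + 1) * (hi_b - lo_b + 1))


def cost(intervals, weight):
    """Simplified criterion: weighted sum of pair-wise conflict probabilities."""
    return sum(weight(x, y) * conflict_probability(intervals[x], intervals[y])
               for x, y in combinations(sorted(intervals), 2))


def reduce_intervals(intervals, weight=lambda x, y: 1.0):
    """Greedily apply the interval reduction that lowers the cost the most."""
    intervals = dict(intervals)
    current = cost(intervals, weight)
    while True:
        best = None
        for name, (lo, hi) in intervals.items():
            if lo == hi:
                continue  # already fixed to a single cycle
            # Candidate reductions: drop the first or the last cycle.
            for cand in ((lo + 1, hi), (lo, hi - 1)):
                c = cost({**intervals, name: cand}, weight)
                if c < current - 1e-12 and (best is None or c < best[0]):
                    best = (c, name, cand)
        if best is None:
            return intervals, current  # no reduction improves the criterion
        current, name, cand = best
        intervals[name] = cand


start = {"a": (0, 2), "b": (0, 0)}
final, final_cost = reduce_intervals(start)
```

Here access b is fixed in cycle 0 and access a may initially fall in cycles 0 to 2. The best reduction removes cycle 0 from the interval of a, after which the two accesses can no longer coincide and the loop stops because no further reduction lowers the criterion.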
In a sixth aspect of the invention the method and design system for determining an optimized memory organization are adapted for applications having a representation comprising manifest conditions, data-dependent conditions and loop bodies. In said method and said design system a preprocessing step is performed which determines a block cycle budget for disjunct blocks of said representation. Said block cycle budgets are then used as additional constraints within said determining of optimized schedule intervals. Said determining of block cycle budgets comprises optimizing an allowable conflict graph with respect to an evaluation criterion for said allowable conflict graph. An iterative procedure for finding an optimized allowable conflict graph is presented.
In a seventh aspect of the invention the determining of basic groups, being groupings of scalar signals, is presented for real-time multi-dimensional applications and network applications, with dynamically allocated data types.