The invention relates to methods for designing essentially digital devices, and focuses on memory related design issues, more in particular with respect to power consumption of said digital devices.
The use of a memory hierarchy for optimizing at least the power consumption of said essentially digital devices, performing data-dominated applications, has been demonstrated in [L. Nachtergaele, F. Catthoor, B. Kapoor, D. Moolenaar, S. Janssen, “Low power storage exploration for H.263 video decoder”, IEEE workshop on VLSI signal processing, Monterey Calif., October 1996.] for an H.263 video decoder and in [S. Wuytack, F. Catthoor, L. Nachtergaele, H. De Man, “Power Exploration for Data Dominated Video Applications”, Proceedings IEEE International Symposium on Low Power Design, Monterey, pp.359-364, August 1996.] for a motion estimation application. The idea of using a memory hierarchy to minimize the power consumption of said essentially digital devices is based on the fact that memory power consumption depends primarily on the access frequency and the size of the memories. Power savings can be obtained by accessing heavily used data from smaller memories instead of from large background memories. Such an optimization requires architectural transformations that consist of adding layers of smaller and smaller memories to which frequently used data can be copied. Memory hierarchy optimization introduces copies of data from larger to smaller memories in the data flow graph. This means that there is a trade-off involved: on the one hand, power consumption is decreased because data is now read mostly from smaller memories, while on the other hand, power consumption is increased because extra memory transfers are introduced. Moreover, adding another layer of hierarchy can also have a negative effect on the area and interconnect cost, and as a consequence also on the power consumption because of the larger capacitance involved. The memory hierarchy design task has to find the best solution for this trade-off.
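The trade-off described above can be illustrated numerically. The following is a minimal sketch only: the square-root energy model, the function names and all numbers are assumptions chosen for illustration and are not taken from the cited work.

```python
import math

def energy_per_access(words):
    # Hypothetical model: access energy grows with the square root of memory size.
    return math.sqrt(words)

def energy_flat(main_words, reads):
    # Every read goes to the large background memory.
    return reads * energy_per_access(main_words)

def energy_two_level(main_words, small_words, reads, copies):
    # Each copied word costs one read of the large memory and one write of the small one.
    copy_cost = copies * (energy_per_access(main_words) + energy_per_access(small_words))
    # All processor reads are then served by the small memory.
    read_cost = reads * energy_per_access(small_words)
    return copy_cost + read_cost

flat = energy_flat(10000, 1000)                      # no hierarchy
reused = energy_two_level(10000, 100, 1000, 100)     # each copied word is read ten times
no_reuse = energy_two_level(10000, 100, 1000, 1000)  # each copied word is read only once
print(flat, reused, no_reuse)
```

Under these assumed numbers the extra layer pays off when copied data is reused, and costs more than the flat organization when every word is copied for a single read.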
Some custom memory hierarchy experiments on real-life applications can be found in literature but often the search space, being the potential memory hierarchies which can be exploited, is not extensively examined. No systematic exploration method for deciding on which memory hierarchy is optimal with respect to power consumption of said digital system is known.
Memory hierarchy design for power optimization is basically different from caching for performance optimization [J. Fang, M. Lu, “An iteration partition approach for cache or local memory thrashing on parallel processing”, IEEE Transactions on Computers, Vol.C-42, No.5, pp.529-546, May 1993.], [D. Kulkarni, M. Stumm, “Linear loop transformations in optimizing compilers for parallel machines”, Technical report, Comp. Systems Res. Inst., University of Toronto, Canada, October 1994.], [N. Manjiakian, T. Abdelrahman, “Reduction of cache conflicts in loop nests”, Technical report CSRI-318, Comp. Systems Res. Inst., University of Toronto, Canada, March 1995.], [M. Jimenez, J. Llaberia, A. Fernandez, E. Morancho, “A unified transformation technique for multi-level blocking”, Proceedings EuroPar Conference, Lyon, France, August 1996. “Lecture notes in computer science” series, Springer Verlag, pp.402-405, 1996.]. The latter determines how to fill the cache such that data has been loaded from main memory before it is needed. Instead of minimizing the number of transfers, the number of transfers is often increased to maximize the chance of a cache hit, leading to wasted power by pre-fetching data that may never be needed.
It is the aim of the invention to present a formalized method and system for part of the memory-related design decisions involved in designing an essentially digital device. The method and system define how to traverse and how to limit the search space being examined while solving these memory-related design decisions. The method and system focus on the power consumption of said essentially digital device.
A basic idea of the invention is that power consumption in a digital device can be optimized by introducing a particular organization in the memory such that frequently accessed data can be obtained from smaller memories instead of from the often large main memory. In the invention, data reuse possibilities are examined. Such a memory organization defines a memory hierarchy and distributes the memory space needed in said digital device over a plurality of memories. The invented optimization of said power consumption aims not only at minimizing power consumption but also at preventing large increases of total memory area. Note that power consumption minimization results in less energy consumption of said digital device, which can be of practical importance, for instance in a wireless context. The power dissipation is also reduced, which can possibly avoid packaging problems.
A digital device (6, FIG. 1) comprises a processor (2) with its own local registers (3) and a memory part (1). The memory can be organized in several manners. It can be only one off-chip memory or a serial and/or parallel concatenation of memories (partly off-chip and on-chip). In the invention such a concatenation (further called a memory hierarchy) is assumed. Said hierarchy comprises a main memory (4) and intermediate memories (5). Memory layers can be defined. Two memories belong to the same layer when said memories are functionally at the same distance from the processor, meaning that the same number of memories lies between said memories of said layer and said processor.
In the invention a method for determining an optimized memory hierarchy or organization (how many memories, which size for each memory, interconnection patterns of said memories, for example), such that the digital device can run with an optimal and/or minimal power consumption, is presented. Said optimized memory hierarchy construction is based on examination of data reuse possibilities.
A digital device (6) has a certain functionality, which can be represented by code (24), written in a suitable programming language. The invention focuses on data-dominated applications which are defined as applications wherein power consumption due to accessing data in memories dominates the power consumption due to arithmetic operations. Loop constructs (8) and read instructions (9) on array signals (10), being often defined also as multidimensional data types, are therefore considered to be most important with respect to power consumption optimization.
Indeed each read instruction means accessing a memory. The goal of introducing a memory hierarchy is to ensure that the required data is obtained from a smaller (and thus less power consuming) memory. The data stored in the smaller memories is preferably reused, thus improving efficiency. It must be emphasized that introducing a memory hierarchy also requires some code modifications: each read instruction must indicate from which memory the data is read. The invention presents a method for determining potential memory organizations. Note that, for the method in accordance with the present invention, the digital device (6) under consideration need only be described with code describing its functionality. The digital system being considered and optimized thus need not yet be realized in hardware. The method according to the invention can be exploited in the design steps for such a digital system. Said potential memory organizations can thus serve as indications of how to organize the actual memory of said digital system. The actual assignment of array signals to particular memories is not necessarily part of the method according to the invention, but is not excluded therefrom. Therefore, the method uses the term “copy-candidates”, being memories with a particular size, in order to emphasize that candidates are considered rather than final memories, and that a decision about the actual memories is not necessarily taken but is not excluded.
The storage of data in a plurality of memories requires also additional reads from the memories in the memory organization and thus leads to additional power consumption. The method according to the invention can include the step of checking whether introducing memory hierarchy helps in optimizing and/or minimizing power or not and whether such a power optimization can be done without a large increase of the total memory size of the memories of said digital device. Hence, the present invention includes optimization of memories based on one or more evaluation criteria. For example, additional power consumption due to more interconnects can also be taken into account.
Data is heavily accessed when it is read or written from within a loop. A loop construct (8) typically comprises an introductory statement (FOR), defining the loop iterators, a loop body, comprising arithmetic operations, write and read instructions (9), and a finalizing statement (ENDFOR), closing the loop. When loops are embedded in other loops, a loop nest is defined. A loop nest can be characterized by its loop depth.
In a first aspect of the invention a method and system for determining an optimized memory organization of an essentially digital device is described. The method and system exploit the code description of the functionality of said digital device and essentially use the loop constructs (loop nests) and read instructions on array signals found in said code.
The present invention includes an automated design system for determining an optimized memory organization of an essentially digital device represented by a code describing the functionality of said digital device, the code comprising read instructions on array signals, the design system comprising: a first computing device for determining for at least one array signal and for at least one read instruction on said array signal a hierarchical chain with a plurality of reusable data groupings; a second computing device for evaluating for combinations of said reusable data groupings an evaluation criterion; a third computing device for selecting the combination of said reusable data groupings with the optimal and/or lowest value of the evaluation criterion; and a fourth computing device for outputting said optimized memory organization determined from said selected combination of said reusable data groupings. The first to fourth computing devices may be comprised by a single computer such as a work station running suitable software for implementing the present invention and using the code describing the digital device as an input. The evaluation criterion may be power consumption. The second computing device may also evaluate combinations in accordance with a further evaluation criterion, e.g. memory size, and the third computing device may select in accordance with a combination of both the evaluation and the further evaluation criteria. The system includes determining the actual physical memories of the digital device in accordance or in dependence upon the output of the fourth computing device.
The present invention includes a method for determining an optimized memory organization of an essentially digital device, comprising the steps of: loading code describing the functionality of said digital device; said code comprising read instructions on array signals; determining for at least one array signal and for at least one read instruction on said array signal a hierarchical chain with a plurality of reusable data groupings; evaluating for combinations of said reusable data groupings an evaluation criterion; selecting the combination of said reusable data groupings with the optimal and/or lowest value of the evaluation criterion; determining from said selected combination of said reusable data groupings said optimized memory organization.
In the method, five main steps (19, FIG. 2), (20), (21), (22), (23) can be identified, being copy-candidate chain construction, copy-candidate tree construction, copy-candidate graph construction, copy-candidate reuse set construction and finally selection, respectively.
The method and system explore data reuse possibilities by evaluating the effect on power consumption and/or memory size of considering reusable data groupings.
The method steps are performed for essentially each array signal in said code separately and must be performed for at least one array signal. Naturally, some array signals can be excluded when the designer thinks that this array signal will not significantly contribute to the power consumption of said digital device. All the considered array signals can be represented in a set (11, FIG. 2). All the read instructions on one array signal can be represented as a subset (12) of the set of all read operations on array signals (11) in said code.
In a first step (19) for each read instruction in said code on the array signal under consideration a copy-candidate chain (33, FIG. 3) is constructed. This results in a set of copy-candidate chains (13, FIG. 2). A copy-candidate chain comprises a serial concatenation of copy-candidates, each representing a memory which can potentially be used in the final digital device. Said copy-candidate chains can be grouped in a set (15) of copy-candidate chains associated with a particular array signal. Said copy-candidate chain can be denoted as a hierarchical chain of reusable data groupings.
In a second step (20, FIG. 2) from at least part of said copy-candidate chains a copy-candidate tree (26) is constructed. In such a copy-candidate tree at least the root node of said copy-candidate tree is equal to the root node of said part of copy-candidate chains, which are equal by construction. The copy-candidate tree comprises at least a subset of the copy-candidates of said part of said copy-candidate chains. Said copy-candidate trees define a copy-candidate tree set (15). Note that for copy-candidate tree construction at least one copy-candidate chain is necessary. Not all copy-candidate chains of an array signal are necessarily involved, however. Indeed read instructions on an array signal can be grouped. For each group a copy-candidate tree can be constructed.
In a third step (21) for at least one of said copy-candidate trees a copy-candidate graph (27, FIG. 3) is constructed. Such a graph comprises nodes (28), (30), representing memories/copy-candidates, and edges (29), representing data-transfers between said memories/copy-candidates. For each grouped set of read instructions on an array signal a copy-candidate graph, being part of the copy-candidate graph set (16), can be constructed.
In a fourth step (22) for at least one of said copy-candidate graphs a copy-candidate reuse set (17), (31) is constructed. Each element in said set represents a possible memory hierarchy.
Such an element of said copy-candidate reuse set can also be denoted as a combination of reusable data groupings. Indeed such a combination can be represented by nodes and their interconnecting edges.
In a fifth and last step (23) for at least one group of read instructions on an array signal one element is selected from the copy-candidate reuse set, associated with said group of read instructions on said array signal and said selected element is placed in an optimal copy-candidate reuse set (18). The elements of said optimal copy-candidate reuse set define said optimized memory organization of said digital device for a group of read instructions on an array signal.
Note that said copy-candidate chain, copy-candidate, copy-candidate graph, copy-candidate reuse set all represent data reuse possibilities. Said copy-candidate chain can equivalently be denoted as a hierarchical chain of reusable data groupings. Said combination of reusable data groupings is an element of the copy-candidate reuse set and can be constructed from said copy-candidate graph.
In an embodiment of the invention said steps are performed for all array signals in said code, describing the functionality of said essentially digital device, selected by the designer of said digital device.
In an embodiment of the invention copy-candidate graphs are constructed from all copy-candidate chains on an array signal, meaning that thus all read instructions of said array signal are grouped in one set. This results in one copy-candidate reuse set for said array signal. The optimal element selected from said copy-candidate reuse set represents then an optimized memory organization of said digital device for said array signal.
In an embodiment said selection and placing of an element of a copy-candidate reuse set is done after calculating an evaluation criterion for essentially all elements of said copy-candidate reuse sets. Then for each of said copy-candidate reuse sets, the element with the best evaluation criterion is placed in an optimal copy-candidate reuse set (18) whereby said elements in said optimal copy-candidate reuse set represent said optimized memory organization of said digital device.
In an embodiment of the invention said evaluation criterion can be an estimate of the power consumption of said digital device when said digital device would have the memory organization as represented by said element in said copy-candidate reuse set. In an embodiment, said evaluation criterion can take into account, besides power consumption estimates, an estimate of the total memory area of such a memory organization.
It should be emphasized that the method according to the invention works on an array-signal-by-array-signal basis.
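The selection step with a combined criterion could be sketched as follows. This is an illustrative assumption, not the method prescribed by the text: the weighted sum, the name `select_best`, the parameter `area_weight` and the power and area figures are all hypothetical.

```python
def select_best(elements, power_of, area_of, area_weight=0.0):
    # Pick the element with the lowest combined cost.
    # The weighted sum is an assumed way to combine the two criteria.
    return min(elements, key=lambda e: power_of[e] + area_weight * area_of[e])

# Hypothetical elements of one copy-candidate reuse set (paths through memories).
elements = [("M",), ("M", "B"), ("M", "B", "C")]
power_of = {("M",): 100000.0, ("M", "B"): 21000.0, ("M", "B", "C"): 20500.0}
area_of = {("M",): 10000.0, ("M", "B"): 10100.0, ("M", "B", "C"): 10600.0}

power_only = select_best(elements, power_of, area_of)
power_and_area = select_best(elements, power_of, area_of, area_weight=2.0)
print(power_only, power_and_area)
```

With these assumed figures, power alone favors the three-level organization, while weighting area favors the two-level one, which shows how the further criterion can prevent a large increase of total memory size.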
In a second aspect of the invention a method for constructing a copy-candidate chain for a read instruction on an array signal is presented. A copy-candidate chain (33) comprises a serial concatenation of memories/copy-candidates. The first memory/copy-candidate of the chain is a memory (4) of the digital device, which can store the whole array signal. A loop construct (8) results in a repetitive execution of the loop body. Each execution is called an iteration. A read instruction accesses some data from the main memory (as long as no memory hierarchy is used, of course). As the data accessed by the considered read instruction during different iterations can differ in size, the maximum size over all iterations of the considered loop construct must be determined. This maximum size determines a so-called copy-candidate. This is a representation of a memory, to be put in the copy-candidate chain and possibly later in the memory hierarchy. Its size is equal to the maximum size found for that loop construct. Although it is of no use to introduce copy-candidates of size one (as this would be just a register), excluding them is preferably not imposed as a constraint in the method. The exploration method will reject this possibility anyway. The copy-candidates are organized in a chain. The copy-candidates are ordered by size, and copy-candidates of the same size are represented only once. Note that said ordering of copy-candidates is determined by the ordering of the loop constructs with which said copy-candidates are associated. Copy-candidates associated with an outer loop construct are by construction larger than or equal to copy-candidates associated with an inner loop construct. The copy-candidates can be represented by blocks (as in (25)) or by nodes (as in (27)). The size of the blocks represents the size of the memories. More formally, the nodes have a number, and said number represents the size of the memories.
It is clear that the nesting of loop nests introduces a hierarchy between copy-candidates, or reusable data groupings; thus a hierarchical chain of reusable data groupings is introduced.
In an embodiment of the present invention said number can be found by determining the bit width and the word depth of said memories. The bit width is fixed for all memories/copy-candidates of one copy-candidate chain and copy-candidate tree because the bit width is determined by the required bit width of the array signal under consideration. Note that in the above defined procedure a read operation is considered to be embedded in at least one loop (loop construct). Indeed it is assumed that only array read instructions embedded in loops dominate power consumption, as these are frequently executed. It must also be clear that the first memory/copy-candidate of the chain is just the main memory (with its size determined by the array signal under consideration).
This procedure is illustrated for a piece of code (34). Loop construct 3 has a set of copy-candidates of size one. Implicitly this candidate is rejected. Loop construct 2 has a set of copy-candidates of size six. The maximum size is naturally six, which is a valid copy-candidate. For the last loop construct 1 a copy-candidate of size six is also found. Naturally this is merged in the chain with the copy-candidate of the previous loop construct.
In a third aspect of the invention a method for constructing a copy-candidate tree (26) for an array signal from the corresponding copy-candidate chains (25) is presented. The root node of the copy-candidate tree is the root node of the copy-candidate chains on which the tree is based, and thus equals the main memory/copy-candidate (4)(30). The copy-candidate tree construction comprises at least identifying said root node as the root node of said copy-candidate chains, being equal by construction.
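The chain construction can be sketched by brute-force enumeration of a loop nest. This is a minimal sketch under stated assumptions: the function `copy_candidate_chain`, the 16-word array and the index expression `i + k` are hypothetical and only chosen for illustration. For each loop level, the iterators of that loop and of all outer loops are fixed, the inner loops are swept, and the largest number of distinct array elements touched is taken as the copy-candidate size.

```python
from itertools import product

def copy_candidate_chain(array_size, loop_ranges, index_fn):
    """Return copy-candidate sizes, main memory first, then ever smaller sizes."""
    sizes = []
    for level in range(1, len(loop_ranges) + 1):
        outer, inner = loop_ranges[:level], loop_ranges[level:]
        largest = 0
        for fixed in product(*outer):            # one iteration of the loop at this level
            touched = {index_fn(*fixed, *rest) for rest in product(*inner)}
            largest = max(largest, len(touched)) # maximum size over all iterations
        sizes.append(largest)
    chain = [array_size]                         # the main memory holds the whole array
    for size in sizes:                           # outer-to-inner sizes never increase
        if size < chain[-1]:                     # equal sizes are represented only once
            chain.append(size)
    return chain

# Hypothetical nest: for i in 0..10, for j in 0..3, for k in 0..5: read A[i + k]
chain = copy_candidate_chain(16, [range(11), range(4), range(6)],
                             lambda i, j, k: i + k)
print(chain)
```

The size-one candidate of the innermost loop is deliberately kept in the chain here, since the text leaves its rejection to the later exploration rather than to a construction constraint.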
The copy-candidate chains can share a node (a copy-candidate, a memory of the hierarchy) when their corresponding copy-candidates contain essentially the same data. Therefore in an embodiment of said third aspect of the invention in said construction of said copy-candidate tree at least nodes of said copy-candidate chains which contain essentially the same data can be shared.
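Sharing of nodes between chains might be sketched as a prefix merge. The representation is an assumption: each copy-candidate is a hypothetical `(label, size)` pair, and two candidates are taken to contain essentially the same data exactly when their pairs are equal.

```python
def build_tree(chains):
    """Merge copy-candidate chains into a tree given as node -> list of children."""
    root = chains[0][0]
    children = {root: []}
    for chain in chains:
        if chain[0] != root:
            raise ValueError("all chains must start at the same main memory")
        parent = root
        for node in chain[1:]:
            if node not in children[parent]:  # share a node that holds the same data
                children[parent].append(node)
            children.setdefault(node, [])
            parent = node
    return root, children

# Three hypothetical chains for read instructions on one array signal A.
chains = [
    [("A", 16), ("win", 6), ("r1", 1)],
    [("A", 16), ("win", 6), ("r2", 1)],
    [("A", 16), ("col", 4)],
]
root, children = build_tree(chains)
print(children[root], children[("win", 6)])
```

In this example the first two chains share the six-word candidate, while the third chain branches directly from the main memory.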
Said construction of a copy-candidate tree does not necessarily exploit all copy-candidate chains of said array signal. Indeed read instructions on an array signal can be grouped in different groups. Such a group contains at least one read instruction. A read instruction on an array signal defines a copy-candidate chain. A copy-candidate tree can be constructed for each group of read instructions from the corresponding copy-candidate chains.
In an embodiment of the present invention a copy-candidate tree is constructed for all read instructions of an array signal and thus all corresponding copy-candidate chains.
In an embodiment of the invention a copy-candidate tree is constructed for all read instructions of an array signal being in the same loop nest and thus all corresponding copy-candidate chains.
In a fourth aspect of the invention a method for constructing a copy-candidate graph (27) from a copy-candidate tree (26) is presented. A copy-candidate graph consists of nodes. Each node represents a copy-candidate, being a potential memory of the optimized memory hierarchy. For each node of said copy-candidate tree edges from all its ancestor nodes are drawn. An edge (29) represents possible data transfer between the copy-candidates (memories) that are connected by said edge. Each edge gets a weighting factor. Said weighting factor is an estimate of the amount of data-transfer. It must be emphasized that the first node of the graph represents the main memory (30)(4).
In a fifth aspect of the invention a method for constructing a copy-candidate reuse set from a copy-candidate graph is presented. This is done by selecting at least one path from the root node of said copy-candidate graph to one of the leaf nodes of said copy-candidate tree, whereby each path (32) is represented by an element of said copy-candidate reuse set (31). Note that not all the nodes of the copy-candidate graph must be part of a possible path. Only data transfer from the top node (the main memory) (30) should always be incorporated by definition.
In the sixth aspect of the invention a method for calculating an estimate of the power consumption of an element of a copy-candidate reuse set is presented. The power contribution for one read and write instruction from a particular copy-candidate (memory) is to be determined. The power is found by using a power model, which is typically a function of the size of the copy-candidate (memory). This power for one instruction must then be multiplied by the number of times the instruction gets executed, thus the number of read and write operations respectively, or more generally the data transfer from and to said copy-candidate (memory). Said information (size, amount of transfer) can be found in the copy-candidate graph for each element of the copy-candidate reuse set, by tracking the relevant path (its edges and nodes). Finally all the contributions along such a path must be summed.
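The graph construction can be sketched as a tree walk that draws an edge from every ancestor to each node. The text does not state how the weighting factor is estimated, so the sketch takes it from a caller-supplied function; `transfer_estimate`, the node names and the numbers are hypothetical.

```python
def build_graph(root, children, transfer_estimate):
    """Return weighted edges (ancestor, node) -> estimated amount of data transfer."""
    edges = {}

    def visit(node, ancestors):
        for ancestor in ancestors:            # an edge from every ancestor, not only the parent
            edges[(ancestor, node)] = transfer_estimate(ancestor, node)
        for child in children.get(node, []):
            visit(child, ancestors + [node])

    visit(root, [])
    return edges

# Hypothetical tree: main memory M, intermediate candidate B, smaller candidate C.
children = {"M": ["B"], "B": ["C"], "C": []}
assumed_transfers = {("M", "B"): 100, ("M", "C"): 600, ("B", "C"): 600}
edges = build_graph("M", children, lambda src, dst: assumed_transfers[(src, dst)])
print(edges)
```

The edge from M directly to C represents the option of bypassing the intermediate candidate B.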
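Enumerating the elements of a copy-candidate reuse set can be sketched as listing every path from the root to a chosen leaf. This is an illustrative sketch: the function `enumerate_paths` and the four-node graph are assumptions, and the edges are taken to run from every ancestor to each node.

```python
def enumerate_paths(edges, root, leaf):
    """List all paths from root to leaf; each path is one candidate memory hierarchy."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    paths = []

    def extend(path):
        last = path[-1]
        if last == leaf:
            paths.append(path)
            return
        for nxt in successors.get(last, []):
            extend(path + [nxt])

    extend([root])                            # every path starts at the main memory
    return paths

# Hypothetical chain M > B > C > D with an edge from every ancestor to each node.
nodes = ["M", "B", "C", "D"]
edges = [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]
paths = enumerate_paths(edges, "M", "D")
print(paths)
```

With two intermediate nodes that may each be used or skipped, four paths result, ranging from the direct transfer M to D up to the full chain.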
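The summation along a path might look as follows. This is a sketch under several assumptions that the text does not specify: a single square-root energy model is used for both reads and writes, each transfer over an edge costs one read of the source and one write of the destination, and the processor reads all its data from the last memory on the path. The names `path_power` and `processor_reads` are hypothetical.

```python
import math

def path_power(path, sizes, transfers, processor_reads, energy):
    """Sum the estimated contributions of all memories along one path."""
    total = 0.0
    for src, dst in zip(path, path[1:]):
        amount = transfers[(src, dst)]        # edge weight: amount of data copied
        total += amount * (energy(sizes[src]) + energy(sizes[dst]))
    # The read instruction itself is served by the last memory on the path.
    total += processor_reads * energy(sizes[path[-1]])
    return total

sizes = {"M": 10000, "B": 100}                # node numbers: memory sizes in words
transfers = {("M", "B"): 100}                 # assumed edge weight
flat = path_power(["M"], sizes, transfers, 1000, math.sqrt)
two_level = path_power(["M", "B"], sizes, transfers, 1000, math.sqrt)
print(flat, two_level)
```

Both the node sizes and the edge weights are read from the graph by following the path, as the text describes.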