The present invention relates generally to high-performance computer systems and more particularly to the automated design of such systems.
A vast number of devices and appliances, ranging from mobile phones and printers to cars, have embedded computer systems. The number of embedded computer systems in these devices already far exceeds the number of general-purpose computer systems such as PCs or servers, and will continue to greatly exceed it in the future.
The design process for embedded computer systems is different from that for general-purpose computer systems. There is greater freedom in designing embedded computer systems because there is often little need to adhere to standards in order to run a large body of existing software. Since embedded computer systems are used in very specific settings, they may be tuned to a much greater degree for certain applications. On the other hand, although the freedom to customize is greater and the benefits of customization are large, the revenue stream from a particular embedded computer system design is typically not sufficient to support a custom design.
In designing embedded computer systems, the design space generally consists of a processor with its associated Level-1 instruction cache, Level-1 data cache, and Level-2 unified cache, together with main memory. The number and type of functional units in the processor may be varied to suit the application. The size of each of the register files may also be varied. Other aspects of the processor, such as whether it supports speculation or predication, may also be changed. For each of the caches, the cache size, the associativity, the line size, and the number of ports can be varied. Given a subset of this design space, an application, and its associated data sets, a design objective is to determine a set of cost-performance optimal processors and systems. A given design is cost-performance optimal if there is no other design with higher performance and lower cost.
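The cost-performance optimality criterion stated above can be illustrated with a minimal Python sketch. The `Design` type, the design names, and the cost and performance figures are hypothetical and chosen only for illustration; the filter applies the definition literally (a design is dropped only when some other design is both strictly faster and strictly cheaper).

```python
from typing import List, NamedTuple


class Design(NamedTuple):
    name: str
    cost: float         # hypothetical units, lower is better
    performance: float  # hypothetical units, higher is better


def cost_performance_optimal(designs: List[Design]) -> List[Design]:
    """Keep each design for which no other design is both faster and cheaper."""
    return [
        d for d in designs
        if not any(
            other.performance > d.performance and other.cost < d.cost
            for other in designs
        )
    ]


# Illustrative design points (made-up numbers).
designs = [
    Design("A", cost=1.0, performance=1.0),
    Design("B", cost=2.0, performance=3.0),
    Design("C", cost=3.0, performance=2.0),  # B is faster and cheaper than C
    Design("D", cost=4.0, performance=5.0),
]

optimal = cost_performance_optimal(designs)
print([d.name for d in optimal])
```

In this made-up example, design C is excluded because design B offers higher performance at lower cost; the remaining designs each represent a different cost-performance trade-off.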
While designing the cache hierarchy, it is necessary to know how the processor behaves because there is some dependence between the processor and the cache hierarchy. When both are being designed together, a severe problem arises because one subsystem is weakly dependent on the behavior of the other. Currently, evaluating a particular cache design for a particular processor design requires generating the address trace for that processor design and running this trace through a cache simulator. To design the overall computer system, it is necessary to take the cross-product of all candidate cache subsystems and all candidate processor subsystems and to consider each resulting combination individually, which is extremely time consuming.
Because of the multi-dimensional design space, the total number of possible designs can be very large. Allowing even a few of the processor parameters to vary easily leads to a set of 40 or more processor designs. Similarly, there may be 20 or more possible cache designs for each of the three cache types.
For a typical test program, the sizes of the data, instruction, and unified traces are 450M (million), 1200M, and 1650M, respectively, and the combined address trace generation and simulation process takes 2, 5, and 7 hours, respectively. Even in a design space with only 40 processors and only 20 caches of each type, each cache has to be evaluated with the address trace produced by each of the 40 processors. Thus, evaluating all possible combinations of processors and caches takes (40×20×(2+5+7)) hours, which comes out to 466 days and 16 hours of around-the-clock computation. Such an evaluation strategy is clearly costly and unacceptable.
The present invention provides a system which simplifies and speeds up the process of designing a computer system by evaluating the components of the memory hierarchy for any member of a broad family of processors in an application-specific manner. The system uses traces produced by a reference processor in the design space for a particular cache design and characterizes the differences in behavior between the reference processor and an arbitrarily chosen processor. The differences are characterized as a series of "dilation" parameters which relate to how much the traces would expand because of the substitution of a target processor. In addition, the system characterizes the reference trace using a set of trace parameters that are part of a cache behavior model. The dilation and trace parameters are used to determine the factors for estimating the performance statistics of target processors with specific memory hierarchies. In a design space with 40 processors and 20 caches of each type, each cache hierarchy has to be evaluated with the address trace produced by only 1 of the 40 processors. Thus, evaluating all possible combinations of processors and caches takes only (1×20×(2+5+7)) hours, or 11 days and 16 hours of computation, rather than 466 days and 16 hours.
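The evaluation-time arithmetic above can be checked with a short Python sketch. The function and variable names are hypothetical; the per-trace hours (2, 5, and 7) and the counts of 40 processors and 20 caches per type are the figures given in the text.

```python
# Hours for combined trace generation and simulation, per trace type (from the text).
HOURS_PER_TRACE = {"data": 2, "instruction": 5, "unified": 7}


def evaluation_hours(num_processors: int, caches_per_type: int) -> int:
    """Total hours when every cache is simulated against every processor's trace."""
    return num_processors * caches_per_type * sum(HOURS_PER_TRACE.values())


def as_days_and_hours(hours: int) -> tuple:
    """Split a number of hours into (whole days, remaining hours)."""
    return divmod(hours, 24)


exhaustive = evaluation_hours(num_processors=40, caches_per_type=20)
reference_only = evaluation_hours(num_processors=1, caches_per_type=20)

print(exhaustive, as_days_and_hours(exhaustive))          # 11200 (466, 16)
print(reference_only, as_days_and_hours(reference_only))  # 280 (11, 16)
print(exhaustive // reference_only)                       # 40
```

Because only the processor count changes between the two cases, the simulation time drops by a factor equal to the number of processors in the design space, here 40.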
The present invention provides a process for determining the performance of a computer system for a specific target processor, application, and cache hierarchy. A user or separate design system is subsequently responsible for selecting the particular cache used at each level, based on the performance results provided by this process for a set of cache configurations.
The present invention further provides for simulation of all the target cache hierarchies of interest with respect to the reference processor and evaluation of the cache hierarchies with respect to any other target processors. The code characteristics of the reference processor and an arbitrarily selected processor are determined and used to derive the dilation parameters and factors to determine the performance statistics of the target processors.
The present invention still further provides a method for quickly determining the dilation parameters and factors.
The present invention still further provides for evaluation of general-purpose systems using the dilation parameters.
The present invention produces relative computer system performance metrics, viz. the numbers of data, instruction, and unified cache misses, for any design point in a simulation-efficient manner.