The present invention relates to computers and, more particularly, to a method for selecting a cache design for a computer system. A major objective of the invention is to provide a method for quantitatively estimating the performance of alternative cache designs for incorporation in a given computer system.
Much of modern progress is associated with the proliferation of computers. While much attention is focused on general-purpose computers, application-specific computers are even more prevalent. (Application-specific computers typically incorporate one or more custom-designed integrated circuits, referred to as “application-specific integrated circuits” or “ASICs”.) Such application-specific computers can be found in new device categories, such as video games, and in advanced versions of old device categories, such as televisions.
A typical computer includes a processor and main memory. The processor executes program instructions, many of which involve the processing of data. Instructions are read from main memory, and data is read from and written to main memory. Advancing technology has provided faster processors and faster memories. As fast as memories have become, they remain a computational bottleneck; processors often have to idle while requests are filled from main memory.
Caches are often employed to reduce this idle time. Caches intercept requests to main memory and attempt to fulfill those requests using memory dedicated to the cache. To be effective, caches must be able to respond much faster than main memory; to achieve the required speed, caches tend to have far less capacity than does main memory. Due to their smaller capacity, caches can normally hold only a fraction of the data and instructions stored in main memory. An effective cache must employ a strategy that provides that the probability of a request for main-memory locations stored in the cache is much greater than the probability of a request for main-memory locations not stored in the cache.
There are many types of computer systems that use caches. A single pedagogical example is presented at this point to illustrate some of the issues regarding selection of a cache design. The application is a “set-top” box designed to process digital television signals in accordance with inputs received from the signal itself, from panel controls, and from remote controls over a digital infrared link. The set-top box includes a 100 MHz 32-bit processor. This processor accesses instructions and data in 32-bit words. These words are arranged in 2^20 addressable 32-bit word locations of main memory. Program instructions are loaded into main memory from flash memory automatically when power is turned on. The processor asserts 30-bit word addresses; obviously, only a small fraction of these correspond to physical main-memory locations.
A single cache design can involve one or more caches. There are level-1 and level-2 caches. In a Harvard architecture, there can be separate caches for data and for instructions. In addition, there can be a write buffer, which is typically a cache used to speed up write operations, especially in a write-through mode. Also, the memory management units for many systems can include a translation look-aside buffer (TLB), which is typically a fully associative cache.
In the pedagogical example, the cache is an integrated data/instruction cache with an associated write buffer. The main cache is a 4-way set-associative cache with 2^10 addressable 32-bit word locations. These are arranged in four sets. Each set has 2^6 line locations, each with a respective 6-bit index. Each line location includes four word locations.
When the processor requests a read from a main-memory address, the cache checks its own memory to determine if there is a copy of that main-memory location in the cache. If the address is not represented in the cache, a cache “miss” occurs. In the event of a miss, the cache fetches the requested contents from main memory. However, it is not just the requested word that is fetched, but an entire four-word line (having a line address constituted by the most significant 28 bits of the word address).
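The relationship between a word address and its line address can be illustrated with a short sketch. It assumes only the figures given in the example (30-bit word addresses, four-word lines); the function names `split_word_address` and `line_word_addresses` are hypothetical and chosen for illustration.

```python
WORD_ADDRESS_BITS = 30   # the processor asserts 30-bit word addresses
WORDS_PER_LINE = 4       # each cache line holds four words
OFFSET_BITS = 2          # log2(WORDS_PER_LINE)


def split_word_address(word_address):
    """Return (line_address, word_offset) for a 30-bit word address."""
    if not 0 <= word_address < (1 << WORD_ADDRESS_BITS):
        raise ValueError("word address does not fit in 30 bits")
    # Dropping the two least significant bits leaves the 28-bit line address.
    line_address = word_address >> OFFSET_BITS
    # The two dropped bits give the word's position within its line.
    word_offset = word_address & (WORDS_PER_LINE - 1)
    return line_address, word_offset


def line_word_addresses(word_address):
    """Return the four word addresses fetched together on a miss."""
    line_address, _ = split_word_address(word_address)
    base = line_address << OFFSET_BITS
    return [base + i for i in range(WORDS_PER_LINE)]
```

For instance, a miss on word address 0x1237 would, under these assumptions, fetch the four words at 0x1234 through 0x1237.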
This fetched line is stored in a line location of the cache. The line must be stored at a cache line location having an index that matches the six least significant bits of the address of the fetched line. There is exactly one such location in each of the four cache sets; thus, there are four possible storage locations for the fetched line. A location without valid contents is preferred for storing the fetched line over a location with valid data. A location with less recently used contents is preferred to one with more recently used data. In the event of ties, the sets are assigned an implicit order so that the set with the lowest implicit order is selected for storing the fetched line.
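The placement rules above can be sketched as follows. This is a minimal illustration, not the design's actual logic: the per-set state is represented as a hypothetical `(valid, last_used)` pair, where `last_used` is an assumed timestamp (larger means more recent), and the implicit order of the sets is taken to be their position in the list.

```python
INDEX_BITS = 6  # each set has 2^6 line locations


def line_index(line_address):
    """Index of a line: the six least significant bits of its line address."""
    return line_address & ((1 << INDEX_BITS) - 1)


def choose_set(candidates):
    """Pick the set that will store a fetched line.

    `candidates` lists, in implicit set order, the state of the one line
    location in each set whose index matches the fetched line. Each entry
    is (valid, last_used).
    """
    def preference(set_number):
        valid, last_used = candidates[set_number]
        # 1) locations without valid contents come first (False < True);
        # 2) among valid locations, the less recently used one comes first;
        # 3) remaining ties go to the lowest implicit set order.
        return (valid, last_used if valid else 0, set_number)

    return min(range(len(candidates)), key=preference)
```

With four sets, `choose_set` always returns a number from 0 to 3, so the fetched line has exactly four possible destinations, as described above.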
The cache includes a write buffer that is used to pipeline write operations so as to speed them up in write-through mode. In write-through mode, processor writes are written directly to main memory. The write buffer is one word (32 bits) wide and four words deep. Thus, the processor can issue four write requests and then attend to other tasks while the cache fulfills the requests in the background.
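The effect of buffer depth can be explored with a toy timing model. The sketch below is an illustration under stated assumptions: the class name `WriteBuffer`, the fixed `drain_cycles` main-memory write latency, and the rule that buffered writes drain one at a time are all assumed here and are not taken from the example system.

```python
from collections import deque


class WriteBuffer:
    """Toy model of a FIFO write buffer in front of main memory."""

    def __init__(self, depth=4, drain_cycles=10):
        self.depth = depth                # four words deep in the example
        self.drain_cycles = drain_cycles  # ASSUMED cycles per main-memory write
        self.pending = deque()            # completion cycle of each queued write

    def write(self, now):
        """Issue a write at cycle `now`; return the cycles the processor stalls."""
        # Retire writes that main memory has already completed.
        while self.pending and self.pending[0] <= now:
            self.pending.popleft()
        stall = 0
        if len(self.pending) == self.depth:
            # Buffer full: the processor waits for the oldest write to finish.
            stall = self.pending[0] - now
            now = self.pending.popleft()
        # Writes drain serially, so this one starts after the last queued write.
        start = max(now, self.pending[-1]) if self.pending else now
        self.pending.append(start + self.drain_cycles)
        return stall
```

Under these assumptions, four back-to-back writes are absorbed without stalling and the fifth must wait, which is the behavior a deeper or shallower buffer would shift.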
The question then arises: “Is this cache design optimal for the incorporating system?” Would a larger cache provide a big enough performance advantage to justify the additional cost (financial, speed, complexity, chip space, etc.)? Would a smaller cache provide almost the same performance at a significantly lower cost? Would the cache be more effective if arranged as a two-way set-associative cache, or possibly as an eight-way set-associative cache? Should the line length be increased to eight words or even to sixteen words? Should the write buffer be shallower or deeper? Should the write buffer have a different width? (Probably not in this case; but write buffer width is an issue in systems where the processor asserts requests with different widths.)
In the event of a read miss, there are alternative policies for determining which set is to store a fetched line. Also, there are strategies that involve fetching lines even when there is no miss because a request for an address not represented in the cache is anticipated. In the event of a write hit, should the data written to cache be written immediately back to main memory, or should the write-back wait until the corresponding cache location is about to be overwritten? In the event of a write miss, should the data just be written to main memory and the cache left unchanged, or should the location written to in main memory be fetched so that it is now represented in the cache?
The rewards for cache optimization can be significant. Cache optimization, especially in application-specific computers where one program is run repeatedly, can result in significant performance enhancements. Achieving such performance enhancements by optimizing cache design as opposed to increasing processor speeds can be very cost effective. Increased processor speeds can require higher cost processors, increased power requirements, and increased problems with heat dissipation. In contrast, some cache optimizations, such as those involving rearranging a fixed cache memory size, are virtually cost free (on a post set-up per unit basis).
The challenge is to find a method of optimizing a cache design that is both effective and cost-effective. While a selection can be made as an “educated guess”, there is little assurance that the selected design is actually optimal. In competitive applications, some sort of quantitative comparative evaluation of alternative cache designs is called for.
In a multiple-prototype approach, multiple prototype systems with different cache designs are built and their performances are compared under test conditions that are essentially the same as the intended operating conditions. This multiple-prototype approach provides a very accurate comparative evaluation of the tested alternatives. However, since the costs (time and money) of a prototype system tend to be high, it is impractical to test a large number of designs this way. If only a few designs are tested, there is a high likelihood that an optimal design will not be tested, and thus not selected.
Instead of building hardware prototypes of the systems with the various caches being considered, a multiple-simulation approach develops software models of the systems with alternative cache designs. The model is typically written in a general-purpose computer language such as C or C++, or a hardware description language such as VHDL or Verilog. Such a model can accurately count clock cycles required for each operation. A software version of an intended ROM-based firmware program can be executed on these software models. The simulations then provide comparative performance data for the different cache design selections. The simulation approach tends to be much less expensive and much less time consuming than the multiple-prototype approach. Thus, this multiple-simulation approach allows more alternative cache designs to be considered for a given cost in time and money. Therefore, the set of designs tested is likely to include a more nearly optimal cache design.
On the other hand, the results of the multiple-simulation approach can be less valid than the results of the multiple-prototype approach. One problem is that the program is run in simulation many orders of magnitude slower than it is to be run in hardware in the final system. It can be difficult to simulate certain types of signal events in the slower time frame. For example, television signals can be difficult to simulate. In particular, it might be difficult for the simulation to represent the frequency with which interrupts are generated; the frequency and nature of interrupts can have a substantial effect on the comparative performance of cache designs.
The slow time frame not only causes a problem with the validity of cache performance measures, but also causes the simulations to be orders of magnitude more time consuming than the program executions on a prototype. For example, each simulation can consume several days of computer time. While this is less than the time consumed in building a prototype, it is enough to discourage testing of many alternative cache designs. This limitation makes it difficult to optimize cache design.
A cacheless-model trace-generation approach allows many cache designs to be compared in a manner that is efficient in terms of both cost and time. The trace-generation method involves building a relatively simple model of the system without a cache. The test program is run in simulation on the model. Instead of counting clock cycles, a trace is generated. The trace is a log of communications between the processor and main memory. A computer program, typically written in C, is then used to analyze this trace and determine the performance of various cache designs.
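A trace-analysis program of this kind can be sketched in a few lines. The version below is a minimal illustration, not the program referred to above: it is written in Python rather than C for brevity, it assumes the trace has already been reduced to a list of word addresses for reads, it assumes a least-recently-used replacement policy, and it reports only miss counts rather than a full performance estimate. The name `count_misses` and its parameters are hypothetical.

```python
from collections import OrderedDict


def count_misses(trace, num_lines, ways, words_per_line):
    """Replay word-address reads through a model of one cache design.

    `ways` is the associativity (the example system calls the four ways
    "sets"); `num_lines` is the total number of line locations.
    """
    rows = num_lines // ways  # line locations per way, i.e. distinct indexes
    # One LRU-ordered mapping of resident tags per index (oldest entry first).
    cache = [OrderedDict() for _ in range(rows)]
    misses = 0
    for word_address in trace:
        line_address = word_address // words_per_line
        index = line_address % rows
        tag = line_address // rows
        resident = cache[index]
        if tag in resident:
            resident.move_to_end(tag)          # hit: mark as most recently used
        else:
            misses += 1
            if len(resident) == ways:
                resident.popitem(last=False)   # evict the least recently used line
            resident[tag] = None
    return misses
```

Because the trace is captured once and replayed against each candidate, comparing designs reduces to calling `count_misses` repeatedly with different values of `num_lines`, `ways`, and `words_per_line`.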
The cacheless-model trace-generation approach does not require the building of a prototype, and the test program is run in simulation only once. Also, the model is simpler and more readily generated than models used in the multiple-simulation approach. Program execution is less time consuming than in the multiple-simulation approach since clock cycles do not need to be counted. The cache evaluation program is relatively quick, allowing many alternatives to be evaluated and compared.
The major problem with the trace-generation approach is that its results are the least accurate of the three approaches. The model used to generate the trace shares the problem of the multiple-simulation approach that the time frame of the execution of the test program is unrealistic. The trace approach suffers further since the model on which the program is executed is simpler and thus less accurate than the models (which incorporate the caches to be evaluated) used in the multiple-simulation approach.
Considered as a series, the three approaches (the cacheless-model trace-generation approach, the multiple-simulation approach, and the multiple-prototype approach) provide increasing accuracy of evaluations at increasing costs in terms of time and money. What is needed is an approach that permits a more favorable tradeoff between cost and accuracy. Such a method should allow many different cache designs to be quantitatively evaluated at a reasonable cost, but with greater accuracy than is available using the simple-model trace-generation approach.
The present invention provides a seed-cache-model trace approach that combines the simple-model trace-generation approach with either one of the multiple-prototype approach or the multiple-simulation approach. In either case, the invention provides that a model of a system including a processor design, a “seed” cache design, and a trace-detection module be constructed. In one realization of the invention, the model is a software model, as it would be in the multiple-simulation approach. In a preferred realization of the invention, the model is a hardware prototype that includes the processor, seed cache, and trace-detection module on a single integrated circuit.
A test program is executed on the model in a manner appropriate to the type of model. However, unlike the multiple-prototype approach and the multiple-simulation approach, the simulation is not used (primarily) to evaluate the seed-cache design. Instead, a trace of communications between the processor and the seed cache is captured. A program, essentially the same as used in the cacheless-model trace-generation approach, is then used to evaluate different cache designs. The seed cache is not considered primarily as a candidate cache (although it can be one of the candidates) but as a means for obtaining a more accurate trace. This allows the evaluations of caches other than the seed cache to be more accurate.
In the preferred realization of the invention, the model is a hardware model rather than a software model. Trace capture involves tapping the signal paths between the processor and the seed cache. Since it can be assumed that the cache-processor signal lines are optimized for speed (e.g., they are as short as possible) and are heavily utilized, it is problematic to transmit all the information along these signal lines to a remote trace capture module. To reduce the amount of data to be transmitted to the trace capture module, the trace data is compressed locally.
Significant compression can be achieved using several techniques. One technique takes advantage of the fact that the contents of many signals are predetermined. For example, many of the communications represent main memory addresses. Furthermore, these addresses often appear in consecutive series, so the data can be compressed, for example, by comparing each address with an expected address that is one unit higher than the previous address. Another technique takes advantage of knowledge of the contents of memory locations; for example, the contents of memory locations holding instructions are known ahead of time. Therefore, when a memory location is accessed, the compression scheme can simply affirm that the contents fulfilling the request are as expected.
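The consecutive-address technique can be illustrated with a small sketch. It is an assumption-laden illustration rather than the compression scheme of the invention: the token names `"SEQ"` and `"ADDR"` and the function names are hypothetical, and a real capture module would emit a compact bit encoding rather than Python tuples.

```python
def compress_addresses(addresses):
    """Replace each address that matches the expected address with a marker.

    The expected address is one unit higher than the previous address.
    """
    encoded = []
    expected = None
    for address in addresses:
        if address == expected:
            encoded.append(("SEQ",))           # prediction held: no address needed
        else:
            encoded.append(("ADDR", address))  # prediction failed: send the address
        expected = address + 1
    return encoded


def decompress_addresses(encoded):
    """Rebuild the original address sequence from the encoded tokens."""
    addresses = []
    for token in encoded:
        if token[0] == "SEQ":
            addresses.append(addresses[-1] + 1)
        else:
            addresses.append(token[1])
    return addresses
```

In a run of consecutive addresses, only the first address is transmitted in full; every later address in the run costs a single marker.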
A major advantage of the invention over the multiple-prototype approach and the multiple-simulation approach is that only one model is required and the application program need only be run once to evaluate many cache designs. A major advantage over the cacheless-model trace-generation approach is that the results are based on more valid traces. In the preferred hardware realization of the invention, the trace is obtained at speeds and in an environment that can be as close as desired to the target application. Thus, with one model and one run of an application program, many different cache designs can be evaluated with enhanced accuracy. These and other features and advantages of the invention are apparent from the description below with reference to the following drawings.