Generally, a microprocessor operates much faster than main memory can supply data to it. Therefore, many computer systems temporarily store recently and frequently used data in a smaller but much faster cache memory. There may be several levels of cache, e.g., level one (L1), level two (L2), level three (L3), etc. L1 cache is typically closest to the processor, smallest in size, and fastest in access time. Typically, as the level of the cache increases (e.g., from L1 to L2 to L3), the cache is further from the microprocessor, larger in size, slower in access time, and supports more microprocessors.
Referring to FIG. 1 and FIG. 2, a typical parallel processing computer system includes boards (20A–20N) connected to an L3 cache (22), i.e., an external cache memory. Each board (20A–20N) has, among other things, chips (14A–14N) and an L2 cache (18), i.e., an on-board cache. The L3 cache (22) is connected to a main memory (24). Main memory (24) holds data and program instructions to be executed by the microprocessors (8A–8N). The chips (14A–14N) include microprocessors (8A–8N) that are associated with L1 cache (12A–12N), i.e., an on-chip cache memory. Virtual microprocessors are considered threads or placeholders for current processes associated with the microprocessors (8A–8N). In addition to the microprocessors (8A–8N), virtual microprocessors (not depicted) on the microprocessors (8A–8N) may also use the L1 cache (12A–12N). One skilled in the art can appreciate that an L1 cache may be associated with multiple virtual microprocessors rather than accessed by a single microprocessor as depicted in FIG. 2.
Program instructions, which are usually stored in main memory, are physical operations given by a program to the microprocessor (8A–8N), e.g., specifying a register or referencing the location of data in cache memory (either L1 cache (12A–12N), L2 cache (18), or L3 cache (22)). A sequence of program instructions linked together is known as a trace. The program instructions are executed by the microprocessor (8A–8N). Upon command from one of the microprocessors (8A–8N), a memory controller searches for the data, such as a program instruction, first in the L1 cache (12A–12N). If the data is not found in the L1 cache (12A–12N), the memory controller next searches the L2 cache (18). If the data is not found in the L2 cache (18), the memory controller then searches the L3 cache (22). If the data is not found in the L3 cache (22), the memory controller finally searches the main memory (24). Once the data is found, the memory controller returns the data to the microprocessor that issued the command. If the data is not found, an error message is returned to the microprocessor that issued the command.
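The search order described above can be sketched in a few lines of Python. This is only an illustrative model: the function name lookup, the dictionary representation of each level, and the sample contents are assumptions made for the sketch and do not appear in the figures.

```python
from typing import Dict, List, Optional, Tuple

def lookup(address: int,
           levels: List[Tuple[str, Dict[int, str]]]) -> Tuple[Optional[str], Optional[str]]:
    """Search each level in order and return (level_name, data).

    Returns (None, None) when no level holds the address, which stands in
    for the error message returned to the requesting microprocessor.
    """
    for name, store in levels:
        if address in store:
            return name, store[address]
    return None, None

# Hypothetical contents, for illustration only.
l1 = {0x10: "a"}
l2 = {0x10: "a", 0x20: "b"}
l3 = {0x10: "a", 0x20: "b", 0x30: "c"}
main_memory = {0x10: "a", 0x20: "b", 0x30: "c", 0x40: "d"}

# Order matters: fastest and smallest level first, main memory last.
hierarchy = [("L1", l1), ("L2", l2), ("L3", l3), ("main memory", main_memory)]

for addr in (0x10, 0x30, 0x40, 0x50):
    print(hex(addr), lookup(addr, hierarchy))
```

The sketch ignores what a real hierarchy would do after a miss, such as filling the faster levels with the block that was found.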
Those skilled in the art will appreciate that the architecture of the cache may be structured in a variety of ways, e.g., architectural components such as cache hierarchy, cache size, cache line size, cache associativity, cache sharing, and cache write type may each be designed in a number of ways.
Cache hierarchy refers to the different levels of memory, i.e., L1 cache, L2 cache, etc., that take advantage of the “principle of locality.” The “principle of locality” asserts that most programs do not access data uniformly. Thus, the cache hierarchy may be designed using different types of cache memories (i.e., faster, more expensive cache memory or slower, less expensive cache memory) in conjunction with the “principle of locality” to improve computer system performance. As mentioned above, L1 cache is typically located on the same chip as the microprocessor while, in contrast, L2 cache is typically located on the same board as the microprocessor. Further, L2 cache is typically larger in size and has a slower access time than L1 cache.
Cache size refers to the total size of the cache memory. The cache memory is configured to store data in discrete blocks in the cache memory. A block is the minimum unit of information within each level of cache. The size of the block is referred to as the cache line size. The manner in which data is stored in the blocks is referred to as cache associativity. Cache memories typically use one of the following types of cache associativity: direct mapped (one-to-one), fully associative (one to all), or set associative (one to set).
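As an illustration of how cache size, cache line size, and associativity interact, the following sketch splits a byte address into a tag, a set index, and an offset within the block. The function name split_address and the example sizes are assumptions chosen for the sketch; it further assumes the cache size is an exact multiple of the line size times the number of ways.

```python
def split_address(address: int, cache_size: int, line_size: int, ways: int):
    """Return (tag, set_index, offset) for a byte address.

    ways == 1 models a direct mapped cache, ways == number of lines models
    a fully associative cache, and anything in between is set associative.
    """
    num_lines = cache_size // line_size   # total blocks in the cache
    num_sets = num_lines // ways          # blocks are grouped into sets
    offset = address % line_size          # byte position inside the block
    block = address // line_size          # block number in memory
    set_index = block % num_sets          # the set this block must go to
    tag = block // num_sets               # identifies the block within its set
    return tag, set_index, offset

CACHE_SIZE = 32 * 1024   # 32 KiB, hypothetical
LINE_SIZE = 64           # 64-byte cache line, hypothetical

print(split_address(0x12345, CACHE_SIZE, LINE_SIZE, ways=1))    # direct mapped
print(split_address(0x12345, CACHE_SIZE, LINE_SIZE, ways=4))    # 4-way set associative
print(split_address(0x12345, CACHE_SIZE, LINE_SIZE, ways=512))  # fully associative
```

In the direct mapped case each block has exactly one possible location, in the fully associative case there is a single set and the block may go anywhere, and in the set associative case the block may occupy any way of one set.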
Cache sharing refers to the manner in which data in the blocks are shared. Specifically, L1 cache sharing is the number of processors (physical or virtual) sharing one L1 cache; L2 cache sharing is the number of L1 caches sharing one L2 cache; L3 cache sharing is the number of L2 caches sharing one L3 cache; etc. Most program instructions involve accessing (reading) data stored in the cache memory; therefore, the cache associativity, cache sharing, cache size, and cache line size are particularly significant to the cache architecture.
Likewise, writing to the cache memory (cache write type) is also critical to the cache architecture, because writing is generally very expensive in terms of processing time. Cache memory generally uses one of the following methods when writing data to the cache memory: “write through, no-write allocate” or “write back, write allocate.”
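The difference between the two write methods can be illustrated with a toy model that counts how many writes reach the next level of memory. The class names, the one-word line granularity, and the explicit evict call are simplifying assumptions for this sketch; a real cache would also fetch the rest of the block on allocation.

```python
class WriteThroughNoAllocate:
    """Every write is forwarded; a write miss does not bring the line in."""
    def __init__(self):
        self.lines = {}          # address -> cached value
        self.memory_writes = 0   # writes that reached the next level

    def write(self, address, value, memory):
        if address in self.lines:      # write hit: keep the cached copy current
            self.lines[address] = value
        memory[address] = value        # always write through to the next level
        self.memory_writes += 1

class WriteBackAllocate:
    """Writes stay in the cache until the dirty line is evicted."""
    def __init__(self):
        self.lines = {}
        self.dirty = set()
        self.memory_writes = 0

    def write(self, address, value, memory):
        self.lines[address] = value    # write miss allocates the line
        self.dirty.add(address)        # next level is now stale

    def evict(self, address, memory):
        if address in self.dirty:      # only dirty lines are written back
            memory[address] = self.lines[address]
            self.memory_writes += 1
            self.dirty.discard(address)
        self.lines.pop(address, None)

wt, wb = WriteThroughNoAllocate(), WriteBackAllocate()
mem_wt, mem_wb = {}, {}
for i in range(100):                   # repeated writes to one address
    wt.write(0x40, i, mem_wt)
    wb.write(0x40, i, mem_wb)
wb.evict(0x40, mem_wb)
print(wt.memory_writes, wb.memory_writes)
```

Under these assumptions, repeated writes to one address reach the next level once per write with write through, but only once at eviction with write back.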
In a parallel processing computer system, the issue of cache coherency is raised when writing to the cache memory that is shared by many processors.
Cache coherency protocols resolve conflicts that arise when multiple processors access and write (or change) the value of variables stored in the same cache memory. The following protocols are typically used to resolve cache coherency issues: Modified Shared Invalid (MSI), Modified Exclusive Shared Invalid (MESI), Modified Owned Shared Invalid (MOSI), Modified Owned Exclusive Shared Invalid (MOESI), etc. One skilled in the art will appreciate the particular aspects of these protocols and that other protocols can be used to resolve cache coherency issues.
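As a rough illustration of the simplest of these protocols, the sketch below encodes a textbook-style MSI state table for a single cache line and applies it to two caches. The event names, the access helper, and the cache labels A and B are assumptions for the sketch; bus transactions and write backs of modified data are deliberately left out.

```python
# Simplified MSI transitions for one cache line, as seen by one cache.
# "local" events come from the cache's own processor, "remote" events are
# accesses by another processor observed by this cache.
MSI = {
    ("I", "local_read"): "S",
    ("I", "local_write"): "M",
    ("I", "remote_read"): "I",
    ("I", "remote_write"): "I",
    ("S", "local_read"): "S",
    ("S", "local_write"): "M",
    ("S", "remote_read"): "S",
    ("S", "remote_write"): "I",   # another writer invalidates our copy
    ("M", "local_read"): "M",
    ("M", "local_write"): "M",
    ("M", "remote_read"): "S",    # a real cache would also supply or write back the data
    ("M", "remote_write"): "I",
}

def access(states, actor, op):
    """Apply a read or write by `actor`; return the new state of every cache."""
    return {
        cache: MSI[(state, ("local_" if cache == actor else "remote_") + op)]
        for cache, state in states.items()
    }

states = {"A": "I", "B": "I"}
for actor, op in [("A", "read"), ("B", "read"), ("A", "write"), ("B", "read")]:
    states = access(states, actor, op)
    print(actor, op, states)
```

The table guarantees that at most one cache holds the line in the Modified state at a time, which is the property that resolves conflicting writes.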
The performance of the cache architecture is evaluated using a variety of parameters, including a miss rate, a hit rate, an instruction count, an average memory access time, etc. The miss rate is the fraction of all memory accesses that are not satisfied by the cache memory. There are a variety of miss rates, e.g., intervention, clean, total, “write back”, cast out, upgrade, etc. In contrast, the hit rate is the fraction of all memory accesses that are satisfied by the cache memory. The instruction count is the number of instructions processed in a particular amount of time. The average memory access time is the amount of time on average that is required to access data in a block of the cache memory.
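The relationship among these parameters can be made concrete with a short calculation. The hit rate and miss rate follow directly from the definitions above; the formula used for average memory access time (hit time plus miss rate times miss penalty) is a commonly used textbook formulation that is assumed here rather than taken from this description, and all numeric values are hypothetical.

```python
def hit_rate(hits: int, accesses: int) -> float:
    """Fraction of memory accesses satisfied by the cache."""
    return hits / accesses

def miss_rate(hits: int, accesses: int) -> float:
    """Fraction of memory accesses not satisfied by the cache."""
    return (accesses - hits) / accesses

def average_memory_access_time(hit_time: float, miss_rate_value: float,
                               miss_penalty: float) -> float:
    """Assumed textbook formula: hit time + miss rate * miss penalty."""
    return hit_time + miss_rate_value * miss_penalty

# Hypothetical counts and latencies (in cycles), for illustration only.
accesses, hits = 1000, 950
mr = miss_rate(hits, accesses)
print("hit rate :", hit_rate(hits, accesses))
print("miss rate:", mr)
print("AMAT     :", average_memory_access_time(1.0, mr, 100.0))
```

Because the hit rate and miss rate always sum to one, reporting either of them is sufficient.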
Simulation is a useful tool in determining the performance of the cache architecture. Given workload traces (i.e., a set of traces which are executed by the microprocessors that emulate sets of typical instructions) and the cache architecture, the performance, e.g., hit/miss rates, of that cache architecture may be simulated. For a given set of cache architectural components, including a range of possible values for each cache architectural component, the number of permutations required to fully simulate the cache architecture is very large. Simulation is also subject to additional constraints. For example, a trace characterizing each level of the number of processors of interest is required. However, some traces may be absent, or short traces that provide realistic scenarios may not sufficiently “warm up” large cache sizes, i.e., a trace may not be long enough for the simulation to reach steady-state cache rates. Uncertainty in benchmark tuning is another constraint. Additionally, in the interest of time and cost, typically only a small sample set of cache architectures is simulated.
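A minimal trace-driven simulation, together with the growth in the number of permutations, might look as follows. The simulator is a sketch under stated assumptions: a single cache level, least-recently-used replacement, a synthetic trace that scans a 4 KiB array, and hypothetical parameter ranges; none of these choices come from this description.

```python
from collections import OrderedDict
from itertools import product

def simulate(trace, cache_size, line_size, ways):
    """Return the miss rate of one LRU set associative cache over a trace."""
    num_sets = cache_size // (line_size * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for address in trace:
        block = address // line_size
        lru = sets[block % num_sets]
        tag = block // num_sets
        if tag in lru:
            lru.move_to_end(tag)          # hit: mark as most recently used
        else:
            misses += 1
            lru[tag] = True               # miss: bring the block in
            if len(lru) > ways:
                lru.popitem(last=False)   # evict the least recently used block
    return misses / len(trace)

# Synthetic trace: four passes over a 4 KiB array in 8-byte steps.
one_pass = [i * 8 for i in range(512)]
trace = one_pass * 4

# Hypothetical ranges for three architectural components.
sizes, line_sizes, ways_options = [1024, 2048, 4096, 8192], [32, 64], [1, 2, 4]
configs = list(product(sizes, line_sizes, ways_options))
print(len(configs), "configurations for only three components")

for label, t in (("short trace", one_pass), ("long trace", trace)):
    print(label, simulate(t, 1024, 64, 1), simulate(t, 8192, 64, 1))
```

With the single-pass trace the 1 KiB and 8 KiB caches report the same miss rate, because every miss is a first-touch miss; only the longer trace lets the larger cache warm up and show its lower steady-state miss rate.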
Once the simulation is performed on the small sample set of the cache architecture, statistical analysis is used to estimate the performance of the cache architectures that are not simulated. The quality of the statistical analysis relies on the degree to which the sample sets are representative of the sample space, i.e., permutations for a given set of cache architectural components. Sample sets are generated using probabilistic and non-probabilistic methods. Inferential statistics along with data obtained from the sample set are then used to model the sample space for the given architectural components. Models are typically used to extrapolate using the data obtained from the sample set. The models used are typically univariate or multivariate in nature. A univariate model is an analysis of a single variable and is generally useful to describe relevant aspects of data. A multivariate model is an analysis of one variable contingent on the values of other variables. Further, the models used to fit the data of the sample set may be smoothed models obtained using a plurality of algorithms.
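The sampling and modeling steps can be sketched as follows, assuming a probabilistic (random) sample and a univariate straight-line model of miss rate against the base-2 logarithm of cache size. The function stand_in_simulation is a hypothetical placeholder that returns synthetic values rather than measured data, and the ordinary least squares fit is one possible model choice, not one prescribed by this description.

```python
import math
import random

def stand_in_simulation(size_kib: int) -> float:
    """Hypothetical stand-in for a real simulator; NOT measured data."""
    return 0.20 - 0.02 * math.log2(size_kib)

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one variable)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Sample space: cache sizes from 1 KiB to 512 KiB (hypothetical range).
sample_space = [2 ** k for k in range(10)]

# Probabilistic sample set: four sizes drawn at random with a fixed seed.
sample_set = random.Random(0).sample(sample_space, 4)

xs = [math.log2(size) for size in sample_set]
ys = [stand_in_simulation(size) for size in sample_set]
slope, intercept = fit_line(xs, ys)

# Extrapolate to a size outside the sample space (1024 KiB).
predicted = intercept + slope * math.log2(1024)
print(sample_set, slope, intercept, predicted)
```

A multivariate model would extend the same idea by fitting miss rate against several architectural components at once, e.g., cache size together with associativity.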
Designers, computer architects, etc. rely on simulation and analytical tools that use the methods of statistical analysis previously mentioned to characterize or optimize the performance of cache architectures for a given workload. The method typically used by computer architects, designers, etc. is shown in FIG. 3.
An experimenter, e.g., a designer, computer architect, etc., designs the cache architecture experiment (Step 21), e.g., by determining the sample space of the cache architecture of interest, including the range of values of the architectural components involved in the cache architecture. Then, a sample set is determined ad hoc (Step 23). The sample set consists of particular cache architectures within the defined range, chosen subjectively by the experimenter. Next, the sample set of cache architectures is simulated (Step 25) and simulation data is generated. The simulation data is modeled using a univariate model (Step 27), i.e., using single variable analysis. The output of the univariate model can be used in a system model. The results from the system model may be used to generate graphs, tables, charts, etc. to formulate analysis of the cache architecture for a particular range and a particular workload.