1. Field of the Invention
The present invention relates to techniques for measuring performance within a computer system. More specifically, the present invention relates to a method and an apparatus for performing software sampling on a microprocessor cache within a computer system while the computer system is operating.
2. Related Art
As microprocessor clock speeds continue to increase at an exponential rate, processor performance is becoming increasingly constrained by the delays involved in transferring instructions and data between memory and computational circuitry within the processor core. In order to alleviate this problem, copies of instructions and data items that are likely to be referenced are stored in local cache memories within the microprocessor chip. This allows the microprocessor to access the instructions and data items from the local cache memories, without the significant delay involved in accessing an off-chip main memory.
In order to optimize the performance of these microprocessor caches, it is necessary to measure the dynamic behavior of applications on these microprocessor caches. If this dynamic behavior can be accurately measured, the application developer (or the developer of an associated compiler) can modify the memory layout of the application to optimize the cache performance of the application. Alternatively, the microprocessor designer can adjust the cache structure, the cache size, or the cache replacement policy to optimize cache performance.
A number of techniques are presently being used to monitor cache performance. A hardware analyzer can monitor signal lines in the computer system, and can thereby determine cache performance within the computer system. Unfortunately, a hardware analyzer cannot monitor internal signal lines within the microprocessor chip. It can only monitor signals that are available on I/O pins of the microprocessor chip. Hence, a hardware analyzer is largely unable to monitor the dynamic behavior of on-chip microprocessor caches. Moreover, because of the tremendous clock speeds of modern microprocessors and because of memory limitations within the hardware analyzers, hardware analyzers are typically only able to record a few seconds' worth of performance data.
Hardware counters that count cache misses can be incorporated into microprocessor caches. However, these hardware counters merely provide a cache miss rate, and do not indicate the cause of a cache miss.
Some diagnostic programs can determine instruction and data reference patterns for an application by performing trap operations for each instruction the application executes. During these trap operations, program counters and other information can be recorded to determine instruction and data reference patterns, and these reference patterns can be used to determine the dynamic behavior of the application on the microprocessor caches. Unfortunately, this technique is hundreds of times slower than normal execution of the application. Furthermore, this technique cannot be used to monitor system calls and other kernel operations associated with the application. This is a problem because many cache performance problems arise from interactions between the user application and the operating system, and these interactions cannot be detected through these diagnostic programs.
It is also possible to perform software sampling on a microprocessor cache. However, existing techniques for software sampling produce invalid results because the application performing the sampling displaces the application being measured from the microprocessor cache. Hence, the application performing the sampling measures itself rather than the application of interest.
Hence, what is needed is a method and an apparatus for measuring the dynamic behavior of applications on microprocessor caches without the problems of the existing techniques described above.
One embodiment of the present invention provides a system that facilitates sampling a cache in a computer system, wherein the computer system has multiple central processing units (CPUs), including a measured CPU containing the cache to be sampled, and a sampling CPU that gathers the sample. During operation, the measured CPU receives an interrupt generated by the sampling CPU, wherein the interrupt identifies a portion of the cache to be sampled. In response to receiving this interrupt, the measured CPU copies data from the identified portion of the cache into a shared memory buffer that is accessible by both the measured CPU and the sampling CPU. Next, the measured CPU notifies the sampling CPU that the shared memory buffer contains the data, thereby allowing the sampling CPU to gather and process the data.
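The interaction between the two CPUs described above can be illustrated with a small software model. The following is a minimal sketch only, assuming that a thread stands in for the measured CPU, a queue stands in for the cross-CPU interrupt, and an event stands in for the notification; the names `MeasuredCPU`, `gather_sample`, `interrupts`, and `buffer_ready` are hypothetical and do not appear in the embodiment itself.

```python
import queue
import threading


class MeasuredCPU:
    """Toy model of the measured CPU (hypothetical; not real hardware access)."""

    def __init__(self, cache_lines, shared_buffer):
        self.cache_lines = cache_lines            # stands in for the cache contents
        self.shared_buffer = shared_buffer        # visible to both CPUs
        self.interrupts = queue.Queue()           # stands in for the interrupt line
        self.buffer_ready = threading.Event()     # notification to the sampling CPU

    def run(self):
        """Service sampling interrupts until a None sentinel arrives."""
        while True:
            request = self.interrupts.get()
            if request is None:                   # shutdown sentinel (an assumption)
                return
            start, count = request                # the interrupt identifies the portion
            # Copy the identified portion of the cache into the shared buffer.
            self.shared_buffer[:] = self.cache_lines[start:start + count]
            # Notify the sampling CPU that the buffer contains the data.
            self.buffer_ready.set()


def gather_sample(measured, start, count, timeout=5.0):
    """Sampling-CPU side: raise the interrupt, wait, then read the buffer."""
    measured.buffer_ready.clear()
    measured.interrupts.put((start, count))
    if not measured.buffer_ready.wait(timeout):
        raise TimeoutError("measured CPU did not respond")
    return list(measured.shared_buffer)
```

In this model the sampling side never touches `cache_lines` directly; it reads only the shared buffer, which mirrors the way the sampling CPU avoids displacing the measured application from the cache being sampled.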
In a variation on this embodiment, copying the data from the identified portion of the cache into the shared memory buffer involves saving the data from the identified portion of the cache into one or more registers within the measured CPU, and then storing the data from the one or more registers into the shared memory buffer.
In a further variation, storing the data from the one or more registers into the shared memory buffer involves bypassing a data cache within the measured CPU and storing the data directly into the shared memory buffer.
In a further variation, the one or more registers in the measured CPU are floating point registers. In this variation, prior to saving the data from the identified portion of the cache into the one or more registers, the measured CPU saves existing contents of the one or more registers. After the data is stored from the one or more registers into the shared memory buffer, the measured CPU restores the existing contents of the one or more registers.
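The register-staged copy and the save and restore of register contents can be sketched as follows. This is an illustrative model under stated assumptions: Python lists stand in for the floating point registers, the cache portion, and the shared memory buffer, and the function name `copy_via_registers` is hypothetical. The cache-bypassing store of the earlier variation is a hardware behavior and is only noted in a comment here.

```python
def copy_via_registers(registers, cache_portion, shared_buffer):
    """Stage cache data through registers, preserving their prior contents."""
    if not registers:
        raise ValueError("at least one register is required")
    saved = list(registers)                       # save existing register contents
    try:
        width = len(registers)
        for offset in range(0, len(cache_portion), width):
            chunk = cache_portion[offset:offset + width]
            registers[:len(chunk)] = chunk        # cache -> registers
            # registers -> shared buffer; real hardware would use a store
            # that bypasses the data cache (not modeled here)
            shared_buffer.extend(registers[:len(chunk)])
    finally:
        registers[:] = saved                      # restore existing contents
```

The `try`/`finally` form is a design choice of this sketch: it ensures that the registers are restored even if the copy fails partway, so the sampled application sees its register state unchanged.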
In a further variation, prior to saving the data from the identified portion of the cache into the one or more registers, the measured CPU suspends a sampled application running on the measured CPU, and then saves the state of the sampled application into storage within the measured CPU. After the data is stored from the one or more registers into the shared memory buffer, the measured CPU restores the state of the sampled application from the storage within the measured CPU, and then resumes execution of the sampled application on the measured CPU.
In a variation on this embodiment, the data from the identified portion of the cache includes cache tag information associated with specified lines within the cache. Moreover, this cache tag information contains address and ownership information for the specified lines within the cache.
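For illustration, the sampling CPU might decode a raw tag word into address and ownership fields along the following lines. The bit layout used here (two low-order bits of ownership state, with the address tag in the remaining bits) and the four state names are assumptions made purely for this sketch; the actual tag format depends on the microprocessor and is not specified above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Ownership(IntEnum):
    """Hypothetical ownership states; real encodings are processor-specific."""
    INVALID = 0
    SHARED = 1
    EXCLUSIVE = 2
    MODIFIED = 3


@dataclass(frozen=True)
class CacheTag:
    line_index: int        # which cache line the tag belongs to
    address_tag: int       # address bits stored in the tag
    ownership: Ownership   # ownership state of the line


def decode_tag(line_index, raw):
    """Split a raw tag word using the assumed layout: [address tag | 2-bit state]."""
    return CacheTag(line_index, raw >> 2, Ownership(raw & 0b11))
```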
In a variation on this embodiment, the cache to be sampled in the measured CPU can include: an instruction cache, a data cache, a level-two (L2) cache, a prefetch cache, a write cache, an instruction translation lookaside buffer (TLB), a data TLB, and a branch prediction table.
In a variation on this embodiment, there exists a different interrupt handling routine for each different cache that can be sampled within the measured CPU. Furthermore, the interrupt identifies a specific cache to be sampled within the measured CPU.
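The selection of a cache-specific interrupt handling routine can be sketched as a dispatch table keyed by the cache that the interrupt identifies. This is a minimal sketch assuming only two caches and a dictionary standing in for the state of the measured CPU; the names `CacheKind`, `HANDLERS`, `dispatch_interrupt`, and the per-cache routines are hypothetical.

```python
from enum import Enum


class CacheKind(Enum):
    """Subset of sampleable caches, chosen for illustration."""
    INSTRUCTION = "instruction cache"
    DATA = "data cache"


def sample_instruction_cache(cpu_state, start, count):
    # Routine specific to the instruction cache (modeled as a list slice).
    return cpu_state["icache"][start:start + count]


def sample_data_cache(cpu_state, start, count):
    # Routine specific to the data cache (modeled as a list slice).
    return cpu_state["dcache"][start:start + count]


# One interrupt handling routine per sampleable cache.
HANDLERS = {
    CacheKind.INSTRUCTION: sample_instruction_cache,
    CacheKind.DATA: sample_data_cache,
}


def dispatch_interrupt(cpu_state, kind, start, count):
    """Run the routine for the cache that the interrupt identifies."""
    handler = HANDLERS.get(kind)
    if handler is None:
        raise ValueError(f"no handler registered for {kind!r}")
    return handler(cpu_state, start, count)
```

Supporting an additional cache in this model amounts to registering one more entry in `HANDLERS`.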