The present invention is related to the subject matter of the following commonly assigned, copending U.S. patent applications: Ser. No. 10/210,357 entitled “SPECULATIVE COUNTING OF PERFORMANCE EVENTS WITH REWIND COUNTER” and filed Jul. 31, 2002. The content of the above-referenced applications is incorporated herein by reference.
1. Technical Field
This invention relates to performance monitoring for a microprocessor, more particularly to monitoring memory latency, and still more particularly to monitoring memory latency for a microprocessor having a hierarchical memory system.
2. Description of the Related Art
Processors often contain several levels of memory for performance and cost reasons. Generally, memory levels closest to the processor are small and fast, while memory farther from the processor is larger and slower. The level of memory closest to the processor is the Level 1 (L1) cache, which provides a limited amount of high speed memory. The next closest level of memory to the processor is the Level 2 (L2) cache. The L2 cache is generally larger than the L1 cache, but takes longer to access than the L1 cache. The system main memory is the level of memory farthest from the processor. Accessing main memory consumes considerably more time than accessing lower levels of memory.
When a processor requests data from a memory address, the L1 cache is examined for the data. If the data is present, it is returned to the processor. Otherwise, the L2 cache is queried for the requested data. If the data is not present in the L2 cache, the L2 cache acquires the data for the requested memory address from the system main memory. As data passes from main memory to each lower level of memory, the data is stored to permit more rapid access on subsequent requests.
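The lookup sequence described above can be sketched as follows. This is a minimal illustration only, not a model of any particular processor: the class name `MemoryLevel`, the per-level latencies (2, 12, and 200 cycles), and the address `0x100` are hypothetical values chosen for the example.

```python
class MemoryLevel:
    """One level of a hypothetical memory hierarchy (L1, L2, or main memory)."""

    def __init__(self, name, latency_cycles, backing=None, contents=None):
        self.name = name
        self.latency_cycles = latency_cycles  # assumed cost of probing this level
        self.backing = backing                # next level farther from the processor
        self.contents = dict(contents or {})

    def load(self, address):
        """Return (data, total_delay_cycles, name_of_level_that_supplied_data)."""
        if address in self.contents:
            return self.contents[address], self.latency_cycles, self.name
        if self.backing is None:
            raise KeyError(address)
        # Miss: query the next level, then keep a copy for later requests.
        data, delay, source = self.backing.load(address)
        self.contents[address] = data
        return data, self.latency_cycles + delay, source


# Hypothetical latencies, for illustration only.
main_memory = MemoryLevel("MEM", 200, contents={0x100: "payload"})
l2_cache = MemoryLevel("L2", 12, backing=main_memory)
l1_cache = MemoryLevel("L1", 2, backing=l2_cache)

first = l1_cache.load(0x100)   # misses L1 and L2, supplied by main memory
second = l1_cache.load(0x100)  # supplied by L1 after the fill
```

Under these assumed latencies, the first request costs far more delay cycles than the second, because the second is satisfied by the copy stored in the L1 cache.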
Additionally, many modern microprocessors include a Performance Monitor Unit (PMU). The PMU contains one or more counters (PMCs) that accumulate the occurrence of internal events that impact or are related to the performance of a microprocessor. For example, a PMU may monitor processor cycles, instructions completed, or delay cycles executing a load from memory. These statistics are useful in optimizing the architecture of a microprocessor and the instructions executed by a microprocessor.
While a PMU may accumulate the number of delay cycles executing a load in a PMC, this value is not always useful, as the count does not indicate how much each level of memory contributed to the count. Performance engineers are often interested in the contributions to the load delay by each level of memory. Currently, there is no method of crisply, or accurately, counting the number of delay cycles attributable to a particular level of memory in a hierarchical memory system.
The method currently used to determine delay cycles while accessing a particular level of memory involves setting a threshold value. As a processor is required to search memory levels farther away, the number of delay cycles increases noticeably. If the number of delay cycles versus level of memory were plotted, there would be sharp rises in the delay cycles for each level of memory moving away from the processor. Accordingly, the present method of determining delay cycles for a particular level of memory sets a threshold value depending on the level of memory to be measured.
Typically, the system main memory is first measured with a large threshold value since accesses to main memory take longer. If a load delay exceeds the threshold, then the delay is attributed to main memory. Having a delay cycle count for main memory, the next lower level of memory (assume L2) is measured. The threshold is set accordingly and all delays exceeding the threshold are counted. The count also includes delays from accessing main memory; however, since the number of delay cycles for main memory is already approximately known, the delay cycle count for L2 is obtained by subtracting the delay cycle count for main memory from the count obtained using the threshold for L2. The process is repeated for each lower level of memory.
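The multi-pass threshold procedure can be sketched as follows. This is an illustration under stated assumptions, not the behavior of any actual PMU: the function name `attribute_by_threshold`, the thresholds (100 and 10 cycles), and the sample delays are hypothetical, and the sketch assumes that the count for each pass is the sum of the delay cycles of all loads exceeding the threshold.

```python
def attribute_by_threshold(load_delays, thresholds):
    """Estimate delay cycles per memory level using the threshold method.

    thresholds is a list of (level_name, threshold) pairs ordered from the
    level farthest from the processor to the level nearest to it.
    """
    attributed = {}
    already_counted = 0
    for level, threshold in thresholds:
        # One measurement pass per level: count every delay above the threshold.
        total = sum(delay for delay in load_delays if delay > threshold)
        # Subtract what was already attributed to the farther levels.
        attributed[level] = total - already_counted
        already_counted = total
    return attributed


# Hypothetical observed load delays, in cycles.
delays = [2, 3, 14, 15, 210, 220]
result = attribute_by_threshold(delays, [("MEM", 100), ("L2", 10)])
```

With these sample values, the pass for main memory counts 430 cycles, the pass for L2 counts 459 cycles, and the subtraction leaves 29 cycles attributed to L2. Each additional level requires another pass over the workload.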
The problem with using a threshold to measure memory latency in a hierarchical memory system is that it does not accurately determine the delay for each level of memory and requires several passes to determine the delay cycle counts for lower levels of memory. Under certain circumstances, a memory access to a lower level of memory may exceed the threshold for a higher level of memory, which would result in the delay being attributed to the incorrect level of memory.
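The misattribution can be illustrated with a small numeric sketch. All values here are hypothetical: the threshold of 100 cycles, the sample loads, and in particular the L2 access that takes 120 cycles are assumptions made only to show the effect, and the true source of each load is recorded in the sample data solely so that the error can be computed.

```python
MEM_THRESHOLD = 100  # hypothetical threshold used to detect main memory accesses

# (level that actually supplied the data, observed delay in cycles)
loads = [
    ("L2", 14),    # ordinary L2 access
    ("L2", 120),   # assumed slow L2 access that exceeds the main memory threshold
    ("MEM", 210),  # ordinary main memory access
]

# Delay cycles the threshold method attributes to main memory.
threshold_mem = sum(delay for _, delay in loads if delay > MEM_THRESHOLD)

# Delay cycles that main memory actually contributed.
true_mem = sum(delay for source, delay in loads if source == "MEM")

misattributed = threshold_mem - true_mem
```

In this sample, the threshold method attributes 330 delay cycles to main memory although only 210 originated there; the 120 cycles of the slow L2 access are charged to the wrong level.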
Therefore, there is a need for a new and improved method for accurately counting the number of delay cycles attributable to a particular level of memory in a hierarchical memory system.