In typical computer systems utilizing processors, system developers desire optimization of execution software for more effective system design. Usually, studies of a program's access patterns to memory and interaction with a system's memory hierarchy are performed to determine system efficiency. Understanding the memory hierarchy behavior aids in developing algorithms that schedule and/or partition tasks, as well as distribute and structure data for optimizing the system.
Performance monitoring is often used in optimizing the use of software in a system. A performance monitor is generally regarded as a facility incorporated into a processor to monitor selected characteristics to assist in the debugging and analyzing of systems by determining a machine's state at a particular point in time. Often, the performance monitor produces information relating to the utilization of a processor's instruction execution and storage control. For example, the performance monitor can be utilized to provide information regarding the amount of time that has passed between events in a processing system. The information produced usually guides system architects toward ways of enhancing performance of a given system or of developing improvements in the design of a new system.
Prior art approaches to performance monitoring include the use of test instruments. Unfortunately, this approach is not completely satisfactory. Test instruments can be attached to the external processor interface, but these cannot determine the nature of internal operations of a processor, nor can they distinguish between instructions executing in the processor. Test instruments designed to probe the internal components of a processor are typically considered prohibitively expensive because of the difficulty associated with monitoring the many busses and probe points of complex processor systems that employ pipelines, instruction prefetching, data buffering, and more than one level of memory hierarchy within the processors. A common approach for providing performance data is to change or instrument the software. This approach, however, significantly affects the path of execution and may invalidate any results collected. Consequently, software-accessible counters are incorporated into processors. Most software-accessible counters, however, are limited in the granularity of the information they provide.
Further, a conventional performance monitor is usually unable to capture machine state data until an interrupt is signaled, so that results may be biased toward certain machine conditions that are present when the processor allows interrupts to be serviced. Also, interrupt handlers may cancel some instruction execution in a processing system where, typically, several instructions are in progress at one time. Further, many interdependencies exist in a processing system, so that in order to obtain any meaningful data and profile, the state of the processing system must be obtained at the same time across all system elements. Accordingly, control of the sample rate is important because this control allows the processing system to capture the appropriate state. It is also important that the effect that the previous sample has on the sample being monitored is negligible to ensure the performance monitor does not affect the performance of the processor. Accordingly, there exists a need for a system and method for effectively monitoring processing system performance that will efficiently and noninvasively identify potential areas for improvement. A more effective performance monitoring system has been disclosed in the cross-referenced applications noted above.
However, these systems are not wholly sufficient for all purposes and hence may be expanded upon in a way that assists architects and implementers in improving computer system performance through better understanding of the effect of the memory hierarchy on the performance of the processor in question.
Consider the linear performance model (or just linear model) that is commonly used to evaluate and compare the performance of central processing units (CPUs). The equation is usually stated as follows:

CPI_finite = CPI_infinite + DC_miss_ratio * DC_miss_penalty + IC_miss_ratio * IC_miss_penalty
The following serves to define the six factors in the above equation:
CPI_finite = cycles per instruction of a given implementation when executing a particular workload
CPI_infinite = the minimum cycles per instruction required on average to execute a given workload when the closest level of the memory hierarchy (typically the primary (L1) caches) always has the needed information
DC_miss_ratio = number of L1 data cache misses per instruction on average
IC_miss_ratio = number of L1 instruction cache misses per instruction on average
DC_miss_penalty = average number of cycles per L1 data cache miss
IC_miss_penalty = average number of cycles per L1 instruction cache miss
These six factors, specifically CPI_finite, CPI_infinite, DC_miss_ratio, IC_miss_ratio, DC_miss_penalty, and IC_miss_penalty, shall be referred to as the CPU performance signature parameters, or for brevity, simply as the parameters or factors.
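As a minimal sketch of how the linear model above is evaluated, the following Python function computes CPI_finite from the other five parameters. The function name and the sample values are hypothetical and chosen only for illustration; they are not measurements of any particular processor.

```python
def cpi_finite(cpi_infinite, dc_miss_ratio, dc_miss_penalty,
               ic_miss_ratio, ic_miss_penalty):
    """Linear model: infinite-cache CPI plus the cycles per
    instruction attributed to L1 data and instruction cache misses."""
    return (cpi_infinite
            + dc_miss_ratio * dc_miss_penalty
            + ic_miss_ratio * ic_miss_penalty)


# Hypothetical inputs: 0.02 data misses and 0.01 instruction misses
# per instruction, each costing 10 cycles on average.
example_cpi = cpi_finite(1.0, 0.02, 10.0, 0.01, 10.0)
print(f"CPI_finite is approximately {example_cpi:.2f}")
```

Each miss term is a product of misses per instruction and cycles per miss, so every term on the right-hand side is expressed in cycles per instruction.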
Clearly, any five of these factors will serve to define all six (i.e., if only one factor is not known, the known five will allow for the determination of the unknown sixth factor). In standard practice one desires to determine via measurement all of these factors except for CPI_infinite which is calculated. It is also possible to describe subsequent levels of cache or memory hierarchy (L2 (secondary), L3, or memory, disk, etc.). To simplify the discussion, these will not be considered, but a straightforward modification of the equation provides for these. For example:

CPI_finite = CPI_infinite + (L1_DC_miss_ratio - L2_DC_hit_ratio) * L1_DC_miss_penalty + (L1_IC_miss_ratio - L2_IC_hit_ratio) * L1_IC_miss_penalty + L2_DC_miss_ratio * L2_DC_miss_penalty + L2_IC_miss_ratio * L2_IC_miss_penalty
In this case, there is the additional detail of the activity of the external cache (sometimes referred to as the L2 cache). This discussion will not consider this additional detail, though it is valid and meaningful to do so. In the remainder of this disclosure, the discussion will be restricted to the examination of the influence of L1 caches only, but it is understood that this discussion applies to any level of memory hierarchy using suitable extensions.
The usual approach in using the linear model is to determine the factors for a given workload and then consider hardware/software modifications to these factors to understand the effect on the CPI. In particular, CPI_infinite is an estimate of the best case performance of the CPU with an ideal (though possibly very expensive) storage hierarchy and is an important characteristic of the CPU and workload of interest (measurement shows that the behavior of the workload and the CPU cannot be separated in any meaningful manner). For example, one supposes that a different memory subsystem design can reduce the storage access times by some amount. This change in the memory subsystem design will be reflected in the net delays for the various cache miss penalties. Thus, one can recompute the CPI_finite based on the different memory system design.
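The procedure just described can be sketched in Python as follows: CPI_infinite is first derived from five measured factors, and CPI_finite is then recomputed with the miss penalties of a hypothesized memory subsystem. The function names and all numeric values are assumptions made for illustration, and the sketch presumes that the miss ratios and CPI_infinite are unchanged by the new memory design.

```python
def derive_cpi_infinite(cpi_finite, dc_ratio, dc_penalty, ic_ratio, ic_penalty):
    """Solve the linear model for CPI_infinite given the other five factors."""
    return cpi_finite - dc_ratio * dc_penalty - ic_ratio * ic_penalty


def project_cpi_finite(cpi_infinite, dc_ratio, dc_penalty, ic_ratio, ic_penalty):
    """Evaluate the linear model for a (possibly hypothetical) memory design."""
    return cpi_infinite + dc_ratio * dc_penalty + ic_ratio * ic_penalty


# Hypothetical measurements on the existing system.
measured = dict(cpi_finite=1.5, dc_ratio=0.03, dc_penalty=12.0,
                ic_ratio=0.01, ic_penalty=14.0)
cpi_inf = derive_cpi_infinite(**measured)

# Hypothetical faster memory subsystem: same miss ratios, lower penalties.
projected = project_cpi_finite(cpi_inf, dc_ratio=0.03, dc_penalty=8.0,
                               ic_ratio=0.01, ic_penalty=10.0)
print(f"CPI_infinite ~ {cpi_inf:.2f}, projected CPI_finite ~ {projected:.2f}")
```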
The rate of progress of the workload on a system depends on the number of instructions that can be executed per second. Since the number of instructions that must be executed is essentially invariant and known, the rate at which instructions execute determines the performance of a given workload on the system of interest.
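The relationship between CPI and workload progress can be illustrated with a short Python sketch. The clock frequency is an assumed input that the text above does not name, and the function names and sample values are hypothetical.

```python
def instructions_per_second(clock_hz, cpi):
    """Instruction execution rate for a given clock frequency and CPI."""
    return clock_hz / cpi


def execution_time_seconds(instruction_count, clock_hz, cpi):
    """Time to complete a workload with a fixed, known instruction count."""
    return instruction_count * cpi / clock_hz


# Hypothetical workload: 2 billion instructions on a 100 MHz CPU at CPI 1.25.
rate = instructions_per_second(100e6, 1.25)
seconds = execution_time_seconds(2_000_000_000, 100e6, 1.25)
print(f"{rate:.3g} instructions/s, {seconds:.1f} s to complete")
```

Because the instruction count is treated as fixed, any reduction in CPI translates directly into a proportional reduction in execution time.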
Assuming that the cost of a hypothesized memory system is known, the resultant system cost can be compared to the projected performance. Thus, product planners can have a better understanding of the price/performance trade-offs involved with various subsystem designs. In this manner, the system configuration offering the best price/performance can be determined more accurately. The value of such knowledge is clear.
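A comparison of this kind might look like the following Python sketch. The candidate configurations, their costs, their projected CPI_finite values, the clock frequency, and the choice of instructions per second per unit cost as the figure of merit are all assumptions introduced for illustration.

```python
CLOCK_HZ = 100e6  # assumed clock frequency, identical for every candidate

# Hypothetical candidates: system cost and CPI_finite projected by the linear model.
candidates = [
    {"name": "baseline",      "cost": 1000.0, "cpi_finite": 1.50},
    {"name": "faster_memory", "cost": 1300.0, "cpi_finite": 1.34},
    {"name": "larger_cache",  "cost": 1100.0, "cpi_finite": 1.30},
]


def performance_per_cost(candidate):
    """Instructions per second delivered per unit of system cost."""
    return (CLOCK_HZ / candidate["cpi_finite"]) / candidate["cost"]


# Rank the candidates and pick the one with the highest figure of merit.
best = max(candidates, key=performance_per_cost)
for c in candidates:
    print(f'{c["name"]}: {performance_per_cost(c):.0f} instructions/s per unit cost')
print("best price/performance:", best["name"])
```

Other figures of merit (for example, cost at a required minimum performance) could be substituted without changing the structure of the comparison.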
There are many cases in which a PowerPC 604 performance monitor (one example of a performance monitor) can provide most of the required parameters (except for CPI_infinite, which in the past has always been derived from the five remaining factors). However, there are cases where CPI_infinite cannot be so determined, namely those cases where there is significant parallelism due to out-of-order execution. Advances in compiler and CPU technology are forcing this case to occur more and more frequently.
Hence, in the case of high instruction execution parallelism, knowing the time that a data cache miss is in progress is not sufficient to characterize the effect that a data cache miss has, on average, on CPI_finite. Likewise, a similar situation exists with instruction cache misses; parallelism confounds the ability to determine the true performance cost of such cache misses. Therefore, current performance monitoring implementations lack the ability to expose the most crucial factors limiting CPU performance. This limitation is a serious one because it prevents one from quickly and accurately evaluating system performance and thereby confounds attempts to design systems exhibiting superior cost/performance trade-offs. Thus, there is a need to correct these shortcomings encountered when measuring processors capable of out-of-order execution.
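The difficulty can be illustrated with a small Python sketch over an invented cycle-by-cycle timeline. The timeline, and the choice to treat a cycle as a miss-induced stall only when a miss is outstanding and no instruction completes, are assumptions made for illustration and do not describe any particular performance monitor.

```python
# Each tuple is one hypothetical cycle:
# (data_cache_miss_outstanding, instructions_completed_this_cycle)
timeline = [
    (False, 2), (True, 2), (True, 1), (True, 0),
    (True, 0), (True, 1), (False, 2), (False, 1),
]

# Cycles during which a miss was in progress, regardless of other activity.
miss_cycles = sum(1 for miss, _ in timeline if miss)

# Cycles during which a miss was in progress and nothing completed.
stalled_miss_cycles = sum(1 for miss, done in timeline if miss and done == 0)

print("cycles with a miss in progress:", miss_cycles)
print("of those, cycles with no instruction completing:", stalled_miss_cycles)
```

In this invented timeline the miss is outstanding for more cycles than the processor is actually idle, because out-of-order execution keeps completing independent instructions during part of the miss. Charging every miss-in-progress cycle as penalty would therefore overstate the contribution of the miss to CPI_finite.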