Computer processors are getting faster, yet software application performance is not keeping pace. For large commercial applications, average processor cycles-per-instruction (CPI) values may be as high as 2.5 or 3. With a four-way instruction issue processor, a CPI of three means that only one issue slot in every twelve is being put to good use. It is important to understand why software throughput is not keeping up with hardware improvements.
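The issue-slot arithmetic above can be checked with a short sketch. The function name `issue_slot_utilization` is hypothetical and introduced only for illustration. It assumes that every issue slot is either used by exactly one completed instruction or wasted.

```python
def issue_slot_utilization(cpi: float, issue_width: int) -> float:
    """Return the fraction of issue slots that carry a completed instruction.

    Each instruction takes `cpi` cycles on average, and every cycle offers
    `issue_width` slots, so one instruction consumes cpi * issue_width slots.
    """
    if cpi <= 0 or issue_width <= 0:
        raise ValueError("cpi and issue_width must be positive")
    return 1.0 / (cpi * issue_width)


# The example from the text: a CPI of 3 on a four-way issue processor.
utilization = issue_slot_utilization(cpi=3.0, issue_width=4)
print(f"{utilization:.1%} of issue slots are used")
```

With a CPI of 3 and an issue width of 4, the result is 1/12, that is, one useful slot in every twelve.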
It is common to blame such problems on memory latencies; in fact, many software applications spend many cycles waiting for data transfers to complete. However, other problems, such as branch mispredicts, also waste processor cycles. Independent of the general causes, system architects and hardware and software engineers need to know which instructions are stalling, and why, in order to improve the performance of modern computer systems incorporating complex processors.
Typically, this is done by generating a "profile" of the behavior of a system while it is operating. A profile is a record of performance data. Frequently, the profile is presented graphically so that performance bottlenecks can readily be identified.
Profiling can be done by instrumentation and simulation. With instrumentation, additional code is added to a program to monitor specific events during execution of a program. Simulation attempts to emulate the behavior of the entire program in an artificial environment rather than executing the program in the real system.
Each of these two methods has its drawbacks. Instrumentation perturbs the program's true behavior due to the added instructions and extra data references. Simulation avoids perturbation at the expense of a substantial performance overhead when compared to executing the program on a real system. Furthermore, with either instrumentation or simulation, it is usually difficult to profile an entire large-scale software system, i.e., application, operating system, and device driver code.
Hardware implemented event sampling can also be used to provide profile information of processors. Hardware sampling has a number of advantages over simulation and instrumentation: it does not require modifying software programs to measure their performance. Sampling works on complete systems, with a relatively low overhead. Indeed, recently it has been shown that low-overhead sampling-based profiling can be used to acquire detailed instruction-level information about pipeline stalls and their causes. However, many hardware sampling techniques lack flexibility because they are designed to measure specific events.
Most extant microprocessors, such as the DIGITAL Alpha AXP 21164, the Intel Pentium Pro, and the MIPS R10000, provide event counters that can count a variety of events, such as data cache (D-cache) misses, instruction cache (I-cache) misses, and branch mispredicts. The event counters generate an interrupt when the counters overflow so that the performance data in the counters can be sampled by higher levels of software.
Event counters are useful for capturing aggregate information, such as the number of branch mispredicts that the system incurred while executing a particular program, or part thereof. However, known event counters are less useful for attributing state information to individual instructions, such as which branch instructions are frequently mispredicted. This may be due to the fact that the program counters (PC) of instructions that caused the events may no longer be available when the event counter overflows and interrupts.
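The following is a minimal sketch of the overflow-and-interrupt behavior described above, not a model of any particular processor. The class `EventCounter`, the sampling period, the branch address, and the 24-byte offset between the branch and the PC seen by the handler are all assumptions made for illustration.

```python
class EventCounter:
    """Hypothetical counter that raises an interrupt every `period` events."""

    def __init__(self, period: int) -> None:
        self.period = period
        self.count = 0          # events since the last overflow
        self.total = 0          # aggregate count, always exact
        self.sampled_pcs = []   # PCs visible to the interrupt handler

    def record_event(self, pc_at_interrupt: int) -> None:
        """Count one event; on overflow, sample the PC seen at interrupt time."""
        self.total += 1
        self.count += 1
        if self.count == self.period:
            self.count = 0
            self.sampled_pcs.append(pc_at_interrupt)


BRANCH_PC = 0x1000  # assumed address of the branch that is mispredicted
counter = EventCounter(period=4)
for _ in range(12):
    # Assumption: by the time the interrupt is delivered, the PC has
    # advanced 24 bytes past the branch that caused the event.
    counter.record_event(pc_at_interrupt=BRANCH_PC + 24)

print(counter.total, [hex(pc) for pc in counter.sampled_pcs])
```

In this sketch the aggregate count is exact, yet none of the sampled PCs equals the address of the branch that caused the events, which is the attribution problem noted above.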
It is a particular problem to deduce the dynamic operation of a processor that can issue instructions out-of-order. Indeed, the behavior of software programs executing in an out-of-order processor can be quite subtle and difficult to understand. Consider the flow of instructions in the out-of-order Alpha 21264 processor as a concrete example.
Superscalar Processor Architecture
Execution Order
An out-of-order processor fetches and retires instructions in order, but processes the instructions according to their data dependencies. Processing instructions can involve register mapping, instruction issuing and executing. An instruction is said to be "in-flight" from the time it is fetched until it retires or aborts.
During each processor cycle, a first stage of the processor pipeline fetches a set of instructions from the instruction cache (I-cache). The set of instructions is decoded. The instruction decoder identifies which instructions in the fetched set are part of the instruction stream.
Because it may take multiple cycles to resolve the PC of a next instruction to fetch, the PC is usually predicted ahead of time by a branch or jump predictor. When the prediction is incorrect, the processor will abort the mispredicted instructions which occupy a "bad" execution path, and will restart fetching instructions on the "good" path.
To allow instructions to execute out-of-order, registers specified in operands of instructions are dynamically renamed to prevent write-after-read and write-after-write conflicts. This renaming is accomplished by mapping architectural or "virtual" registers to physical registers. Thus, two instructions that write the same virtual register can safely execute out-of-order because they will write to different physical registers, and consumers of the virtual registers will get the proper values.
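Register renaming as described above can be sketched as follows. The class `RenameMap`, the register counts, and the three-instruction sequence are hypothetical. The sketch omits returning physical registers to the free list at retirement.

```python
class RenameMap:
    """Hypothetical map from architectural ("virtual") to physical registers."""

    def __init__(self, num_arch: int, num_phys: int) -> None:
        # Initially architectural register i lives in physical register i.
        self.mapping = list(range(num_arch))
        self.free = list(range(num_arch, num_phys))

    def rename(self, dest: int, srcs: list) -> tuple:
        """Rename one instruction; return (physical dest, physical sources)."""
        # Sources are looked up first so they see the previous writer.
        phys_srcs = [self.mapping[s] for s in srcs]
        if not self.free:
            raise RuntimeError("mapper stall: no free physical registers")
        phys_dest = self.free.pop(0)
        self.mapping[dest] = phys_dest
        return phys_dest, phys_srcs


rmap = RenameMap(num_arch=4, num_phys=8)
i1 = rmap.rename(dest=1, srcs=[2, 3])  # r1 = r2 op r3
i2 = rmap.rename(dest=0, srcs=[1, 1])  # r0 = r1 op r1 (reads i1's result)
i3 = rmap.rename(dest=1, srcs=[2, 2])  # r1 = r2 op r2 (writes r1 again)
print(i1, i2, i3)
```

Here the first and third instructions both write architectural register r1 but receive different physical registers, so the third can execute before the second has read its operand without overwriting the value the second needs.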
A register mapped instruction resides in the issue queue until its operands have been computed and a functional "execution" unit of the appropriate type is available. The physical registers used by an instruction are read in the cycle that the instruction issues. After instructions have executed, they are marked as ready to retire and will be retired by the processor when all previous ready-to-retire instructions in program order have been retired, i.e., instructions retire in the correct program order. Upon retirement, the processor commits the changes made by the instruction to the architectural "state" of the system, and releases resources consumed by the instruction.
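In-order retirement can be illustrated with a small sketch. The function `retire_in_order` and the dictionary layout of each entry are assumptions for illustration. Real processors track considerably more state per in-flight instruction.

```python
from collections import deque


def retire_in_order(in_flight: deque) -> list:
    """Retire ready instructions from the head of the program-order queue.

    An instruction retires only when it is ready and every earlier
    instruction in program order has already retired.
    """
    retired = []
    while in_flight and in_flight[0]["ready"]:
        retired.append(in_flight.popleft()["name"])
    return retired


# Program order is A, B, C. B and C finished executing early, but A has not.
in_flight = deque([
    {"name": "A", "ready": False},
    {"name": "B", "ready": True},
    {"name": "C", "ready": True},
])
first_pass = retire_in_order(in_flight)   # nothing retires: A blocks B and C

in_flight[0]["ready"] = True              # A finishes executing
second_pass = retire_in_order(in_flight)  # A, B and C retire in program order
print(first_pass, second_pass)
```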
Misprediction
In some cases, such as when a branch is mispredicted, instructions must be trapped or discarded. When this occurs, the current speculative architectural state is rolled back to a point in the execution where the misprediction occurred, and fetching continues at the correct instruction.
Delays
Numerous events may delay the execution of an instruction. At the front of the pipeline, the fetch unit may stall due to an I-cache miss, or the fetch unit may fetch instructions along a bad path due to a misprediction. The mapper may stall due to lack of free physical registers, or lack of free slots in the issue queue. Instructions in the issue queue may wait for their register dependencies to be satisfied, or for the availability of functional execution units.
Instructions may stall due to data cache misses. Instructions may trap because they were speculatively issued down a bad path, or because the processor took an interrupt. Many of these events are difficult to predict statically, e.g., by an examination of the code, and all of them degrade the performance of the system. Simple event counters are inadequate to capture this type of state information. In addition, it is difficult to exactly measure the lengths of the delays to determine which delays deserve special attention.
It is highly desirable to directly attribute events to specific instructions and machine states so that programmers or optimization tools can improve the performance of software and hardware components of complex computer systems such as super-scalar and out-of-order processors, or for that matter processors of any architectural design.
Problems with Prior Art Event Counters
The main problem with known event counters is that the instruction that caused the event that overflowed the counter was usually fetched long before the instruction at the exception PC, i.e., the exception PC is not the PC of the instruction that caused the overflow. The delay between the fetch and the interrupt is generally unpredictable. This unpredictable distribution of events makes it difficult to properly attribute events to specific instructions. Out-of-order and speculative execution amplifies this problem, but it is present even on in-order machines such as the Alpha 21164 processor.
For example, compare program counter values delivered to the performance counter interrupt handler while monitoring D-cache reference-event counts for the Alpha 21164 (in-order) processor vs. the Pentium Pro (out-of-order) processor. An example program consists of a loop containing a random memory access instruction, for example a load instruction, followed by hundreds of null operation instructions (nop).
On the in-order Alpha processor, all performance counter events (for example, cache misses) are attributed to the instruction that is executing six cycles after the event, which results in a large peak of samples on the seventh instruction after the load access. This skewed distribution of events is not ideal. Because there is a single large peak, static analysis can sometimes work backwards from this peak to identify the actual instruction that caused the event, but even for a fairly simple program this is nothing more than a best guess.
For the identical program executing on the out-of-order Pentium Pro, the event samples are widely distributed over the next 25 instructions, illustrating not only skewing but significant smearing as well. The wide distribution of samples makes it nearly impossible to attribute a specific event to the particular instruction that caused the event. Similar behavior occurs when counting other hardware events.
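The difference between skewing and smearing can be illustrated with a toy simulation. It is a sketch under stated assumptions, not a reproduction of the measurements described above: the function `blamed_instructions` is hypothetical, the in-order case uses a fixed offset of seven instructions, and the out-of-order case draws the offset uniformly from the next 25 instructions, a distribution chosen only for illustration.

```python
import random
from collections import Counter


def blamed_instructions(num_events: int, load_index: int, offset_fn) -> Counter:
    """Histogram of instruction indices blamed for events caused by the load.

    `offset_fn` returns how many instructions past the load the PC has
    advanced by the time the counter interrupt is delivered.
    """
    histogram = Counter()
    for _ in range(num_events):
        histogram[load_index + offset_fn()] += 1
    return histogram


rng = random.Random(0)  # fixed seed so the run is repeatable

# In-order: a constant delay, so every sample lands on one wrong instruction.
in_order = blamed_instructions(1000, 0, lambda: 7)

# Out-of-order: a variable delay, so samples spread over many instructions.
out_of_order = blamed_instructions(1000, 0, lambda: rng.randint(1, 25))

print(len(in_order), len(out_of_order))
```

In both cases the load at index 0 receives no samples. The skewed histogram has a single peak from which the load might be inferred, whereas the smeared histogram has no such peak.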
Beyond the skewed or smeared distribution of event samples, traditional event counters suffer from further problems. There usually are many more events of interest than there are event counters, making it difficult, if not impossible, to concurrently monitor all interesting events. The increasing complexity of processors is likely to exacerbate this problem.
In addition, event counters only record the fact that an event occurred; they do not provide any additional state information about the event. For many kinds of events, additional information, such as the latency to service a cache miss event, would be extremely useful.
Furthermore, prior art counters generally are unable to attribute events to "blind spots" in the code. A blind spot is any non-interruptible code, such as high-priority system routines and PAL code. An event that occurs in a blind spot will not be recognized until its interrupt is honored. By that time, the processor state may have changed significantly, most likely giving false information.
Stalls vs. Bottlenecks
On a pipelined, in-order processor, one instruction stalling in a pipeline stage prevents later instructions from passing through that pipeline stage. Therefore, it is relatively easy to identify "bottleneck" instructions on an in-order processor; that is, bottleneck instructions tend to stall somewhere in the pipeline. For an in-order processor, it is possible to identify stalls by measuring the latency of an instruction as it passes through each pipeline stage, and comparing the measured latency to the ideal latency of that instruction in each pipeline stage. An instruction can be presumed to have stalled in a stage when it takes longer than the minimum latency to pass through that stage.
However, on an out-of-order processor, other instructions may pass through a pipeline stage around an instruction that is stalled in that pipeline stage. In fact, the additional latency of the stalled instruction may be completely masked by the processing of other instructions, so that stalled instructions may not delay the observed completion of the program.
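The per-stage comparison for an in-order processor can be sketched as follows. The function `stalled_stages`, the stage names, and the latency values are hypothetical and chosen only to illustrate the comparison.

```python
def stalled_stages(measured: dict, ideal: dict) -> dict:
    """Return the extra cycles spent in each stage where the instruction stalled.

    A stage is presumed to have stalled when the measured latency exceeds
    the ideal (minimum) latency for that stage.
    """
    return {
        stage: measured[stage] - ideal[stage]
        for stage in ideal
        if measured[stage] > ideal[stage]
    }


# Assumed latencies, in cycles, for one instruction.
ideal = {"fetch": 1, "map": 1, "issue": 1, "execute": 3}
measured = {"fetch": 1, "map": 4, "issue": 1, "execute": 3}

stalls = stalled_stages(measured, ideal)
print(stalls)
```

In this example only the map stage is flagged, with three cycles beyond its minimum latency.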
Even on in-order processors, stalls in one pipeline stage may not contribute to the overall execution time of a program when another pipeline stage is the bottleneck. For example, during the execution of a memory-intensive program, the fetcher and mapper of the instruction pipeline may often stall because of "back-pressure" from an execution unit delayed by D-cache misses.
Ideally, one would like to classify the memory operations causing the cache misses as the primary bottlenecks. The fetcher and mapper stalls are actually symptomatic of the delays due to cache misses, that is, they are secondary bottlenecks.
It is desirable to identify those instructions whose stalls are not masked by other instructions, and to identify them as true bottlenecks. Furthermore, in order to improve program behavior, there is a need to focus on the causal (primary) bottlenecks rather than the symptomatic (secondary) bottlenecks. This classification of pipeline stage bottlenecks as causal or symptomatic requires detailed knowledge of the state of the pipeline and of the data and resource dependencies of the in-flight instructions, which cannot be obtained from known simple event counters.
U.S. Pat. No. 5,151,981 "Instruction Sampling Instrumentation," issued to Wescott et al. on Sep. 29, 1992 proposes a hardware mechanism for instruction-based sampling in an out-of-order execution machine. There are a number of drawbacks in the approach taken by Wescott et al. First, their approach can bias the stream of instruction samples depending on the length of the code being sampled and the sampling rate. Second, their system only samples retired instructions, and not all instructions fetched, some of which may be aborted. Third, the information collected by the Wescott et al. mechanism focuses on individual event attributes, e.g., cache misses, but does not provide useful information for determining inter-instruction relationships.
More recently, a hardware mechanism called "informing loads" has been proposed, see Horowitz et al., "Informing memory operations: Providing memory performance feedback in modern processors," Proceedings 23rd Annual International Symposium on Computer Architecture, pp. 260-270, May 22, 1996. There, a memory operation can be followed by a conditional branch operation that is taken if and only if the memory operation misses in the cache. Although not specifically designed for profiling, that mechanism could be used to gather just D-cache miss event information.
In other specialized hardware, called a cache miss look-aside (CML) buffer, virtual memory pages that suffer from a high level-2 cache miss rate are identified, see Bershad et al. "Avoiding conflict misses dynamically in large direct-mapped caches," Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 158-170, Oct. 4, 1994, for a full description.
Some processors, such as the Intel Pentium, permit software to read the contents of the branch predictor's branch target buffer (BTB). By periodically reading the BTB in software, Conte et al. developed a very low overhead technique to estimate edge execution frequencies of a program, see "Using branch handling hardware to support profile-driven optimization," Proceedings of the 27th Annual International Symposium on Microarchitecture, pp. 12-21, Nov. 30, 1994.
That approach yields information that is similar to that which could be obtained by keeping track of the branch direction information contained in a "profile record" storing related sampling information. More recently, Conte et al. proposed a piece of additional hardware called a profile buffer which counts the number of times a branch is taken and not-taken, see "Accurate and practical profile-driven compilation using the profile buffer," Proceedings of the 29th Annual International Symposium on Microarchitecture, pp. 36-45, Dec. 2, 1996.
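The kind of information a taken/not-taken counter provides can be illustrated with a minimal software sketch. It is not the hardware design proposed by Conte et al.; the class `BranchCounts`, its unbounded table, and the branch address are assumptions for illustration.

```python
from collections import defaultdict


class BranchCounts:
    """Hypothetical per-branch taken / not-taken counters."""

    def __init__(self) -> None:
        # Maps branch PC -> [taken count, not-taken count].
        self.counts = defaultdict(lambda: [0, 0])

    def record(self, branch_pc: int, taken: bool) -> None:
        """Record one dynamic execution of the branch at `branch_pc`."""
        self.counts[branch_pc][0 if taken else 1] += 1

    def taken_fraction(self, branch_pc: int) -> float:
        """Fraction of recorded executions in which the branch was taken."""
        taken, not_taken = self.counts[branch_pc]
        return taken / (taken + not_taken)


profile = BranchCounts()
LOOP_BRANCH = 0x2000  # assumed address of a loop-closing branch
for iteration in range(10):
    # The branch is taken on every iteration except the last.
    profile.record(LOOP_BRANCH, taken=(iteration < 9))

print(profile.counts[LOOP_BRANCH], profile.taken_fraction(LOOP_BRANCH))
```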