The present invention relates generally to optimizing the performance of a computer system, and more particularly to scheduling execution threads.
Computer processors are getting faster, yet software application performance is not keeping pace. For large commercial applications, average processor cycles-per-instruction (CPI) values may be as high as 2.5 or 3. With a four-way instruction issue processor, a CPI of three means that only one issue slot in every twelve is being put to good use. It is important to understand why software throughput is not keeping up with hardware improvements.
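The one-in-twelve figure follows directly from the issue width and the CPI. The short sketch below reproduces the arithmetic; the function name is hypothetical and is introduced only for illustration.

```python
from fractions import Fraction

def issue_slot_utilization(issue_width, cpi):
    """Fraction of issue slots that carry a completed instruction.

    A processor that can issue `issue_width` instructions per cycle
    offers issue_width * cpi slots for every instruction it completes.
    """
    slots_per_instruction = issue_width * Fraction(cpi)
    return Fraction(1) / slots_per_instruction

# Four-way issue with a CPI of three: one useful slot in every twelve.
print(issue_slot_utilization(4, 3))  # 1/12
```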
It is common to blame such problems on memory latencies; in fact, many software applications spend many cycles waiting for data transfers to complete. However, other problems, such as branch mispredicts, also waste processor cycles. Independent of the general causes, system architects and hardware and software engineers need to know which instructions are stalling, and why, in order to improve the performance of modern computer systems incorporating complex processors.
Typically, this is done by generating a "profile" of the behavior of a system while it is operating. A profile is a record of performance data. Frequently, the profile is presented graphically so that performance bottlenecks can readily be identified.
Profiling can be done by instrumentation and simulation. With instrumentation, additional code is added to a program to monitor specific events during execution of a program. Simulation attempts to emulate the behavior of the entire program in an artificial environment rather than executing the program in the real system.
Each of these two methods has its drawbacks. Instrumentation perturbs the program's true behavior due to the added instructions and extra data references. Simulation avoids perturbation at the expense of a substantial performance overhead when compared to executing the program on a real system. Furthermore, with either instrumentation or simulation, it is usually difficult to profile an entire large scale software system, i.e., application, operating system, and device driver code.
Hardware implemented event sampling can also be used to provide profile information of processors. Hardware sampling has a number of advantages over simulation and instrumentation: it does not require modifying software programs to measure their performance. Sampling works on complete systems, with a relatively low overhead. Indeed, recently it has been shown that low-overhead sampling-based profiling can be used to acquire detailed instruction-level information about pipeline stalls and their causes. However, many hardware sampling techniques lack flexibility because they are designed to measure specific events.
Most extant microprocessors, such as the DIGITAL Alpha AXP 21164, the Intel Pentium Pro, and the MIPS R10000, provide event counters that can count a variety of events, such as data cache (D-cache) misses, instruction cache (I-cache) misses, and branch mispredicts. The event counters generate an interrupt when the counters overflow so that the performance data in the counters can be sampled by higher levels of software.
Event counters are useful for capturing aggregate information, such as the number of branch mispredicts that the system incurred while executing a particular program, or part thereof. However, known event counters are less useful for attributing state information to individual instructions, such as which branch instructions are frequently mispredicted. This may be due to the fact that the program counters (PC) of instructions that caused the events may no longer be available when the event counter overflows and interrupts.
It is a particular problem to deduce the dynamic operation of a processor that can issue instructions out-of-order. Indeed, the behavior of software programs executing in an out-of-order processor can be quite subtle and difficult to understand. Consider the flow of instructions in the out-of-order Alpha 21264 processor as a concrete example.
Superscalar Processor Architecture
Execution Order
An out-of-order processor fetches and retires instructions in order, but processes the instructions according to their data dependencies. Processing instructions can involve register mapping, instruction issuing and executing. An instruction is said to be "in-flight" from the time it is fetched until it retires or aborts.
During each processor cycle, a first stage of the processor pipeline fetches a set of instructions from the instruction cache (I-cache). The set of instructions is decoded. The instruction decoder identifies which instructions in the fetched set are part of the instruction stream.
Because it may take multiple cycles to resolve the PC of a next instruction to fetch, the PC is usually predicted ahead of time by a branch or jump predictor. When the prediction is incorrect, the processor will abort the mispredicted instructions which occupy a "bad" execution path, and will restart fetching instructions on the "good" path.
To allow instructions to execute out-of-order, registers specified in operands of instructions are dynamically renamed to prevent write-after-read and write-after-write conflicts. This renaming is accomplished by mapping architectural or "virtual" registers to physical registers. Thus, two instructions that write the same virtual register can safely execute out-of-order because they will write to different physical registers, and consumers of the virtual registers will get the proper values.
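The renaming described above can be illustrated with a minimal sketch. The class name, register names, and the free-list policy are assumptions made for illustration, and the release of physical registers at retirement is omitted.

```python
class RenameMap:
    """Maps architectural ("virtual") registers to physical registers."""

    def __init__(self, virtual_registers, num_physical):
        # Hypothetical initial state: each virtual register starts out
        # mapped to its own physical register; the rest are free.
        self.mapping = {name: i for i, name in enumerate(virtual_registers)}
        self.free = list(range(len(virtual_registers), num_physical))

    def rename(self, dest, sources):
        # Sources read the current mapping, so each consumer sees the
        # most recent producer of the virtual register.
        physical_sources = [self.mapping[s] for s in sources]
        # Every write is given a fresh physical register, which removes
        # write-after-read and write-after-write conflicts.
        physical_dest = self.free.pop(0)
        self.mapping[dest] = physical_dest
        return physical_dest, physical_sources

renamer = RenameMap(["r1", "r2", "r3"], num_physical=8)
print(renamer.rename("r1", ["r2", "r3"]))  # (3, [1, 2])
print(renamer.rename("r2", ["r1", "r1"]))  # (4, [3, 3])
print(renamer.rename("r1", ["r3", "r3"]))  # (5, [2, 2])
```

The first and third instructions both write r1 but receive different physical registers, so they may complete in either order, while the second instruction still reads the value produced by the first.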
A register mapped instruction resides in the issue queue until its operands have been computed and a functional "execution" unit of the appropriate type is available. The physical registers used by an instruction are read in the cycle that the instruction issues. After instructions have executed, they are marked as ready to retire and will be retired by the processor when all previous ready-to-retire instructions in program order have been retired, i.e., instructions retire in the correct program order. Upon retirement, the processor commits the changes made by the instruction to the architectural "state" of the system, and releases resources consumed by the instruction.
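The in-order retirement rule can be sketched as a queue held in program order from which only the oldest completed instructions are removed. The queue representation and the field names are assumptions made for illustration.

```python
from collections import deque

def retire_ready(in_flight):
    """Retire instructions from the head of a program-ordered queue.

    An instruction retires only when it has executed and every older
    instruction has already retired.
    """
    retired = []
    while in_flight and in_flight[0]["done"]:
        retired.append(in_flight.popleft()["name"])
    return retired

in_flight = deque([
    {"name": "i1", "done": False},  # oldest, still executing
    {"name": "i2", "done": True},   # executed out of order
    {"name": "i3", "done": True},
])
first = retire_ready(in_flight)     # nothing retires past i1
in_flight[0]["done"] = True         # i1 finishes executing
second = retire_ready(in_flight)
print(first, second)  # [] ['i1', 'i2', 'i3']
```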
Misprediction
In some cases, such as when a branch is mispredicted, instructions must be trapped or discarded. When this occurs, the current speculative architectural state is rolled back to a point in the execution where the misprediction occurred, and fetching continues at the correct instruction.
Delays
Numerous events may delay the execution of an instruction. At the front of the pipeline, the fetch unit may stall due to an I-cache miss, or the fetch unit may fetch instructions along a bad path due to a misprediction. The mapper may stall due to lack of free physical registers, or lack of free slots in the issue queue. Instructions in the issue queue may wait for their register dependencies to be satisfied, or for the availability of functional execution units.
Instructions may stall due to data cache misses. Instructions may trap because they were speculatively issued down a bad path, or because the processor took an interrupt. Many of these events are difficult to predict statically, e.g., by an examination of the code, and all of them degrade the performance of the system. Simple event counters are inadequate to capture this type of state information. In addition, it is difficult to exactly measure the lengths of the delays to determine which delays deserve special attention.
It is highly desirable to directly attribute events to specific instructions and machine states so that programmers or optimization tools can improve the performance of software and hardware components of complex computer systems such as super-scalar and out-of-order processors, or for that matter processors of any architectural design.
Problems With Prior Art Event Counters
The main problem with known event counters is that the instruction that caused the event that overflowed the counter was usually fetched long before the instruction at the exception PC, i.e., the exception PC is not that of the instruction that caused the overflow. The length of the delay between the fetch and the interrupt is generally unpredictable. This unpredictable distribution of events makes it difficult to properly attribute events to specific instructions. Out-of-order and speculative execution amplifies this problem, but it is present even on in-order machines such as the Alpha 21164 processor.
For example, compare program counter values delivered to the performance counter interrupt handler while monitoring D-cache reference-event counts for the Alpha 21164 (in-order) processor vs. the Pentium Pro (out-of-order) processor. An example program consists of a loop containing a random memory access instruction, for example a load instruction, followed by hundreds of null operation instructions (nop).
On the in-order Alpha processor, all performance counter events (for example, cache misses) are attributed to the instruction that is executing six cycles after the event, which results in a large peak of samples on the seventh instruction after the load access. This skewed distribution of events is not ideal. However, because there exists a single large peak, static analysis can sometimes work backwards from this peak to identify the actual instruction that caused the event, but this is still nothing more than a best guess for a fairly simple program.
For the identical program executing on the out-of-order Pentium Pro, the event samples are widely distributed over the next 25 instructions, illustrating not only skewing but significant smearing as well. The wide distribution of samples makes it nearly impossible to attribute a specific event to the particular instruction that caused the event. Similar behavior occurs when counting other hardware events.
In addition to the skewed or smeared distribution of event samples, traditional event counters also suffer from additional problems. There usually are many more events of interest than there are event counters, making it difficult, if not impossible, to concurrently monitor all interesting events. The increasing complexity of processors is likely to exacerbate this problem.
In addition, event counters only record the fact that an event occurred; they do not provide any additional state information about the event. For many kinds of events, additional information, such as the latency to service a cache miss event, would be extremely useful.
Furthermore, prior art counters generally are unable to attribute events to "blind spots" in the code. A blind spot is any non-interruptible code, such as high-priority system routines and PAL code, because the event will not be recognized until its interrupt is honored. By that time, the processor state may have changed significantly, most likely giving false information.
Stalls vs. Bottlenecks
On a pipelined, in-order processor, one instruction stalling in a pipeline stage prevents later instructions from passing through that pipeline stage. Therefore it is relatively easy to identify "bottleneck" instructions on an in-order processor; that is, bottleneck instructions tend to stall somewhere in the pipeline. For an in-order processor, it is possible to identify stalls by measuring the latency of an instruction as it passes through each pipeline stage, and comparing the measured latency to the ideal latency of that instruction in each pipeline stage. An instruction can be presumed to have stalled in a stage when it takes longer than the minimum latency to pass through that stage.
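The comparison of measured and ideal stage latencies can be sketched as follows. The stage names and cycle counts are hypothetical values chosen only to illustrate the test.

```python
def stalled_stages(measured, ideal):
    """Return the extra cycles spent in each stage that exceeded its ideal latency."""
    return {
        stage: measured[stage] - ideal[stage]
        for stage in measured
        if measured[stage] > ideal[stage]
    }

# Hypothetical per-stage latencies, in cycles, for one sampled instruction.
ideal = {"fetch": 1, "map": 1, "issue": 1, "execute": 2}
measured = {"fetch": 1, "map": 1, "issue": 1, "execute": 14}

print(stalled_stages(measured, ideal))  # {'execute': 12}
```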
However, on an out-of-order processor, other instructions may pass through a pipeline stage around an instruction that is stalled in that pipeline stage. In fact, the additional latency of the stalled instruction may be completely masked by the processing of other instructions, so that stalled instructions may not delay the observed completion of the program.
Even on in-order processors, stalls in one pipeline stage may not contribute to the overall execution time of a program when another pipeline stage is the bottleneck. For example, during the execution of a memory-intensive program, the fetcher and mapper of the instruction pipeline may often stall because of "back-pressure" from an execution unit delayed by D-cache misses.
Ideally, one would like to classify the memory operations causing the cache misses as the primary bottlenecks. The fetcher and mapper stalls are actually symptomatic of the delays due to cache misses, that is, secondary bottlenecks.
It would be desirable to identify those instructions whose stalls are not masked by other instructions, and to identify them as true bottlenecks. Furthermore, in order to improve program behavior, there is a need to focus on the causal (primary) bottlenecks rather than the symptomatic (secondary) bottlenecks. This classification of pipeline stage bottlenecks as causal or symptomatic requires detailed knowledge of the state of the pipeline and of the data and resource dependencies of the in-flight instructions, which cannot be obtained from known simple event counters.
U.S. Pat. No. 5,151,981 "Instruction Sampling Instrumentation," issued to Wescott et al. on Sep. 29, 1992 proposes a hardware mechanism for instruction-based sampling in an out-of-order execution machine. There are a number of drawbacks in the approach taken by Wescott et al. First, their approach can bias the stream of instruction samples depending on the length of the code being sampled and the sampling rate. Second, their system only samples retired instructions, and not all instructions fetched, some of which may be aborted. Third, the information collected by the Wescott et al. mechanism focuses on individual event attributes, e.g., cache misses, but does not provide useful information for determining inter-instruction relationships.
More recently, a hardware mechanism called "informing loads" has been proposed; see Horowitz et al., "Informed memory operations: Providing memory performance feedback in modern processors," Proceedings 23rd Annual International Symposium on Computer Architecture, pp. 260-270, May 22, 1996. There, a memory operation can be followed by a conditional branch operation that is taken if and only if the memory operation misses in the cache. Although not specifically designed for profiling, that mechanism could be used to gather only D-cache miss event information.
In other specialized hardware, called a cache miss look-aside (CML) buffer, virtual memory pages that suffer from a high level-2 cache miss rate are identified, see Bershad et al. "Avoiding conflict misses dynamically in large direct-mapped caches," Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 158-170, Oct. 4, 1994, for a full description.
Some processors, such as the Intel Pentium, permit software to read the contents of the branch predictor's branch target buffer (BTB). By periodically reading the BTB in software, Conte et al. developed a very low overhead technique to estimate edge execution frequencies of a program, see "Using branch handling hardware to support profile-driven optimization," Proceedings of the 27th Annual International Symposium on Microarchitecture, pp. 12-21, Nov. 30, 1994.
That approach yields information that is similar to that which could be obtained by keeping track of the branch direction information contained in a "profile record" storing related sampling information. More recently, Conte et al. proposed a piece of additional hardware called a profile buffer which counts the number of times a branch is taken and not-taken, see "Accurate and practical profile-driven compilation using the profile buffer," Proceedings of the 29th Annual International Symposium on Microarchitecture, pp. 36-45, Dec. 2, 1996.
Provided are an apparatus and method for measuring the operation of processors which depart from traditional mechanisms. Rather than counting events, and sampling the program counter when event counters overflow, the present apparatus and method rely on randomly selecting instructions, and sampling detailed state information for the selected instructions.
Periodically, during operation of a processor, an instruction to be profiled is randomly selected, and a profile record of what happens during the execution of the instruction is accumulated in a set of internal profile registers of the processor. After processing of the selected instruction terminates, e.g., the instruction retires, aborts, or traps, an interrupt is generated. The recorded information characterizing the details of how the instruction was processed in the pipeline can be sampled from the internal profile registers by software.
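One way to picture the periodic random selection is a countdown that is reloaded with a random value each time an instruction is chosen. The sketch below is a software illustration only; the countdown mechanism, the uniform distribution of intervals, and all names are assumptions rather than details stated above.

```python
import random

def select_samples(num_fetched, mean_interval, rng):
    """Return the indices of fetched instructions selected for profiling.

    The gap between consecutive selections is drawn uniformly from
    [1, 2 * mean_interval - 1], so the average gap is mean_interval.
    """
    selected = []
    countdown = rng.randint(1, 2 * mean_interval - 1)
    for index in range(num_fetched):
        countdown -= 1
        if countdown == 0:
            selected.append(index)  # this instruction is profiled
            countdown = rng.randint(1, 2 * mean_interval - 1)
    return selected

samples = select_samples(num_fetched=10_000, mean_interval=100,
                         rng=random.Random(0))
print(len(samples), samples[:5])
```

Randomizing the interval avoids selecting the same instructions of a loop on every pass, which a fixed interval could do when the interval and the loop length share a common factor.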
The profile registers can record many useful facts about an instruction's execution. Example performance information can include: the number of cycles the selected instruction spent in each stage of an execution pipeline, i.e., stage latencies, whether the instruction suffered I-cache or D-cache misses, the effective addresses of its memory operands, or branch/jump targets, and whether the instruction was retired or aborted.
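As seen by software, the contents of the profile registers for one sampled instruction might be represented by a record such as the following. The field names, the stage names, and the example values are hypothetical; only the kinds of information come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ProfileRecord:
    """Software view of one sampled instruction (field names are illustrative)."""
    pc: int
    stage_latencies: Dict[str, int] = field(default_factory=dict)  # cycles per stage
    icache_miss: bool = False
    dcache_miss: bool = False
    effective_address: Optional[int] = None  # memory operand, if any
    branch_target: Optional[int] = None      # branch/jump target, if any
    retired: bool = False                    # False means the instruction aborted

    def total_cycles(self) -> int:
        # Sum of the recorded stage latencies for this instruction.
        return sum(self.stage_latencies.values())

record = ProfileRecord(
    pc=0x1000,
    stage_latencies={"fetch": 1, "map": 2, "issue": 7, "retire": 1},
    dcache_miss=True,
    effective_address=0x8000,
    retired=True,
)
print(record.total_cycles())  # 11
```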
On in-order executing processors, it is possible to estimate the total number of stall cycles attributable to each instruction when one is given the fetch-to-retire latencies of sampled instructions. This is sufficient to identify bottlenecks because one stalled instruction cannot overlap with another stalled instruction.
On an out-of-order processor, most stalls are likely to overlap and be masked by other instructions issued out-of-order around the stalled instructions. This makes the identification of stalled instructions difficult. In addition, it may be necessary to collect information about the average level of concurrency while each instruction was executing in order to identify bottlenecks.
Special-purpose hardware could count and record the number of instructions that issue while a profiled instruction is executing to measure the level of concurrent execution. However, this fails to account for instructions that issue but are aborted, and therefore fail to retire. Provided here is a measurement of the amount of useful concurrency. Useful concurrency is the average number of instructions that issue in parallel and successfully retire with a given instruction. Instructions that issue but subsequently abort are not useful. Then, instructions whose stalls are not masked by useful concurrency can be classified as bottlenecks. To state this another way, a key metric for pinpointing performance bottlenecks on an out-of-order processor is the number of issue slots that are wasted while a given instruction executes.
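The wasted-slot metric can be made concrete with a small calculation. The sketch assumes that the issue width, the number of cycles the profiled instruction was in flight, and the fate of each instruction issued during that interval are known; the function and parameter names are hypothetical.

```python
def wasted_issue_slots(issue_width, cycles_in_flight, issued_retired_flags):
    """Issue slots not filled by an instruction that went on to retire.

    issued_retired_flags holds one boolean per instruction issued while
    the profiled instruction was in flight: True if it later retired,
    False if it was aborted.
    """
    total_slots = issue_width * cycles_in_flight
    useful = sum(1 for retired in issued_retired_flags if retired)
    return total_slots - useful

# Hypothetical example: 4-wide issue, 10 cycles in flight,
# 12 instructions issued and retired, 5 issued but aborted.
flags = [True] * 12 + [False] * 5
print(wasted_issue_slots(4, 10, flags))  # 28
```

Counting the 5 aborted instructions as useful would understate the waste as 23 slots, which is the distinction drawn above between raw and useful concurrency.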
Accordingly, in order to measure useful concurrency, a technique called "pair-wise sampling" is provided. The basic idea is to implement a nested form of sampling. Here, a window of instructions that may execute concurrently with a first profiled instruction is dynamically defined. A second instruction is randomly selected for profiling from the window of instructions. The first and second instructions form a sample pair for which profile information can be collected.
Pair-wise sampling facilitates the determination of the number of wasted issue slots attributable to each instruction, and pinpoints bottlenecks much more accurately than known techniques. In general, pair-wise sampling is very flexible, forming the basis for analysis that can determine a wide variety of interesting concurrency and utilization metrics.
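One possible software analysis of sample pairs is sketched below. It assumes that each pair records the PC of the first instruction, whether the second instruction overlapped the first in execution, and whether the second instruction retired, and that the second instruction is drawn uniformly from a window of fixed size. These assumptions and all names are illustrative and are not a statement of the analysis actually used.

```python
from collections import defaultdict

def estimate_useful_concurrency(pairs, window_size):
    """Estimate, per PC, how many window instructions usefully overlap it.

    pairs: iterable of (pc_first, overlapped, second_retired) tuples.
    The fraction of pairs whose second instruction both overlapped and
    retired is scaled by the window size.
    """
    totals = defaultdict(int)
    useful = defaultdict(int)
    for pc, overlapped, second_retired in pairs:
        totals[pc] += 1
        if overlapped and second_retired:
            useful[pc] += 1
    return {pc: window_size * useful[pc] / totals[pc] for pc in totals}

pairs = [
    (0x100, True, True), (0x100, True, False),
    (0x100, False, True), (0x100, False, False),
    (0x104, True, True), (0x104, True, True),
]
print(estimate_useful_concurrency(pairs, window_size=8))
# {256: 2.0, 260: 8.0}
```

Under these assumptions, the instruction at 0x100 is a stronger bottleneck candidate than the one at 0x104, because little useful work overlaps it.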
Specifically, provided are an apparatus and method for periodically and randomly selecting one or more instructions processed by a pipeline of a processor, and for collecting profile information while each selected instruction progresses through stages of an execution pipeline. Higher-level software can then post-process this information in a variety of ways, such as by aggregating information from multiple executions of the same instruction.
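Aggregating samples from multiple executions of the same instruction might look like the following sketch. The tuple layout of a sample and the statistics that are computed are assumptions chosen for illustration.

```python
from collections import defaultdict

def aggregate_by_pc(samples):
    """Summarize samples per instruction address.

    samples: iterable of (pc, latency_cycles, retired) tuples.
    """
    counts = defaultdict(int)
    retired_counts = defaultdict(int)
    cycle_totals = defaultdict(int)
    for pc, cycles, retired in samples:
        counts[pc] += 1
        cycle_totals[pc] += cycles
        if retired:
            retired_counts[pc] += 1
    return {
        pc: {
            "samples": counts[pc],
            "retired": retired_counts[pc],
            "mean_latency": cycle_totals[pc] / counts[pc],
        }
        for pc in counts
    }

samples = [(0x1000, 12, True), (0x1000, 40, True),
           (0x1004, 6, False), (0x1000, 8, False)]
summary = aggregate_by_pc(samples)
print(summary[0x1000])  # {'samples': 3, 'retired': 2, 'mean_latency': 20.0}
```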
Examples of information that can be captured include: the instruction's address (program counter or PC), whether the instruction suffered an instruction cache miss, and the latency incurred to service the miss. If the instruction performs a memory operation, it can be determined whether the instruction suffered a data-cache miss, and the latency for satisfying the memory request can be measured. Furthermore, the amount of time the instruction spends in each pipeline stage can be measured. The profile information can also indicate whether the instruction retired or aborted, and in the latter case what kind of trap caused execution of the instruction to be aborted.
The information is collected in a set of profiling registers as the instruction progresses through the execution pipeline. When an instruction finishes executing, either because it retires or because it aborts, an interrupt is delivered to higher level software. The software can then process the information present in the profiling registers in a variety of ways.
Although the sampled performance information is very useful for profile-directed optimization, there are also many uses for hardware event-counters, such as counting the aggregate number of occurrences of an event.
The disclosed technique is an improvement over existing performance-monitoring hardware, and can be efficiently implemented at a relatively low hardware cost in modern microprocessors that can issue instructions out-of-order.
A method is provided for scheduling execution of a plurality of threads executed in a multithreaded processor. Resource utilizations of each of the plurality of threads are measured while the plurality of threads are concurrently executing in the multithreaded processor. Each of the plurality of threads is scheduled according to the measured resource utilizations using a thread scheduler.
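As a purely illustrative example of scheduling according to measured resource utilizations, the sketch below pairs the thread with the highest measured utilization of a shared resource with the thread with the lowest. The pairing policy, the utilization values, and the names are assumptions; the method above does not prescribe a particular policy.

```python
def pair_threads(utilization):
    """Group threads into co-scheduled pairs, heaviest with lightest.

    utilization: dict mapping thread name to its measured use of a
    shared resource, expressed as a fraction between 0 and 1.
    """
    ordered = sorted(utilization, key=utilization.get)
    groups = []
    while len(ordered) >= 2:
        lightest = ordered.pop(0)
        heaviest = ordered.pop()
        groups.append((heaviest, lightest))
    if ordered:
        groups.append((ordered[0],))  # odd thread out runs alone
    return groups

measured = {"A": 0.9, "B": 0.2, "C": 0.7, "D": 0.4}
print(pair_threads(measured))  # [('A', 'B'), ('C', 'D')]
```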