1. Field of the Invention
This invention relates to computer systems, and more particularly, to finding the sources of lost cycles in a microprocessor, which create performance loss.
2. Description of the Relevant Art
The performance of computer systems may be increased after the systems are already built if the design is able to report the necessary computing statistics. The statistics may correspond to the execution of software applications. The software applications may then be modified, in accordance with the feedback provided by the computing statistics, in order to improve subsequent executions of the applications. One way to obtain the necessary computing statistics is to have the one or more microprocessors in the system provide them as the microprocessor(s) execute software applications.
Microprocessors may contain one or more processor cores, or processors, with each processor capable of executing the instructions of an application. Modern processors are pipelined; that is, each processor comprises one or more data processing stages connected in series, with storage elements placed between the stages. The output of one stage is made the input of the next stage during each transition of a clock signal. Level-sensitive latches may be used as storage elements in a pipeline at a phase boundary, where a phase is a portion of a clock cycle. Edge-sensitive flip-flops may be used as storage elements in a pipeline at a cycle boundary. The amount of execution of an instruction performed within a pipeline stage refers to the amount of execution performed by the integrated circuits between flip-flops at clock cycle boundaries. Ideally, every clock cycle produces useful execution for each stage of the pipeline. When an event occurs that prevents useful execution in a stage of the pipeline, such as a branch misprediction, a dependence of an instruction operand on a result of a previous instruction, or a cache miss, an instructions-per-clock-cycle (IPC) loss is said to occur. In that case, no useful work is performed by the microprocessor in the affected pipeline stage. In order to reduce IPC losses in a pipeline, modern processors may execute instructions of a software program in a different sequence than the in-order sequence in which they appear in the program. The instructions are still retired in order, so that the architectural state remains valid in the case of an interrupt. In addition to this out-of-order execution, modern microprocessors may utilize data forwarding, compiler loop unrolling and rescheduling, improved branch prediction methods, parallel execution by multiple functional units, etc., in order to reduce IPC losses in a pipeline.
When an IPC loss occurs for one instruction while useful work is still performed by at least a second instruction executing in parallel, the second instruction overshadows, or hides, the IPC loss of the first instruction, and no performance loss is suffered. Hereafter, the term IPC loss refers only to an IPC loss that is not overshadowed and that does reduce performance.
Upon completion of a preset number of clock cycles of execution of a software program, such as one million cycles, the statistics reported by a microprocessor for performance enhancement may include both the number of IPC loss cycles and the sources of the IPC loss cycles. This information may help compiler programmers and software application programmers restructure the sequence of instructions in an application for improved performance in subsequent executions of the application.
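For illustration only, the distinction between a hidden IPC loss and a counted IPC loss, together with the per-source tally described above, may be modeled in software. The following is a minimal sketch and does not represent any particular hardware mechanism. The trace format, the WORK marker, the function name tally_ipc_loss, and the rule of attributing a loss cycle to the first (oldest) slot are all assumptions made for this example.

```python
from collections import Counter

WORK = "work"  # hypothetical marker: the slot performed useful execution this cycle


def tally_ipc_loss(cycles):
    """Count unhidden IPC loss cycles and attribute each to a source.

    `cycles` is a list with one entry per clock cycle. Each entry is a
    non-empty list holding one status per instruction slot: either WORK or
    a string naming the stall cause (for example "cache_miss").
    """
    loss_by_source = Counter()
    for slots in cycles:
        if any(status == WORK for status in slots):
            # At least one instruction made progress, so any stall in this
            # cycle is overshadowed and is not counted as an IPC loss.
            continue
        # Assumption for this sketch: blame the first (oldest) slot.
        loss_by_source[slots[0]] += 1
    return sum(loss_by_source.values()), loss_by_source


# A five-cycle, two-slot trace (illustrative data only).
trace = [
    [WORK, WORK],
    ["cache_miss", WORK],                        # stall hidden by parallel work
    ["cache_miss", "dependence"],                # counted, blamed on cache_miss
    ["branch_mispredict", "branch_mispredict"],  # counted
    [WORK, "dependence"],                        # stall hidden by parallel work
]
total, by_source = tally_ipc_loss(trace)
print(total, dict(by_source))
```

In this trace, two of the four cycles containing a stall are hidden by useful work in the other slot, so only two cycles are counted as IPC loss cycles.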
Modern microprocessors contain performance counters to monitor and report performance-relevant events such as the number of cache misses, cache miss penalties, the number of branch mispredictions, etc. However, design techniques used to increase throughput and to reduce IPC losses in a microprocessor pipeline make it more difficult to accurately measure and report performance statistics such as the number of IPC loss cycles and the sources of IPC losses. For example, if an instruction experiences a data cache miss, not all cycles of the miss penalty necessarily reduce throughput, owing to out-of-order and superscalar execution. Useful work may be performed while the cache miss is being serviced. Simply counting the number of cycles of cache misses in a performance counter during program execution therefore does not accurately report the effect of cache misses on the computer system. Additionally, accurately reporting statistics requires that the counters and associated logic not affect the performance of the program being executed.
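The gap between a simple miss-cycle count and the cycles actually lost may be illustrated with a small software model. This is a sketch under stated assumptions, not a description of an actual counter design: the function name miss_cycle_counts, the half-open cycle range, and the representation of useful work as a set of cycle numbers are hypothetical choices made for the example.

```python
def miss_cycle_counts(miss_start, miss_end, work_cycles):
    """Return (naive, effective) cycle counts for one data cache miss.

    miss_start, miss_end: half-open range [miss_start, miss_end) of cycles
        during which the miss is outstanding.
    work_cycles: set of cycle numbers in which some other instruction
        performed useful work.
    """
    # Naive counter: every cycle of the miss penalty is charged to the miss.
    naive = miss_end - miss_start
    # Effective count: only cycles in which no useful work overlapped the miss.
    effective = sum(
        1 for cycle in range(miss_start, miss_end) if cycle not in work_cycles
    )
    return naive, effective


# A 10-cycle miss penalty, with useful work overlapping the first 6 cycles.
naive, effective = miss_cycle_counts(10, 20, {10, 11, 12, 13, 14, 15})
print(naive, effective)
```

Here the naive count charges all 10 cycles to the cache miss, whereas only 4 of those cycles pass with no useful work being performed.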
In view of the above, efficient performance monitoring methods and mechanisms are desired.