1. Field of the Invention
This invention relates generally to computer microprocessors. More particularly, the present invention relates to an apparatus and method for monitoring the performance of a microprocessor in real-time, at the frequency of the microprocessor.
2. Description of the Related Art
Modern computers contain microprocessors, which are essentially the brains of the computer. Modern microprocessors use a design technique called a pipeline, in which the output of one process serves as input to a second, the output of the second process serves as input to a third, and so on, often with more than one process occurring during a particular computer clock cycle. Modern computers and computer microprocessors contain a number of pipelines, and each particular pipeline contains a number of stages.
A computer program contains numerous instructions, which tell the computer what precisely it must do, to achieve the desired goal of the program. A computer runs a particular computer program by executing the instructions contained in the program. Theoretically, an instruction should complete execution in a number of computer cycles equal to the number of pipeline stages contained in the computer. If it takes longer, there should be a reason for the extra cycles. It might be that the extra cycles occur because of how the microprocessor was designed, and how the microprocessor must operate. The extra cycles might occur because of how the computer program was designed, and how the computer program operates. If the extra cycles are caused by the computer program's design, that design might be altered to eliminate or at least reduce the number of extra cycles. Such redesigning of the computer program might be done by the program designer, or might be done by a compiler or other computer program which translates a higher-level computer program into lower-level instructions that can be executed by the computer. Such fine-tuning of a computer program, so as to eliminate or reduce extra cycles, requires identifying the cause or causes producing those extra cycles.
During program execution in a modern microprocessor pipeline, instructions often suffer execution delays because of cache misses, branch mispredictions, memory access delays, and so forth, each of which results in extra cycles, sometimes also called delay cycles. A detailed understanding of which types of delays are producing large numbers of delay cycles would allow the programmer, or the compiler or other software tuning tool, to modify the program's instruction stream so as to reduce the number of delay cycles and, as a result, cause the program to execute faster. A performance monitor is intended to provide such understanding.
Known prior art performance monitors operate by simply counting the number of cache misses, branch mispredictions, and so forth. But not all such events contribute to a program's visible delay, due to parallel and super-scalar execution capabilities of today's processor pipelines, decoupling buffers used between multiple serial pipelines in today's processors to separate one pipeline from another, and so forth. For example, it is possible for a data cache miss to occur without causing a pipeline delay, if the use of the data happens long after the data actually is available for use. Consequently, simply counting the number of miss-events does not provide an accurate picture of where cycles are being wasted.
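By way of illustration only, the difference between counting miss-events and counting delay cycles can be sketched in Python as follows. The trace, the `Cycle` record, and both function names are hypothetical and are not part of the described apparatus; the trace simply assumes two data cache misses, only the second of which stalls the pipeline.

```python
from dataclasses import dataclass


@dataclass
class Cycle:
    dcache_miss: bool  # a data cache miss event was signaled in this cycle
    stalled: bool      # the pipeline made no forward progress in this cycle


def count_miss_events(trace):
    """Event counting in isolation, as in the prior art described above."""
    return sum(1 for c in trace if c.dcache_miss)


def count_stall_cycles(trace):
    """Counting only the cycles in which the pipeline was actually delayed."""
    return sum(1 for c in trace if c.stalled)


# Hypothetical trace: the first miss is hidden because its data is used
# long after it arrives; the second miss stalls the pipeline for 3 cycles.
trace = [
    Cycle(dcache_miss=True, stalled=False),
    Cycle(dcache_miss=False, stalled=False),
    Cycle(dcache_miss=False, stalled=False),
    Cycle(dcache_miss=True, stalled=False),
    Cycle(dcache_miss=False, stalled=True),
    Cycle(dcache_miss=False, stalled=True),
    Cycle(dcache_miss=False, stalled=True),
    Cycle(dcache_miss=False, stalled=False),
]

miss_events = count_miss_events(trace)    # 2 events
stall_cycles = count_stall_cycles(trace)  # 3 delay cycles
```

In this assumed trace the event count (2) says nothing about the three cycles actually lost, which is the gap the preceding paragraph describes.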
Some known prior art performance monitors include hardware counters that simply count certain events, such as data cache misses, in isolation without regard to whether or not the event counted actually produces a pipeline delay. In other known prior art performance monitors, monitoring is done by software simulation. Such simulation is slow, and cannot be used effectively on present-day and future processor pipelines capable of parallel and super-scalar execution.
The present invention tracks actual delay cycles in real-time, at the full frequency of the microprocessor, and is designed to work with advanced microprocessor architectures that feature speculative execution, pipelining, super-scalar execution, and/or decoupling buffers. Moreover, the present invention does not slow down the execution of the computer program's instruction stream, because the invention operates in parallel to the main processor pipeline. When implemented in the CPU hardware, the present invention eliminates the need for software simulation, and gives accurate, real-time breakdowns of processor stall cycles. This information may then be used by software for tuning operating systems and application programs. Examples of such software include Vtune™, a program commercially available from Intel Corporation, and profile-guided compilers.
Thus the present invention is directed to overcoming, or at least reducing, the effects of one or more of the problems mentioned above.
In one aspect of the present invention, a performance monitor is provided for use in parallel with a main processor pipeline. The performance monitor includes one or more silos (a series of storage elements) which receive a plurality of delay signals from the pipeline, which delay signals indicate particular reasons for extra cycles being required.
The silos output certain signals, which are received by a prioritizer. The prioritizer prioritizes the signals it receives according to a particular prioritization scheme, and then outputs a number of prioritized signals. The prioritized signals are then received by a combiner which selectively combines the prioritized signals, and outputs signals providing relevant information, for example, the delay cycles actually caused by branch mispredictions, the delay cycles actually caused by execution latency, the delay cycles actually caused by data access delays, the delay cycles actually caused by instruction access delays, and so forth. The number of cycles in a particular signal can then be counted to give a total number of delay cycles for that particular reason for delay.
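The prioritizing, combining, and counting just described may be modeled in software, purely for illustration, along the following lines. The signal names, the priority order in `PRIORITY`, and the grouping in `CATEGORY` are all assumptions made for this sketch; the actual prioritization scheme is not specified in this section.

```python
from collections import Counter

# Hypothetical priority order, highest priority first (an assumption).
PRIORITY = ["branch_flush", "dcache_stall", "dtlb_stall",
            "exec_stall", "icache_bubble"]

# Hypothetical combiner mapping: several prioritized signals may be
# merged into one reported reason for delay.
CATEGORY = {
    "branch_flush": "branch_misprediction",
    "dcache_stall": "data_access",
    "dtlb_stall": "data_access",
    "exec_stall": "execution_latency",
    "icache_bubble": "instruction_access",
}


def prioritize(staged):
    """Return the highest-priority asserted staged signal, or None."""
    for reason in PRIORITY:
        if staged.get(reason, False):
            return reason
    return None


def account(cycles):
    """Charge each delay cycle to exactly one combined category."""
    totals = Counter()
    for staged in cycles:
        winner = prioritize(staged)
        if winner is not None:
            totals[CATEGORY[winner]] += 1
    return totals


# One dict of asserted staged signals per cycle (hypothetical input).
cycles = [
    {"dcache_stall": True, "exec_stall": True},     # overlapping delays
    {"exec_stall": True},
    {},                                             # no delay this cycle
    {"branch_flush": True, "icache_bubble": True},  # overlapping delays
    {"dtlb_stall": True},
]
totals = account(cycles)
```

Under these assumptions each overlapping cycle is charged once, to the higher-priority reason, so the per-category totals sum to the number of delayed cycles rather than to the number of asserted signals.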
According to an aspect of the present invention, the prioritizing and the selective combining may be combined, may be performed in hardware, or may be performed under the control of programmable software. According to another aspect of the present invention, when the performance monitor has a single silo, there is no need for prioritizing and selective combining.
According to another aspect of the present invention, each silo has a number of individual stages, one stacked above the other. In one embodiment of the present invention, each stage includes a single latch. In another embodiment, a flip-flop is used instead of a latch. What is required is structure capable of storing a single bit, and thus any memory element or anything that is capable of storing information may be used. A silo as used in this patent is intended to encompass all such structure. Each silo receives one or more of the delay reason signals provided by the main processor pipeline, and outputs a staged signal. The staged signals from the silos are the signals received by the prioritizer.
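For illustration, a silo of single-bit stages may be modeled as a simple shift register. The `Silo` class below is a hypothetical software sketch, not the hardware structure itself, and it assumes that a signal enters at the top-most stage and that the staged signal is taken from the bottom stage.

```python
class Silo:
    """A stack of one-bit stages; index 0 is the bottom (output) stage."""

    def __init__(self, depth):
        if depth < 1:
            raise ValueError("a silo needs at least one stage")
        self.stages = [0] * depth

    def clock(self, top_in):
        """Advance one cycle: emit the bottom bit, shift down, load the top."""
        staged_out = self.stages[0]
        self.stages = self.stages[1:] + [1 if top_in else 0]
        return staged_out


# A single delay reason pulse entering a three-stage silo.
silo = Silo(depth=3)
outputs = [silo.clock(bit) for bit in [1, 0, 0, 0, 0]]
# outputs == [0, 0, 0, 1, 0]: the pulse emerges three cycles after entry
```

In this model the depth of the silo sets how many cycles a delay reason signal is held before it reaches the prioritizer.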
According to another aspect of the present invention, the number of stages in a particular silo is directly related to the position in the microprocessor pipeline of the pipeline stage producing a particular delay signal. The main processor pipeline includes a number of pipeline stages, including an ith stage and a jth stage, and this jth stage may provide one or more jth delay reason signals. In the pipeline, K stages separate the ith stage of the pipeline from the jth stage (not counting either the ith stage or the jth stage). One of the silos of the performance monitor has K+1 stages, that is, one more stage than the number of stages separating the ith stage and the jth stage of the pipeline, and a jth delay reason signal from the jth stage of the pipeline is provided to the top-most stage, that is, the K+1st stage, of this silo. According to another aspect of the present invention, one of the silos has more than K+1 stages, and the jth delay reason signal from the pipeline is provided to the K+1st stage of the silo, and to each stage of the silo above the K+1st stage to the top of that silo. According to yet another aspect of the present invention, the number of stages in a particular silo is one less than the number of stages from the beginning of the pipeline to the last stage in the pipeline where a delay can occur, and a jth delay reason signal is provided to all the stages in that silo.
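The relationship between pipeline position and silo depth can be worked through with a short sketch. The helper names and the stage numbers used in the example are hypothetical; silo stages are assumed to be numbered from 1 at the bottom, so that the K+1st stage of a K+1-stage silo is its top-most stage.

```python
def stages_between(i, j):
    """K: pipeline stages strictly between stage i and stage j."""
    return abs(i - j) - 1


def silo_depth(i, j):
    """Depth of the basic silo: K + 1 stages."""
    return stages_between(i, j) + 1


def load_silo(depth, entry_stage):
    """Silo contents, bottom stage first, after a delay reason signal is
    written to stage `entry_stage` (1-based) and to every stage above it."""
    if not 1 <= entry_stage <= depth:
        raise ValueError("entry_stage must lie within the silo")
    return [1 if n >= entry_stage else 0 for n in range(1, depth + 1)]


# Hypothetical pipeline positions: i = 7, j = 3, so K = 3 and K + 1 = 4.
k = stages_between(7, 3)
depth = silo_depth(7, 3)

basic = load_silo(depth, k + 1)       # [0, 0, 0, 1]: top-most stage only
taller = load_silo(6, k + 1)          # [0, 0, 0, 1, 1, 1]: K+1st and above
all_stages = load_silo(6, 1)          # [1, 1, 1, 1, 1, 1]: every stage
```

One possible reading of the taller silo is that a single delay reason signal is then presented at the silo output for several consecutive cycles; that interpretation is an assumption of this sketch.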
According to another aspect of the present invention, cycle accounting for a microprocessor includes receiving certain of the delay reason signals, staging each of the received signals and outputting staged signals, prioritizing the staged signals and outputting prioritized signals, and selectively combining the prioritized signals and outputting signals. According to one aspect of the present invention, the cycle accounting is carried out at the frequency of the microprocessor. According to another aspect of the present invention, the cycle accounting is carried out in parallel to the microprocessor pipeline. And according to yet another aspect of the present invention, the cycle accounting continues to be carried out when the microprocessor pipeline experiences delays.
According to still another aspect of the present invention, a delay cycle accounting system is provided. The system includes a main processor coupled to a performance monitor. The processor includes a pipeline which operates in parallel to the performance monitor. The performance monitor is coupled to the pipeline, and includes one or more silos, each of which receives at least one of a plurality of delay reason signals provided by the pipeline. Each silo outputs a staged signal, and all such staged signals are received by a prioritizer. The prioritizer selectively prioritizes the staged signals it receives, and outputs at least two prioritized signals, at least one of which is a logical combination of at least two of the staged signals. A combiner receives the prioritized signals, and outputs at least one signal that is a logical combination of at least two of the prioritized signals. A counter receives this signal and counts the number of cycles the condition has occurred, and outputs a signal indicating this cycle count.
The present invention can deal with overlapping delays, such as overlapping stall conditions, delays that cause multiple pipeline effects, such as multi-cycle bubbles, flushes resulting from branch mispredictions, and so forth, and delays caused in decoupling buffers and elsewhere. The present invention is not limited to any particular microprocessor, and can readily be implemented for different instruction sets and pipeline microarchitectures that support speculative execution and super-scalar instruction execution.
The present invention is of significant importance to future microprocessors, because as microprocessor pipelines become deeper, faster, and wider, and the relative speed of memory becomes slower, detailed performance analysis becomes increasingly important. The present invention enables real-time break-down of program execution time, and allows measurement and analysis of performance bottlenecks on complex software systems in real-time. Large complex workloads, such as computer operating systems and databases, which cannot readily be simulated, can be effectively optimized using the present invention. These and other benefits will become evident as the present invention is described more fully below.