1. Field of the Invention
This invention relates generally to computer microprocessors. More particularly, the present invention relates to an apparatus and method for monitoring the performance of a microprocessor in real-time, at the frequency of the microprocessor.
2. Description of the Related Art
Modern computers contain microprocessors, which are essentially the brains of the computer. Modern microprocessors use a design technique called a pipeline, in which the output of one process serves as input to a second, the output of the second process serves as input to a third, and so on, often with more than one process occurring during a particular computer clock cycle. Modern computers and computer microprocessors contain a number of pipelines, and each particular pipeline contains a number of stages.
A computer program contains numerous instructions, which tell the computer precisely what it must do to achieve the desired goal of the program. A computer runs a particular computer program by executing the instructions contained in the program. Theoretically, an instruction should complete execution in a number of computer cycles equal to the number of pipeline stages contained in the computer. If execution takes longer, there should be a reason for the extra cycles. The extra cycles might occur because of how the microprocessor was designed and how the microprocessor must operate. Alternatively, the extra cycles might occur because of how the computer program was designed and how the computer program operates. If the extra cycles are caused by the computer program's design, that design might be altered to eliminate or at least reduce the number of extra cycles. Such redesigning of the computer program might be done by the program designer, or might be done by a compiler or other computer program which translates a higher-level computer program into lower-level instructions that can be executed by the computer. Such fine-tuning of a computer program, so as to eliminate or reduce extra cycles, requires identifying the cause or causes producing those extra cycles.
During program execution in a modern microprocessor pipeline, instructions often suffer execution delays because of cache misses, branch mispredictions, memory access delays, and so forth, each of which results in extra cycles, sometimes also called delay cycles. A detailed understanding of which types of delays are producing large numbers of delay cycles would allow the programmer, or the compiler or other software tuning tool, to modify the program's instruction stream so as to reduce the number of delay cycles and, as a result, cause the program to execute faster. A performance monitor is intended to provide such understanding.
Known prior art performance monitors operate by simply counting the number of cache misses, branch mispredictions, and so forth. But not all such events contribute to a program's visible delay, because of the parallel and super-scalar execution capabilities of today's processor pipelines, the decoupling buffers used between multiple serial pipelines in today's processors to separate one pipeline from another, and so forth. For example, it is possible for a data cache miss to occur without causing a pipeline delay, if the data is not used until long after it actually becomes available. Consequently, simply counting the number of miss-events does not provide an accurate picture of where cycles are being wasted.
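By way of illustration only, the difference between counting miss-events and counting delay cycles can be sketched in a toy software model. The sketch below is not the apparatus described herein; the latencies, the `Load` record, and the function names are all hypothetical, and each load is treated independently against a schedule that assumes no stalls.

```python
from dataclasses import dataclass

# Assumed latencies for illustration only; real values depend on the processor.
HIT_LATENCY = 1     # cycles until data is available on a cache hit
MISS_LATENCY = 20   # cycles until data is available on a cache miss


@dataclass
class Load:
    issue_cycle: int  # cycle at which the load is issued
    use_cycle: int    # cycle at which the first consumer needs the data
    is_miss: bool     # True if the load misses the data cache


def count_misses(loads):
    """Event counting: every miss counts, whether or not it delays anything."""
    return sum(1 for ld in loads if ld.is_miss)


def count_stall_cycles(loads):
    """Delay counting: only cycles the consumer actually waits are counted."""
    total = 0
    for ld in loads:
        latency = MISS_LATENCY if ld.is_miss else HIT_LATENCY
        ready_cycle = ld.issue_cycle + latency
        # A stall occurs only if the data is needed before it is ready.
        total += max(0, ready_cycle - ld.use_cycle)
    return total


# Hypothetical trace: two misses, but only the first one delays its consumer.
loads = [
    Load(issue_cycle=0, use_cycle=1, is_miss=True),    # used immediately
    Load(issue_cycle=2, use_cycle=30, is_miss=True),   # used long after ready
    Load(issue_cycle=3, use_cycle=4, is_miss=False),   # ordinary hit
]

misses = count_misses(loads)
stalls = count_stall_cycles(loads)
```

In this hypothetical trace both misses are counted as events, yet all of the delay cycles come from the first load, because the second load's data is ready well before it is used.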
Some known prior art performance monitors include hardware counters that simply count certain events, such as data cache misses, in isolation, without regard to whether or not the event counted actually produces a pipeline delay. In other known prior art performance monitors, monitoring is done by software simulation. Such simulation is slow, and cannot be used effectively on present-day and future processor pipelines capable of parallel and super-scalar execution.
The present invention tracks actual delay cycles in real-time, at the full frequency of the microprocessor, and is designed to work with advanced microprocessor architectures that feature speculative execution, pipelining, super-scalar execution, and/or decoupling buffers. Moreover, the present invention does not slow down the execution of the computer program's instruction stream, because the invention operates in parallel to the main processor pipeline. When implemented in the CPU hardware, the present invention eliminates the need for software simulation, and gives accurate, real-time breakdowns of processor stall cycles. This information may then be used by software for tuning operating systems and application programs. Examples of such software include VTune®, a program commercially available from Intel Corporation, and profile-guided compilers.
Thus the present invention is directed to overcoming, or at least reducing, the effects of one or more of the problems mentioned above.