1. Field of the Invention
This invention relates to computer systems, and more particularly to monitoring the performance of a microprocessor.
2. Description of the Relevant Art
Most computer systems include a microprocessor which functions as a central processing unit (CPU). Modern microprocessors, including the Intel Pentium™ processor, have hardware dedicated to measuring and monitoring various parameters which contribute to the performance of the microprocessor. In the case of the Pentium™ processor, the dedicated hardware includes several model specific registers (MSRs): a 64-bit time stamp counter (TSC) incremented every clock cycle, a control and event select register (CESR), and two 40-bit performance monitor counters (CTRs). The TSC, CESR, and the two CTRs are addressable registers, and their contents may be read or changed by software instructions. Each CTR may be individually programmed, via values stored within the CESR, to count the total number (or duration in clock cycles) of specific “events” occurring within the microprocessor during operation. Such events include memory accesses (e.g., data/code reads and data writes), data/code cache misses, pipeline flushes, and locked bus cycles. The information provided by the dedicated hardware may be used to improve the overall performance of the computer system by “tuning” the memory system or software programs generated by compilers.
Several problems limit the usefulness of the existing performance monitoring hardware. First, there are only two CTRs, so a maximum of two events may be monitored at any given time. The CTRs are programmed by values stored within the CESR, and there is a fixed number of events to choose from. For example, there are 38 documented events from which to choose for the Pentium™ processor. In order to obtain counts for all events which may be monitored, it is necessary to repeat a test program 19 times while gathering counts for two events during each execution of the test program.
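The run count cited above follows from dividing the number of events by the number of counters and rounding up. The following is a minimal sketch of that arithmetic; the function name `runs_needed` is hypothetical and is used only for illustration.

```python
def runs_needed(num_events: int, num_counters: int) -> int:
    """Return how many executions of a test program are needed so that
    every event is counted once, given a fixed number of counters."""
    if num_events < 0 or num_counters <= 0:
        raise ValueError("num_events must be >= 0 and num_counters must be > 0")
    # Ceiling division using integer arithmetic only.
    return -(-num_events // num_counters)


# 38 documented events, two counters available per execution.
pentium_runs = runs_needed(38, 2)
```

With 38 events and two counters the result is 19, which matches the figure given above. Hardware able to monitor more events at once would lower the count proportionally.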
Second, and most importantly, there is no way to correlate the occurrence of an event with the time at which the event occurred. In cases where several factors affect a given aspect of system performance, the total number of events may indicate the presence or absence of a problem, but may not be particularly useful in determining the best solution to a problem. In some cases, a graph of the frequency distribution of an event is much more useful than the total number of events which occurred during execution of a test program.
A histogram is a bar graph of a frequency distribution in which the heights of the bars represent the total number of events occurring within a corresponding time interval. Forming a histogram involves dividing a time period of interest into time intervals of equal length, and counting the total number of events occurring within each time interval. As a practical matter, summing numbers of events occurring within time intervals reduces the data storage requirements of a data acquisition system performing the counting operation while still providing useful event frequency information.
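The histogram-forming procedure just described may be sketched in software as follows. This is an illustrative model only; it assumes each event is tagged with the clock cycle at which it occurred, and the names `build_histogram`, `event_cycles`, `period_cycles`, and `num_periods` are hypothetical.

```python
def build_histogram(event_cycles, period_cycles, num_periods):
    """Count events per time interval of equal length.

    event_cycles  -- iterable of clock-cycle numbers at which an event occurred
    period_cycles -- length of each time interval, in clock cycles
    num_periods   -- number of intervals in the period of interest
    """
    counts = [0] * num_periods
    for cycle in event_cycles:
        index = cycle // period_cycles  # interval in which the event falls
        if 0 <= index < num_periods:    # ignore events outside the period of interest
            counts[index] += 1
    return counts


# Six events, intervals of 10 cycles, five intervals in total.
example_counts = build_histogram([3, 7, 12, 15, 18, 41], period_cycles=10, num_periods=5)
```

Only one count per interval is stored, rather than one record per event, which is the storage saving noted above.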
A good example illustrating the utility of a graph of the frequency distribution of an event is cache misses occurring during execution of a test program. FIGS. 1 and 2 will now be used to illustrate how such a graph may suggest which of several factors is the most likely cause of a problem. As described above, a desired data acquisition time is divided into time intervals (i.e., histogram time periods) of length t, and the total number of cache misses occurring within each histogram time period t is counted and graphed. FIG. 1 is a histogram showing the frequency of cache misses occurring within a first memory system during execution of the test program. In the first memory system, the frequency of cache misses follows a trend. The frequency of cache misses is initially high as the empty cache is filled, decreases relatively quickly at an initial rate 10, then continues to decrease as more needed instructions are located within the cache. Eventually a lowest number of cache misses “M1” is achieved by the first memory system. Sudden increases or “spikes” (e.g., spike 12) in the frequency of cache misses occur when new sections of program code are loaded into memory and executed.
FIG. 2 is a histogram showing the frequency of cache misses occurring within a second memory system during execution of the same test program. As in the first memory system, the frequency of cache misses within the second memory system is initially high as the empty cache is filled, and decreases with time as more needed instructions are found within the cache. The initial rate of the decrease 14 is not as great as that of the first memory system, however, and the lowest number of cache misses M2 achieved by the second memory system is substantially greater than M1. Spike 16 corresponds to spike 12, and occurs as the same section of program code is loaded into memory and executed. Spike 16 occurs later in time than spike 12 as the second memory system is less efficient than the first.
Key factors which affect the frequency of cache misses within a memory system include cache size and the technique used to select information stored within the cache for replacement by “newer” data (i.e., the cache replacement algorithm). FIG. 1 indicates the cache replacement algorithm of the first memory system is adequate. The best way to reduce the frequency of cache misses and thereby improve the performance of the first memory system is to increase the size of the cache. On the other hand, FIG. 2 indicates the cache replacement algorithm of the second memory system is probably not working well. Increasing the size of the cache would not be the best way to improve the performance of the second memory system; improving the cache replacement algorithm would probably be more effective.
It would be beneficial to have a microprocessor which includes performance monitoring hardware allowing more than two events to be monitored at any given time and correlating the occurrence of an event with the time at which the event occurred. Such a microprocessor would reduce the number of times a test program must be executed in order to gather performance monitoring information. Such a microprocessor would also allow graphs of numbers of events versus time to be created, greatly enhancing the ability to increase the overall performance of the computer system by “tuning” the memory system or instruction sequences generated by compilers.
The problems outlined above are in large part solved by an apparatus and method for monitoring the performance of a microprocessor. The apparatus includes performance monitoring hardware incorporated within the microprocessor. The performance monitoring hardware includes a memory unit for storing performance data relating to operations performed by the microprocessor. The memory unit includes multiple memory locations, each memory location being accessed by a unique set of address signals. The performance monitoring hardware further includes circuitry coupled to the memory unit for producing address signals. The apparatus and method center around gathering performance data in order to generate event histograms.
In one embodiment, the performance monitoring hardware further includes an event select register array, a control register, a bus monitor unit, circuitry coupled to the memory unit for producing a set of high order (i.e., most significant) address signals, and a control unit. The event select register array includes n event select registers, where n≥1, and preferably n≥2. Each event select register may contain a binary code corresponding to a selected event. The event select register array allows the performance monitoring hardware to monitor up to n selected events within the microprocessor.
The control register enables and disables a performance data acquisition mode of the performance monitoring hardware. The control register also includes an event select register field which determines the specific event select register accessed within the event select register array, and a memory address field which determines which memory location within the memory unit is accessed during retrieval of performance data stored within the memory unit.
The bus monitor unit is coupled to internal address, data, and control signal lines within the microprocessor, the event select register array, the control register, and the control unit. The bus monitor unit is also operably coupled to the memory unit. The bus monitor unit detects the occurrence of each of the up to n selected events stored within the event select register array. The occurrence of a selected event is determined by signals driven upon the internal address, data, and control signal lines of the microprocessor. Upon detecting one or more of these selected events, the bus monitor unit produces an event signal and a low order (i.e., least significant) address signals, where a is an integer and a≥log2(n).
The circuitry coupled to the memory unit for producing the set of high order address signals includes a time stamp counter, a histogram time base register, a time base comparator, and a histogram counter. The time stamp counter is a counter configurable to increment every cycle of a processor clock signal. The histogram time base register is used to store the number of processor clock cycles within each histogram time period. The time base comparator is coupled to the time stamp counter and the histogram time base register. The time base comparator divides the contents of the time stamp counter by the value stored within the histogram time base register and produces a clock pulse when the remainder of the division is zero. The histogram counter is a counter which receives the clock pulses produced by the time base comparator and increments upon each received clock pulse. The contents of the histogram counter form the set of high order address signals.
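The interaction of the time stamp counter, time base comparator, and histogram counter may be modeled cycle by cycle as in the sketch below. It is a software approximation under stated assumptions, not a description of the circuitry: both counters are assumed to start at zero, and the comparison is assumed to take place after the time stamp counter increments. The name `simulate_histogram_counter` is hypothetical.

```python
def simulate_histogram_counter(total_cycles, time_base):
    """Return the histogram counter value seen during each clock cycle.

    total_cycles -- number of processor clock cycles to simulate
    time_base    -- contents of the histogram time base register
                    (clock cycles per histogram time period)
    """
    if time_base <= 0:
        raise ValueError("time_base must be a positive number of cycles")
    tsc = 0                # time stamp counter (assumed to start at zero)
    histogram_counter = 0  # forms the high order address signals
    trace = []
    for _ in range(total_cycles):
        trace.append(histogram_counter)  # value in effect during this cycle
        tsc += 1                         # increments every clock cycle
        if tsc % time_base == 0:         # comparator: remainder of division is zero
            histogram_counter += 1       # clock pulse advances the histogram counter
    return trace


# Eight clock cycles with a histogram time period of three cycles.
example_trace = simulate_histogram_counter(total_cycles=8, time_base=3)
```

In this model the histogram counter holds the same value for `time_base` consecutive cycles, so each value identifies one histogram time period.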
The control unit is coupled between the bus monitor unit and the memory unit. The control unit produces control signals in response to the event signal which result in the incrementing of a value stored within a memory location within the memory unit. The memory location is accessed by concatenating the high order address signals and the low order address signals.
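The concatenation of high order and low order address signals may be illustrated with the following sketch. It assumes the smallest width a satisfying a≥log2(n) is used for the low order signals and that the low order signals encode the index of the event select register; the function names are hypothetical.

```python
def low_order_width(num_events):
    """Smallest integer a with a >= log2(num_events), for num_events >= 1."""
    if num_events < 1:
        raise ValueError("num_events must be at least 1")
    return (num_events - 1).bit_length()


def memory_address(histogram_count, event_index, num_events):
    """Concatenate the high order signals (histogram counter contents) with the
    low order signals (assumed here to be the event select register index)."""
    if not 0 <= event_index < num_events:
        raise ValueError("event_index out of range")
    a = low_order_width(num_events)
    return (histogram_count << a) | event_index


# Four selected events, so two low order address bits.
addr_first_period_event1 = memory_address(0, 1, num_events=4)   # binary 00 01
addr_fourth_period_event2 = memory_address(3, 2, num_events=4)  # binary 11 10
```

Because the event index occupies the least significant bits, all locations for a given histogram time period are contiguous in memory.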
A performance data acquisition period is divided into multiple histogram time periods of equal length. The high order address signals produced by the histogram counter partition the memory unit into sections. Each section is associated with a given histogram time period and contains at least n memory locations, where n is the number of event select registers within the event select register array and the maximum number of selected events. Each section is used to store performance data relating to the selected events which occur during the corresponding histogram time period.
Each occurrence of one of the n selected events during a given histogram time period results in the incrementing of the contents of a corresponding memory location within the corresponding section of the memory unit. For example, the occurrence of an event identified within event select register 0 (i.e., event 0) results in the incrementing of the contents of memory location xx00h. During the first histogram time period, the high order address signals produced by the histogram counter are 00 . . . 0, and the contents of memory location 00 . . . 0000000 are incremented. Similarly, the occurrence of event 1 during the first histogram period results in the incrementing of the contents of memory location 00 . . . 0000001.
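The sectioned memory layout and the per-event incrementing described above may be modeled as follows. The sketch assumes events arrive as (cycle, event index) pairs and that each section holds the smallest power-of-two number of locations able to hold n events; `record_events` and `event_series` are hypothetical names.

```python
def record_events(events, time_base, num_events, num_periods):
    """Model the memory unit: one section per histogram time period,
    one location per selected event within each section.

    events -- iterable of (cycle, event_index) pairs
    """
    a = (num_events - 1).bit_length()   # low order address width
    memory = [0] * (num_periods << a)   # num_periods sections of 2**a locations
    for cycle, event_index in events:
        period = cycle // time_base     # models the histogram counter contents
        if 0 <= period < num_periods and 0 <= event_index < num_events:
            memory[(period << a) | event_index] += 1
    return memory


def event_series(memory, event_index, num_events):
    """Extract the histogram of one event: its count in every section."""
    stride = 1 << (num_events - 1).bit_length()
    return memory[event_index::stride]


# Two selected events, histogram time period of 10 cycles, three periods.
observed = [(0, 0), (5, 1), (9, 0), (12, 1), (25, 0)]
memory = record_events(observed, time_base=10, num_events=2, num_periods=3)
event0_histogram = event_series(memory, 0, num_events=2)
event1_histogram = event_series(memory, 1, num_events=2)
```

Reading every location with the same low order address yields the count of one event in each successive histogram time period, which is the data needed to plot graphs such as those of FIGS. 1 and 2.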
A computer system in accordance with the present invention includes the microprocessor described above. The microprocessor functions as a central processing unit (CPU), and includes performance monitoring hardware having a memory unit for storing performance data. In addition to the microprocessor, the computer system may include a system bus adapted for coupling to one or more peripheral devices. Chip set logic coupled between the microprocessor and the system bus may function as an interface between the microprocessor and the system bus.
A method for monitoring the performance of the microprocessor of the computer system described above includes enabling the performance data acquisition mode of the performance monitoring hardware, then causing the microprocessor to execute a set of instructions. During instruction execution, performance data is stored within the memory unit of the performance monitoring hardware. Following execution of the set of instructions, the performance data acquisition mode is disabled, and the data stored within the memory unit is retrieved using circuitry for this purpose within the performance monitoring hardware.
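The sequence of steps in the method (enable, execute, disable, retrieve) may be sketched with a small software model. The class `MonitorModel` and the function `monitor_run` are hypothetical stand-ins for the performance monitoring hardware and are not part of any real interface; the model assumes only that events are counted while the acquisition mode is enabled.

```python
class MonitorModel:
    """Toy stand-in for the performance monitoring hardware."""

    def __init__(self, num_locations):
        self.enabled = False
        self.memory = [0] * num_locations

    def enable(self):
        self.enabled = True

    def disable(self):
        self.enabled = False

    def on_event(self, address):
        # Events are counted only while the acquisition mode is enabled.
        if self.enabled:
            self.memory[address] += 1

    def read(self, address):
        # Models selecting a memory location for retrieval of performance data.
        return self.memory[address]


def monitor_run(model, workload):
    """Enable acquisition, run the workload, disable acquisition,
    then retrieve every memory location."""
    model.enable()
    try:
        workload(model)
    finally:
        model.disable()
    return [model.read(address) for address in range(len(model.memory))]


def sample_workload(model):
    # Stand-in for a set of instructions that triggers three events.
    model.on_event(1)
    model.on_event(1)
    model.on_event(3)


monitor = MonitorModel(num_locations=4)
monitor.on_event(0)  # ignored: acquisition mode not yet enabled
results = monitor_run(monitor, sample_workload)
```

Disabling the acquisition mode before retrieval keeps the retrieval activity itself from altering the stored counts in this model.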