1. Field of the Invention
The present invention pertains to a method and apparatus for dynamic thread-level processor performance characterization. More particularly, the present invention pertains to a method and apparatus for tracking processor events, at the software thread level, during execution of one or more software programs.
2. Related Art
A processor, such as a Pentium.RTM. processor (Intel Corporation, Santa Clara, Calif.), is capable of executing a series of instructions in succession. The instructions to be executed are typically stored in a memory (such as Random Access Memory). Also, the processor can be coupled to a cache memory for storing data or instructions to be used by the processor. As the processor executes instructions, certain "events" can be monitored that typically characterize the performance of the processor. As used herein, the term "event" refers to actions taken by the processor, errors in the operation of the processor, and any other such information that indicates a performance characteristic of the processor or the like.
Performance monitoring is available in connection with the operation of the Pentium.RTM. processor. This performance monitoring is described, for example, in "Pentium.RTM. Processor Family Developer's Manual, Vol. 3: Architecture and Programming Manual" (1995, pp. 26-1 to 11), which is available from Intel Corporation. As seen in this reference, performance registers are provided in the processor for monitoring various parameters that contribute to the performance of the processor. For example, in several models of the Intel Pentium.RTM. processor, the following performance registers are provided on-chip: a 64-bit Time Stamp Counter (TSC), two programmable event counters (CTR0, CTR1), and a Control and Event Select Register (CESR). By placing a value in the CESR, one or both of the counters (CTR0, CTR1) are set up to count a desired event or to count clock signals while an event condition is present or absent. For example, by placing the appropriate data value in the CESR, the first counter, CTR0, can be set up to count the number of times a data read operation is performed by the processor (e.g., from its cache memory). Once CTR0 is set up to perform this task, CTR0 increments its internal count each time the processor performs a data read operation. Numerous events can be monitored using this system, such as data cache read/write misses, loading of a segment register, etc. As mentioned above, the duration of an event condition (or absence thereof) can also be monitored, such as the number of clock pulses counted while a bus cycle is in progress, the number of clocks stalled due to full write buffers, etc. Other examples can be found in the Developer's Manual referenced above. Similar performance monitoring features are available in other processors such as the Alpha.RTM. processor by Digital Equipment Corporation, Maynard, Mass. and the PowerPC.RTM. processor by Motorola Corporation, Schaumburg, Ill.
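As a rough illustration of how a single data value placed in the CESR can select what each counter counts, the following Python sketch packs an event code and a counting mode for CTR0 and for CTR1 into one register value. The field widths and positions, the event and mode codes, and the names `encode_cesr` and `decode_cesr` are assumptions made for this sketch only; the actual register layout and event codes are given in the Developer's Manual referenced above.

```python
# Illustrative model only: field positions, widths, and codes below are
# assumptions for this sketch, not values taken from the Developer's Manual.
EVENT_BITS = 6        # assumed width of an event-select field
MODE_BITS = 3         # assumed width of a counter-control field
CTR1_SHIFT = 16       # assumed offset of the CTR1 fields within the CESR

DATA_READ = 0x00      # hypothetical event code: data read operation
DATA_WRITE = 0x01     # hypothetical event code: data write operation

MODE_COUNT_EVENTS = 0b011   # hypothetical mode: count occurrences of the event
MODE_COUNT_CLOCKS = 0b111   # hypothetical mode: count clocks while condition holds


def _pack_field(event, mode):
    """Pack one counter's event code and counting mode into a small field."""
    if not 0 <= event < (1 << EVENT_BITS):
        raise ValueError("event code out of range")
    if not 0 <= mode < (1 << MODE_BITS):
        raise ValueError("mode out of range")
    return event | (mode << EVENT_BITS)


def encode_cesr(event0, mode0, event1, mode1):
    """Build one register value that configures both CTR0 and CTR1."""
    return _pack_field(event0, mode0) | (_pack_field(event1, mode1) << CTR1_SHIFT)


def decode_cesr(value):
    """Recover ((event0, mode0), (event1, mode1)) from a register value."""
    def unpack(field):
        event = field & ((1 << EVENT_BITS) - 1)
        mode = (field >> EVENT_BITS) & ((1 << MODE_BITS) - 1)
        return event, mode
    return unpack(value), unpack(value >> CTR1_SHIFT)


# CTR0 counts data reads; CTR1 counts clocks while a data-write condition holds.
cesr = encode_cesr(DATA_READ, MODE_COUNT_EVENTS, DATA_WRITE, MODE_COUNT_CLOCKS)
```

Decoding the value with `decode_cesr(cesr)` returns the two (event, mode) pairs that were packed, which is the kind of translation a monitoring tool would perform before the value is handed to the processor.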
A user accesses the CESR, counters (CTR0, CTR1) and TSC via execution of a device driver. For example, a user's request to load a value into the CESR includes the following steps: 1. a performance monitoring application allows the user to input a desired event monitoring scenario; 2. the request is translated into a value for the CESR; 3. this value is then sent by the performance monitoring application to the device driver via an operating system (OS) service call; 4. the performance device driver loads the CESR with the appropriate value. The performance device driver can then retrieve values stored in the CESR and counters (CTR0, CTR1) and any time-stamp information from the TSC and forward the data to the performance monitor application for display to the user.
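The four-step request path described above can be modeled with a short Python sketch. It is a toy simulation, not driver code: the classes `FakeProcessor`, `PerfDriver`, `OperatingSystem`, and `PerfMonitorApp`, the scenario table, and the register value `0x0C0` are all hypothetical and stand in for components whose real interfaces are not given here.

```python
# Toy model of the request path; every class, method, and value is hypothetical.

class FakeProcessor:
    """Stand-in for the on-chip performance registers."""
    def __init__(self):
        self.cesr = 0
        self.ctr0 = 0
        self.ctr1 = 0
        self.tsc = 0


class PerfDriver:
    """Stand-in for the performance device driver."""
    def __init__(self, processor):
        self._cpu = processor

    def load_cesr(self, value):
        # Step 4: the driver loads the CESR with the requested value.
        self._cpu.cesr = value

    def read_registers(self):
        # The driver retrieves register contents to forward to the application.
        cpu = self._cpu
        return {"CESR": cpu.cesr, "CTR0": cpu.ctr0, "CTR1": cpu.ctr1, "TSC": cpu.tsc}


class OperatingSystem:
    """Stand-in for the OS service call that routes requests to the driver."""
    def __init__(self, driver):
        self._handlers = {
            "load_cesr": driver.load_cesr,
            "read_registers": driver.read_registers,
        }

    def service_call(self, request, *args):
        # Step 3: the application reaches the driver only through the OS.
        return self._handlers[request](*args)


class PerfMonitorApp:
    """Stand-in for the performance monitoring application."""
    SCENARIOS = {"data reads": 0x0C0}   # hypothetical scenario-to-CESR table

    def __init__(self, operating_system):
        self._os = operating_system

    def request(self, scenario):
        value = self.SCENARIOS[scenario]            # steps 1-2: input, translate
        self._os.service_call("load_cesr", value)   # step 3: OS service call

    def report(self):
        # Data flows back from the driver for display to the user.
        return self._os.service_call("read_registers")


cpu = FakeProcessor()
app = PerfMonitorApp(OperatingSystem(PerfDriver(cpu)))
app.request("data reads")
snapshot = app.report()
```

The layering mirrors the text: the application never touches the registers directly, and both the load and the later read-back pass through the operating system to the driver.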
The performance monitoring system described above is useful to programmers who write source code. For example, the performance monitoring system can detect events which tend to indicate inefficiencies in the overall code design. Processor designers and architects can also benefit, since the system allows them to observe the properties of the software that will execute on the processor and, therefore, to optimize their hardware design to deliver the best performance for that software. Designers of integrated circuit (IC) chips to be used with the processor can also benefit from this system in a similar manner. The Windows 95.RTM. Operating System and Windows NT.RTM. Operating System (Microsoft Corporation, Redmond, Wash.) are examples of operating systems that support the performance monitoring system described above.
In a typical processor system, one or more applications are running (i.e., being executed by the processor). As is known in the art, the code of an application (e.g., a word-processing application, a video conferencing application, etc.) can be divided into a plurality of processes, and each process can be divided into a plurality of threads. Thus, a thread can be a series of instructions that are executed by the processor to achieve a given task (e.g., a subroutine) and is identified by a 32-bit code in the Windows 95.RTM. and Windows NT.RTM. operating systems. A processor often switches between threads of a process and between processes of the application. In a so-called multi-tasking environment, the processor also switches between two or more applications.
A drawback of the aforementioned performance monitoring system is that it primarily focuses on the operation of the processor without consideration as to which thread, process, or application is being executed. A feature that exists in the Windows NT.RTM. operating system, but not in the Windows 95.RTM. operating system, allows a user to track how much time, as a percentage, the processor spends executing instructions in a thread compared to the system as a whole. Thus, for a given amount of time during which the processor is executing instructions, the user is able to see the percentage of that time that is taken by a given thread. With the operating systems and performance monitoring systems available in the art, however, a user is unable to determine, at the thread level, how many of the aforementioned monitored processor-level events have occurred.
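The per-thread time percentage described above amounts to simple arithmetic, sketched here in Python. The input format (a list of `(thread_id, clocks)` execution slices), the thread identifiers, and the function name `thread_time_percentages` are hypothetical; the sketch does not reflect how the Windows NT.RTM. operating system computes or exposes this figure.

```python
from collections import defaultdict


def thread_time_percentages(slices):
    """Return each thread's share of total execution time, as a percentage.

    `slices` is a hypothetical record of (thread_id, clocks) pairs, one per
    interval during which the processor executed that thread.
    """
    totals = defaultdict(int)
    for thread_id, clocks in slices:
        totals[thread_id] += clocks

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}          # nothing executed, so there is nothing to apportion
    return {tid: 100.0 * clocks / grand_total for tid, clocks in totals.items()}


# Two made-up threads, each scheduled twice.
slices = [(0x1A, 300), (0x2B, 100), (0x1A, 500), (0x2B, 100)]
shares = thread_time_percentages(slices)
```

The result apportions time only. A single processor-wide event count, such as a total of data cache misses, carries no thread identifier and so cannot be divided up in the same way.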
As an example, in a multimedia application that combines both audio processes and video processes, the user could use the performance monitoring system described above to determine that a greater than normal number of data cache read/write misses have occurred during execution of the application. Using the techniques known in the art alone, however, the user would not be able to determine whether the execution of threads in the audio processes or in the video processes was contributing to the number of data cache read/write misses. If the particular event that is being monitored is adversely affecting the operation of the application(s), then it would be advantageous to determine in which process or thread the event occurs. For example, knowing such information would allow a programmer to redraft that section of the code so as to reduce the number of the monitored events during execution of the program. Without such information, the programmer is left to attempt to redraft multiple threads and processes to achieve the same result. Accordingly, there is a need for a method and apparatus that allows a user to dynamically monitor thread-level processor performance.