The present invention relates generally to monitoring the performance of computer systems, and more particularly to gathering performance statistics by sampling, and analyzing the sampled statistics to guide system optimization.
Computer systems are getting more sophisticated and faster, yet software application performance is not keeping pace. For example, in a typical four-way issue processor only about one in twelve of the available issue slots is being put to good use. It is important to understand why the software execution flow cannot take full advantage of the increased computing power available for processing instructions. Similar issues arise in other devices in computer systems, including graphics controllers, memory systems, input/output controllers, and network interfaces: actual performance is often less than the peak potential performance, and it is important to understand why.
For processors, it is common to blame such problems on memory latencies. In fact, many software applications spend many cycles waiting for data transfers to complete. Modern memories are typically arranged in a multi-level hierarchy. There, the data flow is complex and difficult to determine, especially when multiple contexts are concurrently competing for the same memory resource, such as cache blocks. Other problems, such as branch mispredicts and cache misses, also waste processor cycles and consume memory bandwidth for needlessly referenced data.
Input/Output interfaces and network controllers of computer systems are also becoming more sophisticated. In many implementations, the interfaces and controllers include microprocessors and buffer memories whose dynamic behavior is becoming more difficult to measure and understand as complexity increases.
Independent of the general causes, system architects and hardware and software engineers need to know which transactions are stalling, what data are bottlenecked, and why, in order to improve the performance of modern computer systems.
Typically, this is done by generating a "profile" of the behavior of a computer system while it is operating. A profile is a record of performance data. Frequently, the profile is presented graphically or statistically so that performance bottlenecks can readily be identified.
Profiling can be done by instrumentation and simulation. With instrumentation, additional code is added to executing programs to monitor specific events. Instrumentation can be used only for processor pipelines, not for other devices. Simulation attempts to emulate the behavior of the entire system in an artificial environment rather than executing the program in the real system.
Each of these two methods has its drawbacks. Instrumentation perturbs the system's true behavior due to the added instructions and extra data references. In other words, on large-scale and complex systems instrumentation fails in two respects: the system is slowed down, and the performance data are unreliable or, at best, sketchy.
Simulation avoids perturbation and overhead. However, simulation only works for small, well-defined problems that can readily be modeled. It is extremely difficult, if not impossible, to simulate a large-scale system with thousands of users connected via fiber optic links to network controllers, accessing terabytes of data using dozens of multi-issue processors. Imagine modeling a Web search engine, such as Digital's Alta Vista, that responds to tens of millions of hits each day from all over the world, each hit perhaps offering up hundreds of Web pages as search results.
Hardware-implemented event sampling has been used to provide profile information for processors. Hardware sampling has a number of advantages over simulation and instrumentation: it does not require modifying software programs to measure their performance, and it works on complete systems with relatively low overhead. Indeed, recently it has been shown that low-overhead sampling-based profiling can be used to acquire detailed instruction-level information about pipeline stalls and their causes. However, many hardware sampling techniques lack flexibility because they are designed to measure specific events in isolation.
It is desired to provide a generalized method and apparatus for monitoring the performance of operating computer systems. The method should be able to monitor processors, memory sub-systems, I/O interfaces, graphics controllers, network controllers, or any other component that manipulates digital signals.
The monitoring should be able to sample arbitrary transactions and record relevant information about each. In contrast with event-based systems, arbitrary transaction monitoring should allow one to monitor not only discrete events, but also events in any combination. It should also be possible to relate the sampled events to individual transactions, such as instructions or memory references, or to the contexts in which the transactions arose. In addition, it should be possible to relate the sampled data to multiple concurrent transactions in order to gain a true understanding of the system. All this should be possible without perturbing the operation of the system, other than the time required to read the desired performance data.
Provided is a method and apparatus for monitoring a computer system including a plurality of functional units, such as processors, memories, I/O interfaces, and network controllers.
Transactions to be processed by a particular functional unit of the computer system are selected for monitoring. The transactions can be selected randomly, or concurrently. State information is stored while the selected transactions are processed by the functional unit. The state information is analyzed to guide optimization.
In one aspect, multiple different functional units can concurrently be sampled.
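By way of illustration only, the summarized method (select transactions, store state information while they are processed, analyze the stored state) might be modeled in software as follows. This is a minimal sketch and not the claimed apparatus: the names `Sample`, `TransactionSampler`, `summarize`, and `run_demo`, the recorded fields `latency` and `event`, the unit names, the sampling rate, and the synthetic workload are all assumptions introduced for the example.

```python
import random
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Sample:
    """State recorded for one selected transaction (fields are hypothetical)."""
    unit: str            # functional unit that processed the transaction
    transaction_id: int
    latency: int         # assumed state: cycles taken to complete
    event: str           # assumed state: outcome such as "hit" or "miss"


class TransactionSampler:
    """Randomly selects transactions of one functional unit and stores state."""

    def __init__(self, unit, sampling_rate, rng):
        self.unit = unit
        self.sampling_rate = sampling_rate
        self.rng = rng
        self.samples = []

    def observe(self, transaction_id, latency, event):
        # Random selection: each transaction is chosen with a fixed probability.
        if self.rng.random() < self.sampling_rate:
            self.samples.append(
                Sample(self.unit, transaction_id, latency, event))


def summarize(samples):
    """Analyze stored state: per-unit sample count, mean latency, event counts."""
    by_unit = defaultdict(list)
    for s in samples:
        by_unit[s.unit].append(s)
    report = {}
    for unit, group in by_unit.items():
        report[unit] = {
            "samples": len(group),
            "mean_latency": sum(s.latency for s in group) / len(group),
            "events": Counter(s.event for s in group),
        }
    return report


def run_demo(seed=0, transactions=10_000):
    """Sample two hypothetical functional units over a synthetic workload."""
    rng = random.Random(seed)
    samplers = [
        TransactionSampler("memory", 0.01, rng),
        TransactionSampler("io", 0.01, rng),
    ]
    for tid in range(transactions):
        for sampler in samplers:
            # Synthetic transaction: roughly 10% are slow "misses".
            miss = rng.random() < 0.1
            sampler.observe(tid, 100 if miss else 2, "miss" if miss else "hit")
    all_samples = [s for sampler in samplers for s in sampler.samples]
    return summarize(all_samples)


if __name__ == "__main__":
    for unit, stats in run_demo().items():
        print(unit, stats)
```

In this sketch each functional unit has its own sampler, so several units are sampled during the same run, and the cost of monitoring is limited to recording state for the small fraction of transactions that are selected.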