1. Field of the Invention
The present invention relates to an improved data processing system and, in particular, to a method and apparatus for optimizing performance in a data processing system. Still more particularly, the present invention provides a method and apparatus for monitoring execution of a software program through performance instrumentation.
2. Description of Related Art
Effective management and enhancement of data processing systems requires knowing how and when various system components are operating. In analyzing and enhancing performance of a data processing system and the applications executing within the data processing system, it is helpful to gather information about a data processing system as it is operating.
In order to minimize the undesired effects of instrumentation, the execution of instrumentation code is controlled in some manner. Typically, the performance instrumentation is toggled on or off through the use of one or more globally addressable variables that bracket sections of instrumentation code within the instrumented application. As performance instrumentation code is encountered, the global variable is tested to determine whether or not the instrumentation code should be executed. However, even when the instrumentation is turned off, the overhead of testing a global variable may add unacceptable delay to the performance of the application; the execution of the instructions for testing a global variable not only consumes CPU time but may cause the flushing of the instruction cache or instruction pipeline within the processor. Hence, production software is generally shipped without any installed performance instrumentation, and when a performance problem arises at a later time, a version of the production software that contains performance instrumentation must be built and installed before performance problems can be diagnosed, which is a cumbersome and time-consuming process.
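The conventional toggling approach described above can be sketched as follows. This is a minimal illustration, not code from any particular system: the names `g_instr_enabled`, `g_calls_traced`, and `process_order` are hypothetical, and the arithmetic stands in for real application work.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical global switch (name is illustrative, not from the source). */
static volatile bool g_instr_enabled = false;

/* Hypothetical statistic standing in for real instrumentation output. */
static unsigned long g_calls_traced = 0;

/* An instrumented routine; the computation is only a placeholder. */
static int process_order(int quantity, int unit_price)
{
    assert(quantity >= 0 && unit_price >= 0);

    /* This load and branch execute on every call, even when
     * instrumentation is turned off. */
    if (g_instr_enabled) {
        g_calls_traced++;
    }
    return quantity * unit_price;
}
```

Even with `g_instr_enabled` set to false, every call to `process_order` still pays for the load of the global variable and the conditional branch, which is the overhead at issue in the discussion above.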
The above issues are particularly important when instrumentation code is inserted into an operating system kernel. In a production environment, such as an on-line transaction processing system accepting orders over the Internet, it can be impossible to introduce a new version of the kernel, i.e. an instrumented version, without considerable testing to make sure that the instrumented version is as reliable as the current production version. This is due to the high cost of a kernel failure in a production environment. Similarly, in a production environment, efficiency of the kernel is extremely important, since the required processing rate of the system may be very high. For both of these reasons, it is advantageous to have the instrumentation installed in the original production kernel in such a way that its effect on performance of the kernel when it is installed, but disabled, is minimal. If the production system is unable to meet its performance goals at some subsequent point in time, then it is an acceptable risk to enable the instrumentation and the associated measurement overhead in order to fix the performance problem so that the system can meet its performance goals.
One type of system that is common in high-performance online transaction processing is a symmetric multiprocessor (SMP) system. An SMP system consists of several processors, each sharing access to a single memory store; data in the shared store is accessed by each of the processors. An SMP system has more processing power than a single-processor system for servicing user requests. However, adding processors is not without an associated cost: additional synchronization instructions must be executed by the processors in order to make sure that the data shared among the processors is manipulated in a consistent manner.
Performance of an SMP system is generally determined by two factors: instruction path length and synchronization overhead. Instruction path length is the number of instructions that the kernel must perform in order to accomplish a particular task. Typically, this is the same on a kernel designed for single-processor hardware as it is for SMP hardware, with the exception of the additional instructions required for synchronization. Instruction path length can be measured and optimized with instruction counting software, hardware, or other well-known tools, such as profilers.
The second factor that can limit the performance of a multiprocessor kernel is synchronization overhead. A common method of SMP synchronization at the software level is for all of the processors to follow a locking protocol when accessing or updating data shared between the processors. Typically, this means that a lock must be acquired before accessing a shared resource, such as a shared data structure, and then released after the access. Contention arises when more than one process in the system tries to acquire a lock at the same time. Correct execution requires that only one of the processes can succeed; the other processes must be delayed until the lock is released.
A delay can be implemented either by "spinning", i.e., executing a tight instruction loop that constantly tries to acquire the lock, or by suspending the task that is attempting to access the shared resource and dispatching that task's processor to run some other task in the system. Locks can thus be classified as either "spin" locks or "suspend" locks depending on how a conflicting lock access is delayed. Each class of lock has its advantages and disadvantages: acquiring or releasing a spin lock can be very inexpensive, but waiting for a lock in a spin loop wastes time that could be devoted to useful work. A task that is suspended while waiting for a lock does not consume processor time, but the cost of acquiring or releasing a suspend lock is much higher than it is for a spin lock. For these reasons, both spin locks and suspend locks are typically present in a multiprocessor operating system kernel.
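By way of illustration, a spin lock of the kind described above might be sketched with C11 atomics as shown below. The type `spin_lock_t` and the functions `spin_try_lock`, `spin_lock`, and `spin_unlock` are hypothetical names chosen for this sketch; a production kernel would typically use architecture-specific primitives and a pause or back-off instruction inside the loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool busy;   /* true while some processor holds the lock */
} spin_lock_t;

/* One acquisition attempt. The exchange returns the previous value,
 * so a previous value of false means this caller now owns the lock. */
static bool spin_try_lock(spin_lock_t *l)
{
    return !atomic_exchange_explicit(&l->busy, true, memory_order_acquire);
}

/* Delay by spinning: retry in a tight loop until the lock is obtained. */
static void spin_lock(spin_lock_t *l)
{
    while (!spin_try_lock(l)) {
        /* busy-wait; these cycles do no useful work */
    }
}

static void spin_unlock(spin_lock_t *l)
{
    /* Releasing a lock that is not held would be a caller bug. */
    assert(atomic_load_explicit(&l->busy, memory_order_relaxed));
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

The uncontended acquire and release each amount to a single atomic operation, which is why a spin lock can be inexpensive; the cost appears only in the loop of `spin_lock` when another processor already holds the lock. A suspend lock would replace that loop with a call into the scheduler, which is not shown here.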
Spin locks, however, are more primitive and are normally used to implement suspend locks. In either case, excessive contention for a lock can lead to poor system performance, either because too many tasks are suspended, or because too much time is wasted by spinning and waiting for a lock to become available.
Given the complexities of designing and implementing an SMP system, one should be able to instrument various operations of the operating system kernel, and given the importance of spin locks in an SMP system, one might desire to insert instrumentation code into a kernel spin lock in order to gather performance information related to the operation of spin locks. Due to the nature of kernel operations, one would especially desire to minimize the overhead associated with instrumentation code within the kernel, including spin locks.
Therefore, it would be advantageous to provide a method and a system for minimizing overhead effects caused by the execution of instrumentation code associated with kernel spin locks. It would be particularly advantageous to provide an efficient methodology that allows the instrumentation code to be present within a production-quality operating system kernel. Additionally, since only those locks for which contention occurs have a significant impact on performance, it would be advantageous to limit instrumentation so that instrumentation is only enabled for those locks for which contention occurs.
A method, a system, and a computer program product are presented for (1) controlling operating system kernel spin lock instrumentation for a spin lock in a data processing system that has a cache, in a manner that results in virtually no overhead when the instrumentation is installed but disabled, (2) restricting enablement of the instrumentation to only those locks for which contention occurs, and (3) dynamically detecting when contention occurs and enabling spin lock instrumentation for locks so detected. A lock flag represents a busy state for the spin lock; a first instrumentation flag is a global variable that represents an enablement state for the spin lock instrumentation. A second instrumentation flag, stored within the same cache line as the lock flag, is also maintained as an updatable copy of the first instrumentation flag. Prior to each acquisition of the spin lock, the second instrumentation flag is checked to see if it indicates that spin lock instrumentation is enabled for this particular spin lock. Although a reading of the lock flag may generate a cache miss, the lock flag is necessarily checked upon attempting to acquire the lock; the check of the second instrumentation flag cannot generate an additional cache miss because the second instrumentation flag is in the same cache line as the lock flag. At some point, the second instrumentation flag must be updated to reflect the enablement state that is stored within the first instrumentation flag; the update is delayed until a new lock request finds the spin lock in a busy state, a condition that induces entry into a spin loop that necessarily wastes execution cycles. Therefore, prior to entering the spin loop, the first instrumentation flag can be read without regard to a cache miss, and the second instrumentation flag is then updated to reflect the value of the first instrumentation flag.
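The two-flag arrangement summarized above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the claimed implementation: a 64-byte cache line is assumed, the only statistic gathered is a hypothetical acquisition count, and all identifiers (`g_lock_instr_enabled`, `instr_spin_lock_t`, `sync_instr_flag`, and so on) are invented for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* First instrumentation flag: global enablement state. */
static atomic_bool g_lock_instr_enabled = false;

typedef struct {
    /* Lock flag and second instrumentation flag share one cache line
     * (a 64-byte line is an assumption of this sketch). */
    _Alignas(64) atomic_bool busy;   /* lock flag */
    atomic_bool instr_local;         /* second instrumentation flag */
    unsigned long acquisitions;      /* hypothetical statistic */
} instr_spin_lock_t;

static bool try_acquire(instr_spin_lock_t *l)
{
    return !atomic_exchange_explicit(&l->busy, true, memory_order_acquire);
}

/* Copy the global flag into the per-lock flag. Called only when the
 * lock was found busy, so a cache miss on the global is hidden behind
 * cycles that would be spent spinning anyway. */
static void sync_instr_flag(instr_spin_lock_t *l)
{
    bool enabled = atomic_load_explicit(&g_lock_instr_enabled,
                                        memory_order_relaxed);
    atomic_store_explicit(&l->instr_local, enabled, memory_order_relaxed);
}

static void instr_spin_lock(instr_spin_lock_t *l)
{
    /* Same cache line as the lock flag: no additional miss. */
    bool instrumented = atomic_load_explicit(&l->instr_local,
                                             memory_order_relaxed);
    if (!try_acquire(l)) {
        sync_instr_flag(l);          /* contention detected */
        while (!try_acquire(l)) {
            /* spin */
        }
    }
    if (instrumented) {
        l->acquisitions++;           /* updated while holding the lock */
    }
}

static void instr_spin_unlock(instr_spin_lock_t *l)
{
    assert(atomic_load_explicit(&l->busy, memory_order_relaxed));
    atomic_store_explicit(&l->busy, false, memory_order_release);
}
```

In this sketch, setting `g_lock_instr_enabled` has no effect on a lock that is never contended: its `instr_local` flag stays false and the fast path never touches the global variable. Only after some processor finds the lock busy and calls `sync_instr_flag` do subsequent acquisitions of that particular lock take the instrumented path.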