1. Technical Field
The present invention relates to an improved data processing system. More specifically, the present invention is directed to a method, apparatus, and computer program product for accurately measuring the useful processor capacity allocated to each thread in a dual-threaded simultaneously multi-threaded processor, where instructions from multiple threads may be dispatched on the same cycle, within a data processing system which allows dynamic variation of processor frequency to optimize performance and temperature.
2. Description of Related Art
A symmetric multiprocessing (SMP) data processing system has multiple processors that are symmetric such that each processor has the same processing speed and latency. An SMP system may be logically partitioned to have one or more operating systems that divide the work into tasks that are distributed evenly among the various processors by dispatching programs to each processor.
Modern micro-processors are usually superscalar, which means a single processor can decode, dispatch, and execute multiple instructions each processor cycle. These modern processors may also support simultaneous multi-threading (SMT), which means each processor can execute more than one software program (thread) concurrently. An SMT processor typically has the ability to favor one thread over another when both threads are running on the same processor. Each thread is assigned a hardware-level priority by the operating system, or by the hypervisor in a logically partitioned environment. Within each processor, the thread that has the highest priority is granted more decode units and more dispatch cycles, thereby making more resources available to that thread. Therefore, the higher priority thread will use more of the processor's resources and, as a result, do more work than the lower priority sibling threads on the same processor.
One use for hardware thread priority has been the voluntary lowering of thread priority when the importance of the work was known to be less than that of the work being done by the other threads. For example, if one thread being executed by a processor was idle, the operating system might voluntarily lower its priority to permit the other threads being executed by the same processor access to more of the processor's resources.
When a high priority thread is not able to use all the allocated resources on a given processor cycle due to a cache miss, lower priority threads are permitted to use whatever resources are not consumed by the higher priority thread. Different programs have different memory and cache access characteristics, and thus encounter different stalls due to cache misses. Thus, the total processing capacity actually utilized by each thread does not necessarily track the assigned thread priority. For example, if a high priority thread encounters frequent cache misses, a lower priority thread will be allowed to utilize more of the processor resource than if the high priority thread does not encounter cache misses.
Time is important for managing the available processing resources. The number of active tasks or programs in a large SMP system may exceed the total number of hardware threads across all the processors in the system, which means not all of the programs can execute at the same time. The operating system (and hypervisor in a logically partitioned system) allocates portions of time to different sets of tasks or programs, with different durations (time slices) allocated to different tasks depending on the priority and resources required for each task.
A Time Base (TB) Register is used to represent time in a processor. The TB is a free-running 64-bit register that increments at a constant rate so that its value can be converted to time. The TB registers are synchronized across all processors in an SMP system so that all processors in the system have the same representation of time. The TB is a shared resource across all threads, and the constant rate at which it increments is known to software executing on each thread. Software calculates time by multiplying the TB value by the known time per increment, and adding the result to a known time offset.
In prior designs, the processor clock frequency was a known constant, so the TB could simply increment every ‘n’ processor cycles, where ‘n’ is set depending on the desired granularity of time increment. For example, at a processor frequency of 1.0 GHz, with n=8, a TB value of ‘000000001234ABCD’x represents 2.4435 seconds.
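The conversion in the example above can be checked with a short sketch. The function name `tb_to_seconds` and its parameter names are illustrative assumptions, not part of any described interface; the sketch assumes a fixed processor frequency, as in the prior designs.

```python
def tb_to_seconds(tb_value, cpu_hz, cycles_per_tick, base_offset_s=0.0):
    """Convert a Time Base count to seconds, assuming a fixed clock frequency.

    cycles_per_tick is the 'n' of the text: the TB increments once every
    n processor cycles. base_offset_s is the known time offset.
    """
    tick_period_s = cycles_per_tick / cpu_hz  # seconds per TB increment
    return base_offset_s + tb_value * tick_period_s


# The example from the text: 1.0 GHz, n = 8, TB = '000000001234ABCD'x
elapsed = tb_to_seconds(0x000000001234ABCD, cpu_hz=1.0e9, cycles_per_tick=8)
print(f"{elapsed:.4f} s")  # prints "2.4435 s"
```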
In order to manage time slices and thread priority most effectively, the operating system must know the portion of a processor's resource capacity that is actually consumed by each thread. A Processor Utilization Resource Register (PURR) was introduced to measure the portion of processor capacity consumed by each thread. Each thread in the processor has a PURR. Logic in the processor calculates the fraction of instruction dispatch capacity allocated to each thread, which is an adequate measure of overall consumed processor capacity, and accumulates this “charge” to the PURR for each thread.
In prior designs, only instructions from the same thread could be dispatched on the same processor cycle, and thread priority was managed by allocating more or fewer dispatch cycles to each thread. The processor utilization “charge” for each thread was accumulated to that thread's PURR value, and was implemented by simply counting the processor cycles in which instructions from that thread were dispatched. On cycles where no instructions were dispatched, for example when two threads encounter cache misses at the same time, the cycle was charged to whichever thread had most recently had a dispatch cycle.
The TB is used in conjunction with the PURR. In the prior art, since the TB counted processor cycles, and the PURR registers across all threads counted dispatch cycles for each thread, the sum of the PURR values across all threads was always equal to the TB value multiplied by the number of cycles between increments of the TB. The fixed processor cycle relationship between the TB and PURR allowed software to calculate the accumulated charge for its thread from the TB and the PURR for its own thread, without requiring the PURR value for the other thread. This is a requirement because software executing on one thread does not have access to another thread's dedicated resources.
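The prior-art charging rule described above can be modeled in a few lines. This is a minimal sketch, not the hardware logic: the trace format, the function name `accumulate_purr`, and the `initial_owner` parameter (which thread is charged if the very first cycles are idle) are assumptions made for illustration.

```python
def accumulate_purr(dispatch_trace, num_threads=2, initial_owner=0):
    """Model of prior-art PURR accumulation.

    dispatch_trace holds one entry per processor cycle: the id of the
    thread that dispatched on that cycle, or None if nothing dispatched.
    An idle cycle is charged to the thread that most recently dispatched.
    """
    purr = [0] * num_threads
    last_dispatcher = initial_owner  # assumption: owner of leading idle cycles
    for thread_id in dispatch_trace:
        if thread_id is not None:
            last_dispatcher = thread_id
        purr[last_dispatcher] += 1  # exactly one cycle charged per cycle
    return purr


# Thread 1 dispatches just before two idle cycles, so it is charged for them.
trace = [0, 0, 1, None, None, 0, 1, 1]
purr = accumulate_purr(trace)
print(purr)  # prints [3, 5]
```

Because exactly one thread is charged on every cycle, the PURR values always sum to the number of elapsed cycles.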
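The fixed relationship between the TB and the PURR can be illustrated as follows. This is a sketch under the prior-art assumption that the TB increments once every `cycles_per_tick` processor cycles; the function name and the sample values are hypothetical.

```python
def thread_share(purr_delta, tb_delta, cycles_per_tick):
    """Fraction of processor capacity charged to the calling thread.

    purr_delta: change in this thread's own PURR over an interval.
    tb_delta:   change in the shared TB over the same interval.
    Relies on sum(PURR over all threads) == TB * cycles_per_tick, so the
    sibling thread's PURR is never read.
    """
    total_cycles = tb_delta * cycles_per_tick
    if total_cycles == 0:
        return 0.0
    return purr_delta / total_cycles


# Hypothetical interval: the TB advanced 1000 ticks with n = 8 (8000 cycles),
# and this thread's PURR advanced by 6000 cycles.
own_share = thread_share(purr_delta=6000, tb_delta=1000, cycles_per_tick=8)
sibling_share = 1.0 - own_share  # valid only for a dual-threaded processor
print(own_share, sibling_share)  # prints 0.75 0.25
```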
One significant limitation in known SMT systems is that they do not support dynamic variation of processor frequency. Today's processors must operate in an environment where the frequency is adjusted dynamically during steady-state operation in order to optimize power consumption and operating temperature. Wall-clock time cannot be calculated from a cycle counter if the cycle period is not known or fixed. Hence, the TB register can no longer be implemented as a cycle counter. Without the fixed cycle relationship between TB and PURR, software cannot determine the portion of processor utilization from only the TB and the PURR for its own thread.
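The effect of frequency variation on a cycle-counting TB can be seen in a small numeric sketch. The segment values and function names below are invented for illustration only; they compare the true elapsed time with the time software would compute if it assumed a constant nominal frequency.

```python
def true_elapsed_s(segments):
    """segments: list of (cycles, frequency_hz) pairs run back to back."""
    return sum(cycles / hz for cycles, hz in segments)


def assumed_elapsed_s(segments, nominal_hz):
    """Time software would compute from a cycle count at a fixed frequency."""
    total_cycles = sum(cycles for cycles, _ in segments)
    return total_cycles / nominal_hz


# Hypothetical run: 1e9 cycles at 1.0 GHz, then 1e9 cycles throttled to 0.5 GHz.
segments = [(1_000_000_000, 1.0e9), (1_000_000_000, 0.5e9)]
actual = true_elapsed_s(segments)
assumed = assumed_elapsed_s(segments, nominal_hz=1.0e9)
print(actual, assumed)  # prints 3.0 2.0
```

In this example a cycle-based TB would under-report elapsed time by one second, and the error grows with every interval spent away from the nominal frequency.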
A possible alternative solution involves sharing thread-specific resources across threads, which is architecturally undesirable and requires a different algorithm for calculating the utilization, which breaks downward software compatibility. Another undesirable solution is to modify the “known” cycle period and base time offset every time the frequency is changed. However, it is difficult to precisely control this modification process, and the resulting errors will accumulate into a growing inaccuracy. This modification process will also impact performance and the ability to optimize the operating frequency.
Another limitation of the prior art is that determining the utilization charge by simply counting dispatch cycles for each thread does not allow instructions from multiple threads to be dispatched on the same processor cycle. Overall processor throughput is improved if instructions from lower priority threads can be dispatched during the same cycle as a higher priority thread if the higher priority thread is not able to utilize all the available processor resources that cycle. In order to accurately represent the portion of a processor's resources allocated to a thread, where instructions from multiple threads are dispatched on the same cycle, the calculation of the charge for each thread must take into account the portion of the total resources allocated to each thread each cycle.
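One hypothetical way to picture a per-cycle proportional charge is sketched below. The text above does not specify the calculation; the rule used here (charging each thread the fraction of that cycle's dispatched instructions that belonged to it, and splitting idle cycles evenly) is an assumption chosen only to show that each cycle can still contribute exactly one unit of charge in total.

```python
from fractions import Fraction


def charge_cycles(slots_per_cycle, num_threads=2):
    """slots_per_cycle: per cycle, a tuple of dispatch slots used by each thread.

    Assumed rule (illustrative only): each cycle contributes a total charge
    of 1, divided in proportion to the slots each thread used. A cycle with
    no dispatch is split evenly between the threads.
    """
    purr = [Fraction(0)] * num_threads
    for slots in slots_per_cycle:
        used = sum(slots)
        for t in range(num_threads):
            if used:
                purr[t] += Fraction(slots[t], used)
            else:
                purr[t] += Fraction(1, num_threads)
    return purr


# Four cycles: shared dispatch, thread 0 alone, an idle cycle, shared dispatch.
trace = [(4, 1), (5, 0), (0, 0), (2, 3)]
purr = charge_cycles(trace)
print([float(p) for p in purr])  # prints [2.7, 1.3]
```

Exact fractions are used so that the per-thread charges sum to the cycle count without rounding drift.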
Therefore, it would be advantageous to have a technique for accurately measuring the useful processor capacity allocated to each thread in a simultaneously multi-threaded processor, where instructions from multiple threads may be dispatched on the same cycle, within a data processing system which allows dynamic variation of processor frequency to optimize performance and temperature.