The present invention relates generally to processors and computing systems, and more particularly, to a simultaneous multi-threaded (SMT) processor. The present invention also relates to processor utilization, system utilization, and system capacity accounting systems.
Modern microprocessors are usually superscalar, which means a single processor can decode, dispatch, and execute multiple instructions each processor cycle. These modern processors may also support SMT, which means each processor can execute more than one software program (thread) concurrently.
SMT processors provide an efficient use of processor resources, as multiple threads may simultaneously use processor resources. Multiple threads are concurrently executed in an SMT processor so that multiple processor execution units, such as floating point units, fixed point instruction units, load/store units and others can be performing tasks for one (or more depending on the execution units' capabilities) of multiple threads simultaneously. Storage and register resources may also be allocated on a per-thread basis.
Accounting for processor time use is necessary for administration of computer services sales, as well as for internal cost-accounting management when, for example, some processor runs are for research and development activities that permit the hardware to be capitalized in a different manner for tax purposes than other uses. A server may be partitioned and processor time sold to multiple users “on demand” or on an as-used basis. Additionally, processor time may be utilized by hardware owners or lessors and also subcontracted out to entities paying for services. Therefore, accurate accounting for processor execution time is a necessity in computer architectural and software models.
In single-threaded processing systems, accounting is generally straightforward. A count of processor cycle use or even simple “wall-clock” time measurement can be provided for complete job runs, as even if multiple threads within multiple programs are executed, they are not executed simultaneously, but sequentially. A tally of cycle times is maintained until a job is complete and the total is presented for accounting purposes. The measured time correlates directly to processor resource utilization.
In an SMT processor, two or more threads may be simultaneously executing within a single processor core and the usage of resources by each thread is not easily determined by a simple execution count or time measurement. It is therefore desirable to provide a method and apparatus that can account for overall processor time usage in an SMT processor. It is further desirable to provide a method for accounting of resource usage within an SMT processor distributed among threads executing within such a processor.
Early core designs did not support the concurrent execution of multiple hardware threads. This made the measurement of utilization a fairly simple matter of accumulating the time (processor cycles) spent doing useful work and comparing that to the overall time available on the system. The fundamental assumption is that the time spent in idle intervals can be converted into the execution of code performing useful work at nearly the same rate as code executing in non-idle intervals. The ideal conversion ratio is linear.
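The single-threaded measurement described above can be illustrated with a minimal sketch. The function names `utilization` and `projected_capacity` are hypothetical and chosen only for this illustration. The second function simply applies the linear conversion assumption stated above and is not a model of any particular system.

```python
def utilization(busy_cycles: int, total_cycles: int) -> float:
    """Fraction of the available cycles spent doing useful (non-idle) work."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    if not 0 <= busy_cycles <= total_cycles:
        raise ValueError("busy_cycles must lie between 0 and total_cycles")
    return busy_cycles / total_cycles


def projected_capacity(work_done: float, util: float) -> float:
    """Estimate the maximum work for the same interval.

    Assumes idle time converts to useful work at the same rate as busy
    time (the linear conversion ratio), so capacity = work / utilization.
    """
    if not 0 < util <= 1:
        raise ValueError("util must be in the range (0, 1]")
    return work_done / util
```

Under this assumption, a system that completes 300 units of work at 60% utilization would be estimated to have a capacity of roughly 500 units over the same interval.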
In order to manage processor resources most effectively, the operating system must know the portion of a processor's resource capacity that is actually consumed by each thread. A Processor Utilization Resource Register (PURR) was introduced to measure the portion of processor capacity consumed by each thread in an SMT processor core. Each thread in the processor has a PURR. Logic in the processor calculates the fraction of instruction dispatch capacity allocated to each thread and accumulates this “charge” to the PURR for each thread. The PURR attempts to divide a single interval of time amongst multiple concurrent hardware threads. The average non-idle PURR accumulation per thread divided by the collection period defined the basic processor utilization metric. In previous designs, PURR accumulations were determined primarily by the logic that controls instruction dispatch policies. These accumulations were heavily influenced by thread priority and had few adjustments.
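The per-thread charging described above can be sketched in software as follows. This is an illustrative model only: the class `CorePurr`, its method names, and the representation of dispatch shares as fractions that sum to one are assumptions of this sketch, not a description of the hardware logic. The core-level figure is computed here as the summed non-idle charge divided by the collection period, which is one possible normalization and is likewise an assumption.

```python
from typing import List


class CorePurr:
    """Toy model of per-thread PURR accumulation for one SMT core."""

    def __init__(self, n_threads: int) -> None:
        if n_threads <= 0:
            raise ValueError("n_threads must be positive")
        self.n_threads = n_threads
        self.purr: List[float] = [0.0] * n_threads      # total charge per thread
        self.non_idle: List[float] = [0.0] * n_threads  # charge while not idle
        self.elapsed = 0.0                              # collection period so far

    def charge(self, cycles: float, shares: List[float], idle: List[bool]) -> None:
        """Divide one interval of `cycles` among the threads.

        `shares` is the assumed fraction of dispatch capacity given to each
        thread during the interval; `idle` marks threads running the idle task.
        """
        if len(shares) != self.n_threads or len(idle) != self.n_threads:
            raise ValueError("one share and one idle flag per thread required")
        if abs(sum(shares) - 1.0) > 1e-9:
            raise ValueError("shares must sum to 1")
        for t, (share, is_idle) in enumerate(zip(shares, idle)):
            amount = cycles * share
            self.purr[t] += amount
            if not is_idle:
                self.non_idle[t] += amount
        self.elapsed += cycles

    def utilization(self) -> float:
        """Non-idle charge over the collection period (sketch normalization)."""
        if self.elapsed == 0:
            raise ValueError("no cycles have been charged yet")
        return sum(self.non_idle) / self.elapsed
```

For example, charging 1000 cycles with shares of 0.75 and 0.25 while the second thread is idle, followed by 1000 cycles with equal shares while both threads are busy, yields a utilization of 0.875 in this model. Because the shares always sum to one, the per-thread accumulations together account for exactly the elapsed interval.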
One of the primary customer applications of processor utilization is to estimate system utilization, which is used in turn by the customer to estimate system capacity. No processor utilization measurement method will exactly match system utilization for all workloads.
Finding a more adaptable and more accurate hardware metric to estimate system utilization that is both simple to implement and durable across chip designs is a difficult task. Many approaches were tried and found lacking.
Custom combinations of existing performance counter data provided a more accurate determination of system utilization. This method worked well in the lab for development work. To support a similar performance counter-based method in the field, monitoring frameworks would have to be designed to collect performance counter data at the system level at all times, which is no small task. In future chip designs, the performance counter mixture could be designed into the core logic. However, this method is not attractive at this time as it is difficult to precisely predict which counter combinations would be necessary to measure utilization in future chip designs. Each chip design may necessitate a different mix of counter data.
Because no hardware solution was available for conventional systems, software-only workarounds were explored. For example, changing the priority of the idle task will change its PURR accumulation in conventional systems. The higher the priority of the idle task, the lower the net utilization measurement becomes. Both static and dynamic mixes of priority were investigated. This method was rejected for customer deployment as raising the priority of the idle task noticeably degrades the performance of other threads in the core. This adversely affected commercial batch workloads which spend a significant amount of time executing as single threads in the core.
An alternative software method was later developed, wherein the task dispatcher in the operating system periodically redistributes PURR counts from the running software threads into the accumulator for the idle task. The amount to redistribute was adjusted by trial and error. However, this solution does not scale well if the number of threads per core increases to the point where operating system PURR adjustments are rendered inaccurate.
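The redistribution step described above can be sketched as follows. The function name `redistribute_to_idle` and the single tunable `fraction` are hypothetical; the text states only that the amount to redistribute was adjusted by trial and error, so a fixed fraction applied uniformly to every running thread is an assumption made for illustration.

```python
from typing import Dict, Tuple


def redistribute_to_idle(
    running: Dict[str, float], idle: float, fraction: float
) -> Tuple[Dict[str, float], float]:
    """Move a fraction of each running thread's PURR count to the idle task.

    `running` maps a thread name to its accumulated count, `idle` is the
    idle task's accumulator, and `fraction` is the assumed tuning value.
    Returns the adjusted running counts and the new idle accumulator.
    The total count across all accumulators is preserved.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be in the range [0, 1]")
    adjusted: Dict[str, float] = {}
    moved_total = 0.0
    for name, count in running.items():
        moved = count * fraction
        adjusted[name] = count - moved
        moved_total += moved
    return adjusted, idle + moved_total
```

In this sketch the dispatcher would invoke the function periodically. Each added hardware thread contributes another count to adjust with the same hand-tuned value, which is consistent with the scaling concern noted above.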
What is needed is a device and method that is operable to accurately estimate a system's maximum capacity to perform useful work based on an accurate measurement of a usage metric at the hardware thread level.