The present disclosure relates generally to resolving abnormal contention and, more specifically, to a method and apparatus for resolving abnormal contention at a hypervisor level on a computer system for a serially reusable resource.
In computer system workloads, there are often a number of transactions that make up jobs, and a number of jobs that make up a program, all of which vie for some of the same limited resources. Some of these are serially reusable resources such as memory, processors, and software instances. In such workloads, the relationships between jobs, transactions, and programs may be numerous and increasingly dynamic, creating complex resource dependency scenarios that can cause delay. For example, when a thread or unit of work involved in a workload blocks a serially reusable resource, it slows down not only itself but also other jobs and/or transactions that are running concurrently across the system, the entire system complex, or a cluster of systems and that are waiting for the resource. In mission critical workloads, such delays may not be acceptable to the system and a user.
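The blocking behavior described above can be illustrated with a minimal sketch. It assumes that a `threading.Lock` stands in for a serially reusable resource and that a sleeping thread stands in for a long-running unit of work; the function name `measure_waiter_delay` and its parameter are hypothetical and are not part of the disclosure.

```python
import threading
import time


def measure_waiter_delay(hold_seconds: float) -> float:
    """Return how long a waiter is blocked while another thread holds the resource."""
    resource = threading.Lock()        # assumption: a lock models the serially reusable resource
    holder_has_it = threading.Event()

    def holder() -> None:
        with resource:
            holder_has_it.set()        # signal that the resource is now held
            time.sleep(hold_seconds)   # simulated work performed while holding the resource

    worker = threading.Thread(target=holder)
    worker.start()
    holder_has_it.wait()               # ensure the holder owns the resource before we contend

    start = time.monotonic()
    with resource:                     # the waiter blocks here until the holder releases
        waited = time.monotonic() - start
    worker.join()
    return waited


if __name__ == "__main__":
    print(f"waiter was delayed for about {measure_waiter_delay(0.2):.2f} s")
```

In this sketch the waiter's delay grows with the time the holder keeps the resource, which is the kind of delay that can propagate to other work waiting on the same resource.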
Further, a system may include Logical Partitioning (LPAR), which can include a notion of a computing weight. The computing weight can be defined as the maximum computing power allowed for a single system image running on top of LPAR. This limit may restrict a system image's CPU time when the computer system is run at full capacity. LPAR also has a notion of soft capping, in which an artificial computing limit can be imposed upon an image in order to control the amount of processing work a computer can perform in a given period, for example, one hour. This work can be measured in units such as million service units (MSU) consumed. The soft cap can take effect before the image reaches its potential capacity and can become a bottleneck. Another cause of hypervisor level resource bottlenecks can be system images configured with only a single processor, which can be called a uni-processor arrangement.
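As a rough illustration of how a soft cap can become a bottleneck, the following sketch compares the MSU an image has consumed in the current hour against an imposed limit. It is a simplification under stated assumptions: the function names, the one-hour accounting window, and the sample values are hypothetical and do not describe how any particular hypervisor implements soft capping.

```python
from typing import Sequence


def remaining_msu(consumed_msu: Sequence[float], cap_msu_per_hour: float) -> float:
    """Return the MSU still available to the image in the current hour (never negative)."""
    return max(0.0, cap_msu_per_hour - sum(consumed_msu))


def is_soft_capped(consumed_msu: Sequence[float], cap_msu_per_hour: float) -> bool:
    """Return True once consumption in the hour has reached the imposed limit."""
    return remaining_msu(consumed_msu, cap_msu_per_hour) == 0.0


if __name__ == "__main__":
    samples = [10.0, 15.0, 20.0]            # hypothetical MSU consumed so far this hour
    print(remaining_msu(samples, 50.0))     # headroom left under a hypothetical 50 MSU cap
    print(is_soft_capped(samples, 50.0))    # whether further work would be throttled
```

An image that reports no remaining headroom would be held back even if the underlying hardware has spare capacity, which is why such a cap can delay work that holds a contended resource.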
Additional delays may be caused by human factors. For example, one such factor is a reduction of IT staff in an IT shop or department. Another is IT staff whose experience falls below the threshold needed to provide sufficient support. Some automation may be utilized to help alleviate delay; however, automation may not have enough intrinsic knowledge of the system to detect delays or to make decisions regarding the delays or the causes of the blocking jobs.
An operating system of the future is envisioned that can monitor such workloads and automatically resolve abnormal contention (with greater accuracy) to help recover from delays in order to provide increased availability and throughput of resources for users. These types of analytics and cluster-wide features may help keep valuable systems operating competitively at or above desired operating thresholds.