The present disclosure relates generally to resolving abnormal contention and, more specifically, to a method and apparatus for resolving abnormal contention for a serially reusable resource in a computer system.
In computer system workloads, a number of transactions often make up processes or jobs, and a number of processes make up a program, all of which vie for some of the same limited resources. Some of these are serially reusable resources such as memory, processors, and software instances. In such workloads, there may be many relationships between processes, transactions, and programs, and these relationships are increasingly dynamic, creating complex resource dependency scenarios that can cause delay. For example, when a thread or unit of work involved in a workload blocks a serially reusable resource, it slows down both itself and the other processes and/or transactions that are waiting for the resource while running concurrently across the system, the entire system complex, or a cluster of systems. Some of this slowdown and waiting is known as contention and is to be expected. However, abnormal contention, such as contention that will never end due to a deadlock, that is caused by a program defect, or that lasts longer than normal, is of concern. In mission critical workloads, such contention and delays may not be acceptable to the system or to a user.
Additional delays may be caused by human factors. For example, one such factor is a reduction of IT staff in an IT shop or department; another is IT staff whose experience falls below the threshold needed to provide sufficient support. Some automation may be utilized to help alleviate delay; however, automation may not have enough intrinsic knowledge of the system to detect delays or to make decisions regarding delays or the causes of the blocking processes. Further, choosing the correct action when an abnormal contention event is detected is difficult for both automation and human operators. Additionally, it can take thirty minutes or longer for an operator to respond to a console message, and once at the console, the operator would need intrinsic knowledge of the related processes and resources to decide which action to take. Operator automation programs would fare worse, often simply picking a response without input from the system.
Other approaches exist today that attempt to detect and/or resolve serialization issues within a system or across a distributed environment. Examples include deadlock detectors, which either avoid or detect deadlocks and may take action, such as terminating or rolling back a requestor, to end the deadlock.
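For illustration only, the kind of deadlock detector referred to above can be sketched as a cycle search over a wait-for graph. The sketch below is not taken from the present disclosure; it assumes a simplified model in which each blocked requestor waits on exactly one holder, and the names `find_deadlock`, `choose_victim`, `waits_for`, and `priority` are hypothetical.

```python
from typing import Dict, List, Optional


def find_deadlock(waits_for: Dict[str, str]) -> Optional[List[str]]:
    """Return one cycle of requestors in the wait-for graph, or None.

    waits_for maps each blocked requestor to the holder it is waiting on
    (assumed model: a requestor waits on at most one holder at a time).
    """
    for start in waits_for:
        path: List[str] = []
        seen = set()
        node = start
        # Follow the chain of waiters until it ends or revisits a node.
        while node in waits_for and node not in seen:
            seen.add(node)
            path.append(node)
            node = waits_for[node]
        if node in seen:
            # The chain looped back on itself: everything from the first
            # visit of `node` onward forms the deadlock cycle.
            return path[path.index(node):]
    return None


def choose_victim(cycle: List[str], priority: Dict[str, int]) -> str:
    """Pick the lowest-priority requestor in the cycle to terminate or roll back.

    Unknown requestors default to priority 0 (an assumption of this sketch).
    """
    return min(cycle, key=lambda requestor: priority.get(requestor, 0))


if __name__ == "__main__":
    graph = {"A": "B", "B": "C", "C": "A"}
    cycle = find_deadlock(graph)
    if cycle is not None:
        print("deadlock among:", cycle)
        print("victim:", choose_victim(cycle, {"A": 3, "B": 1, "C": 2}))
```

A detector of this kind addresses only contention that forms a cycle; contention that is merely longer than normal, or that stems from a program defect without a cycle, would not be found by such a search.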
An operating system of the future is envisioned that can monitor such workloads and automatically resolve abnormal contention with greater accuracy, helping the system recover from delays in order to provide increased availability and throughput of resources for users. Such analytics and cluster-wide features may help keep valuable systems operating competitively at or above desired operating thresholds.