Modern computing systems execute a variety of requests concurrently and operate in a dynamic environment of cooperative systems, each composed of numerous hardware components subject to failure or degradation. The need to regulate concurrent hardware and software ‘events’ has led to the development of a field that may be generically termed ‘Workload Management.’
Workload management techniques focus on managing or regulating a multitude of individual yet concurrent requests in a computing system by effectively controlling resource usage within the computing system. Resources may include any component of the computing system, such as central processing unit (CPU) usage, hard disk or other storage device usage, input/output (I/O) usage, and the like.
However, workload management techniques fall short of full system regulation because they do not manage unforeseen impacts, whether these arise from unplanned situations (e.g., a request volume surge, the exhaustion of shared resources, or external conditions such as component outages) or even from planned situations (e.g., system maintenance or data loads).
Many different types of system conditions or events may negatively impact the performance of requests currently executing on a computer system. These events may remain undetected for a prolonged period of time, causing a compounding negative effect on requests executing during that interval. When problematic events are detected, sometimes in an ad hoc and manual fashion, the computing system administrator may still not be able to take an appropriate course of action, and may either delay corrective action, act incorrectly, or not act at all.
Contemporary workload management systems allow users to establish Service Level Goals (SLGs) for workloads (WDs). The SLGs are primarily used for reporting purposes, e.g., to gauge how well a workload is performing and to note trends with respect to meeting those SLGs. One option is to establish an SLG based on response time with a service percentage. A second option is to define the SLG based on throughput rate (i.e., completions).
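The two SLG options can be illustrated with a minimal sketch. The class names, field names, and units below are hypothetical and chosen only for illustration. In particular, the sketch assumes that the service percentage means the share of requests completing within the response-time limit, and that throughput is measured in completions per hour.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ResponseTimeSLG:
    """Hypothetical SLG: service_pct percent of requests finish within max_seconds."""
    max_seconds: float
    service_pct: float

    def is_met(self, response_times: Sequence[float]) -> bool:
        if not response_times:
            # Assumption: with no completed requests there is nothing to miss.
            return True
        within = sum(1 for t in response_times if t <= self.max_seconds)
        return 100.0 * within / len(response_times) >= self.service_pct


@dataclass(frozen=True)
class ThroughputSLG:
    """Hypothetical SLG: at least min_completions_per_hour completions per hour."""
    min_completions_per_hour: float

    def is_met(self, completions: int, interval_hours: float) -> bool:
        return completions / interval_hours >= self.min_completions_per_hour
```

For example, `ResponseTimeSLG(max_seconds=2.0, service_pct=80.0)` is met when four of five requests complete within two seconds, and `ThroughputSLG(100)` is missed when only 50 requests complete in one hour.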
A second use of the SLGs is to automatically detect when SLGs are being missed. For example, one of the primary approaches used by database administrators (DBAs) and system administrators is first to identify that there is a problem with their SLGs. Investigation into the cause typically starts with analysis at the system level. If the system is not 100% busy and does not show heavy skew, the DBA will typically check next for blocked sessions.
However, if the CPU is 100% busy, the number of active sessions will be checked for unusually high concurrency levels. If some workloads have too many active sessions, appropriate actions may be taken, such as limiting concurrency, aborting queries, and/or adjusting the Priority Scheduler weights.
If the CPU is 100% busy and the number of active sessions appears appropriate, the DBA may next check the CPU usage by WD and/or session to determine whether there is a runaway query. From here, the DBA may take the appropriate action, e.g., aborting the offending request.
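The investigation sequence described above amounts to a small decision tree, sketched below. The `SystemSnapshot` fields and the returned strings are hypothetical. The text does not say what the DBA does when the system is not fully busy but is heavily skewed, so that branch is reported as unspecified rather than filled in.

```python
from dataclasses import dataclass


@dataclass
class SystemSnapshot:
    """Hypothetical summary of the measurements a DBA would inspect."""
    cpu_busy_pct: float
    heavy_skew: bool
    high_concurrency: bool  # some workloads have too many active sessions
    runaway_query: bool     # CPU usage by WD/session points at one request


def next_step(s: SystemSnapshot) -> str:
    """Return the next investigative step once an SLG miss is known."""
    if s.cpu_busy_pct < 100.0:
        if not s.heavy_skew:
            return "check for blocked sessions"
        # Assumption: the source does not describe this case.
        return "unspecified: system not fully busy but heavily skewed"
    if s.high_concurrency:
        return "limit concurrency, abort queries, and/or adjust Priority Scheduler weights"
    if s.runaway_query:
        return "abort the offending request"
    return "no further step described"
```

Each branch depends on already knowing that an SLG is being missed; the function is not meant to be called otherwise.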
Notably, these investigations are triggered by the knowledge that SLGs are being missed, which enables the DBA to act manually or automatically to resolve the situation and bring WD performance back into SLG conformance.
Defining a real-time event that detects when SLGs are being missed, in order either to notify a DBA or application to take action or to act automatically within the regulator, is straightforward for a response-time-oriented SLG. However, when the SLG is based on a throughput metric, such detections may prove unnecessary when the missed throughput SLG results from under-demand rather than from the system's inability to provide enough service to achieve the target throughput level.
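The difficulty with throughput-based detection can be seen in a short sketch. The function names are hypothetical, and using the arrival rate as the measure of demand is an assumption made here for illustration only; the text does not say how under-demand would be recognized.

```python
def throughput_slg_missed(completions: int, interval_hours: float,
                          target_per_hour: float) -> bool:
    """Naive check: fires whenever the completion rate is below the target."""
    return completions / interval_hours < target_per_hour


def classify_throughput(arrivals: int, completions: int, interval_hours: float,
                        target_per_hour: float) -> str:
    """Separate a miss caused by low demand from one caused by the system.

    Assumption: demand is approximated by the arrival rate of requests.
    """
    if not throughput_slg_missed(completions, interval_hours, target_per_hour):
        return "met"
    if arrivals / interval_hours < target_per_hour:
        # Too few requests arrived to reach the target even if all completed.
        return "under-demand"
    return "system-limited"
```

With a target of 100 completions per hour, an hour in which only 40 requests arrive and all 40 complete trips the naive check even though the system served every request, whereas an hour with 150 arrivals and 60 completions reflects a genuine inability to provide service.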
Therefore, what is needed is a mechanism that overcomes the described problems and limitations.