A typical complex, resource-based computing system, such as an advanced data storage array controller, requires large pools of memory in which queues are created to aggregate commands and operations and thereby increase efficiency. An example of this type of queue is a group of memory buffers that aggregates data to be written to a group of disk drives, allowing burst write operations, which are more efficient and reduce overall system latencies. As load increases on these types of complex computing systems and resources reach exhaustion, it becomes necessary to begin storing incoming commands until resources become available. The computing power required to store, manage, and retrieve commands increases overall latencies and can become quite inefficient, producing additional resource starvation until the system begins to bog down. In the most extreme cases, resource starvation can increase to the point where more system resources are being used to manage the low-resource condition than are available for the actual work that the computing system is designed to perform.
By way of example, consider the concept of resource exhaustion applied to a data storage array controller that uses a pool of cache memory buffers to store and aggregate data to be written to a group of mechanical storage devices. Once the available pool of cache buffers has been exhausted, the array controller begins storing incoming commands in a queue and waits for more buffers to become available. As more and more commands back up in the waiting queue, command latencies grow, and the requesting devices begin to exceed their command timeout values. The requesting devices then issue command abort requests to the array controller, which forces the consumption of additional resources to locate and remove commands and data from the processing queues. In the most extreme cases, so much computing power is being used to process command abort operations that most of the commands coming into the array controller end up being aborted by the requesting device, and what appears to be a deadlock occurs. In addition to the resource exhaustion in the array controller itself, this command backup scenario extends to the systems making the requests as well, as they are forced to handle more and more abort and retry operations on top of the ongoing workload generating the requests.
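The backup scenario above can be illustrated with a toy model. The following is only a minimal sketch under stated assumptions: the class `ArrayController`, its methods, and the `abort_scan_steps` counter are hypothetical names introduced for illustration, and the linear scan on abort is an assumed cost model rather than the behavior of any particular controller.

```python
from collections import deque


class ArrayController:
    """Toy model (hypothetical): a fixed pool of cache buffers plus a wait queue."""

    def __init__(self, total_buffers):
        self.total_buffers = total_buffers
        self.in_use = 0
        self.waiting = deque()      # ids of commands waiting for a buffer
        self.abort_scan_steps = 0   # work spent locating aborted commands

    def submit(self, cmd_id):
        """Take a buffer if one is free; otherwise park the command in the queue."""
        if self.in_use < self.total_buffers:
            self.in_use += 1
            return "started"
        self.waiting.append(cmd_id)
        return "queued"

    def abort(self, cmd_id):
        """Locate and remove a queued command (assumed linear scan of the queue)."""
        for position, queued in enumerate(self.waiting):
            self.abort_scan_steps += 1
            if queued == cmd_id:
                del self.waiting[position]
                return True
        return False

    def complete_one(self):
        """Finish one running command; its buffer passes to the oldest waiter, if any."""
        if self.waiting:
            return self.waiting.popleft()
        if self.in_use > 0:
            self.in_use -= 1
        return None
```

In this sketch, the cost of each abort grows with the length of the waiting queue, so a deep backlog combined with many timed-out requesters spends increasing effort on queue maintenance instead of on writes.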
One mechanism for resource management involves constantly monitoring the usage levels of critical resources within a computing system and rejecting, as they are received, requests that require resources that are nearing exhaustion. Rejection of commands that require the nearly exhausted resource then continues until the amount of available resource increases to an acceptable level. This approach has the advantage of making the requesting systems aware that resource exhaustion has occurred, which allows them to implement algorithms of their own to deal with the exhaustion proactively rather than reactively through command aborts and retries. This method of resource management avoids the additional resource starvation created when long latencies begin to back up a computing system and large waiting queues build up, but it has been shown to create several new problems that need to be addressed. The first problem is that this type of resource management works like an on/off switch, causing erratic system throughput and “sawtooth” performance curves. Requesting systems are either allowed to run free or are throttled down to executing only one command at a time. The second problem is that one requesting system, or a small number of them, may consume all of the available resources in the system, creating potentially long latencies for systems that have much lower usage levels. In usage modeling, it has been shown that this simple resource management scheme, while providing relief to the system it is running on, actually causes more problems than it solves on a system-wide basis, and often results in the feature being disabled in field installations. In some cases, specific computer operating systems perform so badly in an environment running this type of resource management scheme that the scheme must be disabled when systems running those operating systems are present in the environment.
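The simple scheme described above can be sketched as a two-level gate. This is a hedged illustration, not the scheme as implemented in any product: the class `ThresholdAdmission` and the parameters `reject_below` and `resume_at` are hypothetical stand-ins for "nearing exhaustion" and "an acceptable level", neither of which the text quantifies.

```python
class ThresholdAdmission:
    """Hypothetical on/off admission gate for a single monitored resource."""

    def __init__(self, capacity, reject_below, resume_at):
        if not 0 <= reject_below <= resume_at <= capacity:
            raise ValueError("expected 0 <= reject_below <= resume_at <= capacity")
        self.capacity = capacity
        self.available = capacity
        self.reject_below = reject_below  # assumed "nearing exhaustion" level
        self.resume_at = resume_at        # assumed "acceptable" level
        self.rejecting = False

    def request(self, units=1):
        """Return True if the request is admitted, False if it is rejected."""
        # Leave the rejecting state only once the resource has recovered.
        if self.rejecting and self.available >= self.resume_at:
            self.rejecting = False
        # Enter the rejecting state if this request would dip below the floor.
        if not self.rejecting and self.available - units < self.reject_below:
            self.rejecting = True
        if self.rejecting:
            return False
        self.available -= units
        return True

    def release(self, units=1):
        """Return units to the pool when a command completes."""
        self.available = min(self.capacity, self.available + units)
```

Note that the gate keeps no per-requester accounting and has only two states, which is consistent with the two problems identified above: throughput that switches abruptly between admitted and rejected, and the possibility that one busy requester takes every available unit.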