Ever-increasing demand for high-throughput data processing systems has caused computer designers to develop sophisticated multi-processor designs. Initially, additional processors were provided to improve the overall bandwidth of the system. While the additional processors provided some level of increased performance, it became evident that further improvements were necessary.
One way to improve system performance involves the use of partitioning. Partitioning refers to the allocation of the system's data processing resources to a number of predefined “partitions”. Each partition may operate independently from the other partitions in the system. Using partitioning, a number of parallel tasks may be executed independently within the system. For example, a first portion of the system resources may be allocated to a first partition to execute a first task while a second portion of the system resources may be allocated to a second partition to execute a second task.
System resources may be allocated to partitions by a system controller based on the tasks being executed within the data processing system at a given time. For example, a system controller may add resources to a partition that is currently processing a very large task, and may remove resources from a partition servicing a smaller task, thereby increasing the efficiency of the overall system. U.S. Pat. No. 5,574,914 to Hancock et al. describes a system and method that utilizes a site management system that is capable of moving resources between multiple partitions in this manner based on the requirements of the data processing system.
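The kind of reallocation described above may be illustrated with a minimal sketch. The sketch is not drawn from the cited patent; the `Partition` record, the `pending_work` load measure, and the `rebalance` function are hypothetical names chosen only for illustration, and a real system controller would likely weigh many more factors.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    # All fields are illustrative assumptions, not taken from any real system.
    name: str
    processors: int    # processors currently allocated to this partition
    pending_work: int  # hypothetical measure of outstanding work


def load_per_processor(partition: Partition) -> float:
    """Outstanding work divided by the processors available to do it."""
    return partition.pending_work / partition.processors


def rebalance(partitions: list[Partition], min_processors: int = 1) -> bool:
    """Move one processor from the least loaded partition to the most loaded.

    Returns True if a processor was moved, False if no move was possible.
    """
    busiest = max(partitions, key=load_per_processor)
    idlest = min(partitions, key=load_per_processor)
    # Do nothing if there is only one candidate or the donor is at its floor.
    if busiest is idlest or idlest.processors <= min_processors:
        return False
    idlest.processors -= 1
    busiest.processors += 1
    return True
```

In this sketch, a partition with a large backlog gains a processor at the expense of a lightly loaded one, and each partition always retains at least `min_processors`.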
One problem with partitionable systems involves error recovery. In a system that supports multiple partitions, some resources will generally be shared between the multiple partitions. When a partition experiences a failure, some mechanism is needed to remove the effects of the fault from the common resources so that other non-failing partitions can continue to utilize those resources. For example, a common main memory may receive and process requests from more than one partition. When a fault in one partition occurs, a mechanism is needed to remove all requests and responses, as well as the effects of those requests and responses, from the various queues and other logic included within the common memory.
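The requirement stated above can be pictured with a simplified software model. This is only a sketch under stated assumptions: the `Request` record, its `partition_id` tag, and the `SharedMemoryQueue` class are hypothetical, and actual hardware queues and the side effects of in-flight requests are considerably more involved than a single list of pending entries.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record; each request is tagged with its source partition.
    partition_id: int
    address: int


class SharedMemoryQueue:
    """Toy model of a request queue in a memory shared by several partitions."""

    def __init__(self) -> None:
        self.pending: deque[Request] = deque()

    def submit(self, request: Request) -> None:
        self.pending.append(request)

    def purge_partition(self, failed_id: int) -> int:
        """Discard queued requests from the failed partition only.

        Requests from other partitions keep their relative order.
        Returns the number of requests discarded.
        """
        before = len(self.pending)
        self.pending = deque(
            r for r in self.pending if r.partition_id != failed_id
        )
        return before - len(self.pending)
```

The point of the model is that entries belonging to a failing partition are identified and removed while entries from the other partitions remain queued in their original order.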
Prior art partitionable systems address the foregoing problem by forcing the common resource to discontinue processing requests from all partitions, failing and non-failing alike, after a fault is detected. The logic of the common resource is then re-initialized, as may be accomplished using a maintenance processor. Once re-initialization is complete, request processing resumes for the non-failing partitions. This method stops execution of the non-failing partitions at least temporarily, thereby impacting system throughput.
Another problem with partitionable systems is that the logic required to isolate an error within a partition is generally quite extensive. For example, prior art systems provide dedicated error reporting and recovery interfaces that can be enabled to allow error recovery activities to occur on a partition-by-partition basis. The extensive nature of the required logic increases power utilization and consumes logic resources.
What is needed, therefore, is an improved error recovery mechanism for a partitionable system that allows partitions that are unaffected by a fault to continue making requests to shared resources in a manner that is not impacted by recovery operations. The mechanism ideally takes advantage of existing system interfaces so that error reporting and recovery are completed without the use of dedicated interfaces and extensive circuitry.