This invention relates to deconfiguring a system or system components, and particularly to determining a minimally degraded configuration when failures occur along chip connections.
In today's server systems, it is common for many cooperating processors to be contained within a single system. These processors have communication paths (e.g., connections) for transferring data to and from the other processors within the system. As the number of processors in a system increases, the probability of an error along one of these communication paths increases. Further, errors may occur in components on a multi-path network (e.g., in a communication path), such as processing nodes in a supercomputer or a multi-path input/output (IO) system, where the paths between the processor and hard drives pass through two or more IO adapters.
In order to facilitate high-end reliability, availability, and serviceability (RAS) capabilities in server systems or other systems, operators are able to deconfigure a hardware entity, such as a processor, when a failure occurs. When the failure is along one of the communication paths between two processors, the failure is typically detected at one or both ends of the communication path and is translated, by diagnostics software, into a “connection deconfiguration” event.
Typically, large server configurations involve a number of tightly coupled processors on a single system board or drawer. These drawers of tightly coupled processors are attached via interconnects to a number of other drawers to create the full server system. Due to hardware implementation or for firmware simplicity, deconfiguring a single processor on a drawer may require deconfiguring some or all of the other processors on the drawer. These secondary deconfigurations are called “associative” deconfigurations.
The current approach to handling a “connection deconfiguration” event is to deconfigure both hardware items, one at each end of the connection. If multiple “connection deconfiguration” events occur on a system, this approach can lead to many more processors being deconfigured than necessary. When coupled with the “associative” deconfigurations, this can greatly reduce system performance, potentially leaving the system with only one functional drawer or even one functional processor.
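The cascading effect described above can be illustrated with a minimal sketch. The drawer layout, the processor names, and the function naive_deconfigure are hypothetical and chosen only for illustration. The sketch also assumes the strictest associative rule, in which deconfiguring any processor removes every processor on the same drawer; as noted above, an actual system may remove only some of them.

```python
from typing import Dict, Iterable, Set, Tuple


def naive_deconfigure(
    drawers: Dict[str, Set[str]],
    failed_connections: Iterable[Tuple[str, str]],
) -> Set[str]:
    """Return the processors lost when both ends of each failed connection
    are deconfigured and each deconfiguration takes its whole drawer."""
    # Map each processor back to its drawer for the associative step.
    drawer_of = {proc: name for name, procs in drawers.items() for proc in procs}
    lost: Set[str] = set()
    for end_a, end_b in failed_connections:
        for endpoint in (end_a, end_b):
            # Assumed associative rule: the entire drawer is deconfigured.
            lost |= drawers[drawer_of[endpoint]]
    return lost


# Hypothetical system: four drawers with two processors each.
drawers = {
    "D0": {"P0", "P1"},
    "D1": {"P2", "P3"},
    "D2": {"P4", "P5"},
    "D3": {"P6", "P7"},
}
# Two failed inter-drawer connections.
failed = [("P0", "P2"), ("P3", "P4")]

lost = naive_deconfigure(drawers, failed)
remaining = set().union(*drawers.values()) - lost
print(sorted(remaining))
```

Under these assumptions, two connection failures remove three of the four drawers and leave only the processors on drawer D3 functional.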
It is therefore desirable to have a method and computer program product that provide minimal deconfigurations when failures occur.