Computer servers are used in a variety of applications. Some systems, such as server systems that execute purchase transactions for an Internet-based merchant, usually have a high-availability design. Such high-availability systems are designed to be running and available at all times. Thus, the need to power down and reboot these systems for any reason (for example, maintenance, upgrades, system crashes, or troubleshooting) has to be avoided as much as possible. Typically, these systems have a plurality of expansion slots, each of which may have a card coupled to it. Mass storage devices and network adapters are examples of components that may be connected via such cards. Generally, one or more further devices may be coupled to a card via cables.
To maintain high availability, online card operation capability, also referred to as online hot plug operation capability, has been integrated into high-availability computer systems. This capability enables a user and/or a system administrator to perform online card operations without powering down and rebooting these systems. Examples of online card operations include adding a card to the system by coupling it to a slot, replacing an existing card that is coupled to a slot with another card, and removing a card from the system by uncoupling it from a slot, all while the system is running. These operations generally require that particular drivers for the card be suspended and that the power to the slot(s) of interest be shut off before a card can be added, replaced, or removed. Generally, slot power control and drivers may facilitate these online card operations. In some systems, several slot power domains are configured, wherein the slots in each slot power domain share a common power line. If an online card operation is performed on any slot in a slot power domain, then all the slots in that slot power domain may lose power, which increases the complexity of performing the online card operation.
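The effect of slot power domains can be illustrated with a minimal sketch. The slot names, domain names, and the function slots_losing_power are hypothetical and chosen only for illustration; the sketch assumes that each slot belongs to exactly one slot power domain and that powering down any slot powers down its whole domain.

```python
from collections import defaultdict

# Hypothetical layout: slot identifier -> slot power domain identifier.
SLOT_TO_DOMAIN = {
    "slot1": "domainA",
    "slot2": "domainA",
    "slot3": "domainB",
    "slot4": "domainB",
    "slot5": "domainC",
}


def slots_losing_power(target_slots, slot_to_domain):
    """Return every slot that shares a power domain with any target slot."""
    # Invert the mapping so each domain lists the slots on its power line.
    domain_to_slots = defaultdict(set)
    for slot, domain in slot_to_domain.items():
        domain_to_slots[domain].add(slot)

    # Powering down one slot is assumed to power down its entire domain.
    affected = set()
    for slot in target_slots:
        affected |= domain_to_slots[slot_to_domain[slot]]
    return affected
```

In this hypothetical layout, an operation on slot1 alone also removes power from slot2, so any card coupled to slot2 has to be taken into account even though it is not the target of the operation.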
Before such an online hot plug operation, a resource analysis is typically performed on the computer system. The resource analysis may also be useful when groups of cards are taken offline in a single operation, such as when an entire chassis of cards is removed from the computer system while the system is running. The resource analysis may also be referred to as a “critical resource analysis” (CRA). The CRA may analyze and report the impact of powering down each slot associated with any card that is involved in an attempted online hot plug operation, for example, adding, replacing, or removing card(s). This may require identifying the affected resources of the system. Conventionally, each identified affected resource is assigned either a low severity level (warning level) or a high severity level (critical level). If the identified affected resources are essential for system operation, they are assigned the critical level. This indicates that if the slot(s) is powered down, making the functionality of the card(s) coupled to it unavailable, the system likely will crash or enter an unhealthy/failed state. The user is generally prevented from performing the online card operation if an identified affected resource is assigned the critical level, so that the system keeps running and maintains the desired system availability level. The manner of determining whether an identified affected resource is “essential for system operation” may vary among different systems. If the identified affected resources are not essential for system operation, they are assigned the warning level. This indicates that if the slot(s) is powered down, making the functionality of the card(s) coupled to it unavailable, the system likely will not crash or enter an unhealthy/failed state.
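The two-level classification described above can be sketched as follows. This is an illustrative sketch only, not the analysis performed by any particular system: the names Severity, Resource, analyze, and RESOURCES are hypothetical, and the essential flag is simply assumed to be given, since the manner of determining it varies among systems.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"    # low severity: system likely keeps running
    CRITICAL = "critical"  # high severity: system likely crashes or fails


@dataclass(frozen=True)
class Resource:
    name: str
    slot: str
    essential: bool  # assumed input; how it is decided varies among systems


def analyze(slots_to_power_down, resources):
    """Classify each affected resource and decide whether to allow the operation."""
    report = {}
    for res in resources:
        if res.slot in slots_to_power_down:
            report[res.name] = Severity.CRITICAL if res.essential else Severity.WARNING
    # The operation is allowed only if no affected resource is critical.
    allowed = all(level is not Severity.CRITICAL for level in report.values())
    return report, allowed


# Hypothetical resources, for illustration only.
RESOURCES = [
    Resource("boot_disk", "slot1", essential=True),
    Resource("backup_nic", "slot2", essential=False),
    Resource("data_disk", "slot3", essential=False),
]
```

With these hypothetical resources, powering down slot2 alone yields only a warning for backup_nic and the operation is allowed, whereas powering down slot1 and slot2 together marks boot_disk as critical and the operation is refused.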
In general, the CRA performs a series of checks to determine whether the card and/or slot, and the resources/devices associated with it, are essential to system operation. The CRA functionality is intended to keep the system running, avoid inadvertent reboots, and prevent the system from entering an unhealthy state.