1. Field of the Invention
The present invention generally relates to performing resource analysis on a system such as a computer system. More particularly, the present invention relates to performing resource analysis on one or more cards of a computer system to determine whether it is safe to perform an online card operation while the computer system is running.
2. Related Art
Systems such as computer systems (e.g., servers) are utilized in a variety of applications. Some systems (e.g., servers executing purchase transactions for an Internet-based merchant) are required to have a high-availability design. That is, these systems need to be running and available at all times. Thus, powering down and rebooting these systems for any reason (e.g., maintenance, upgrades, system crashes, or troubleshooting) has to be avoided as much as possible. Typically, these systems have a plurality of slots, wherein a card can be coupled to each slot. Devices (e.g., mass storage devices) and network ports are examples of components that can operate off the card. Generally, one or more devices can be coupled to the card via cables.
To maintain high availability, online card operation capability has been integrated into these systems. The online card operation capability enables a user (e.g., a system administrator) to perform online card operations without powering down and rebooting these systems. Examples of online card operations, each performed while the system is running, include adding a card to the system by coupling it to a slot, replacing an existing card that is coupled to a slot with another card, and removing a card from the system by uncoupling it from a slot. These online card operations generally require that particular drivers be suspended and that the power to the slot(s) of interest be shut off before a card can be added, replaced, or removed. Generally, slot power control and drivers facilitate these online card operations. In some systems, several slot power domains are configured, wherein the slots in each slot power domain share a power line. If an online card operation will be performed on any slot in a slot power domain, then all the slots in that slot power domain will lose power, which increases the complexity of performing the online card operation.
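The effect of a shared power line can be illustrated with a minimal Python sketch. It assumes a simple mapping from slot power domains to the slots they contain; the domain names, slot names, and the function `slots_losing_power` are hypothetical and are not taken from any particular system.

```python
# Hypothetical layout: each slot power domain lists the slots sharing one power line.
POWER_DOMAINS = {
    "domain_a": ["slot1", "slot2"],
    "domain_b": ["slot3"],
}


def slots_losing_power(target_slots, power_domains=POWER_DOMAINS):
    """Return, in sorted order, every slot that loses power when the
    target slots are powered down for an online card operation."""
    # A target slot that belongs to no configured domain only affects itself.
    affected = set(target_slots)
    for domain_slots in power_domains.values():
        # One targeted slot is enough to take down the whole domain.
        if any(slot in target_slots for slot in domain_slots):
            affected.update(domain_slots)
    return sorted(affected)


# Operating on slot1 also powers down slot2, which shares its power line.
print(slots_losing_power({"slot1"}))  # ['slot1', 'slot2']
```

Under this assumed layout, any analysis performed before the operation would have to consider the cards in every slot returned, not only the card the user intends to add, replace, or remove.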
Before the online card operation is performed, typically a resource analysis is performed. This resource analysis is also useful when groups of cards are taken offline in a single operation, such as when an entire chassis of cards is removed from the system while the system is running. Typically, this resource analysis is referred to as a “critical resource analysis” (CRA). The CRA analyzes and reports the impact of powering down each slot associated with any card that is involved in an attempted online card operation (e.g., adding, replacing, or removing card(s)). This requires identifying the resources of the system that would be affected. Conventionally, each identified affected resource is assigned either a low severity level (or warning level) or a high severity level (or critical level). If the identified affected resources are essential for system operation, they are assigned the critical level. This indicates that if the slot(s) are powered down, making the functionality of the card(s) coupled to the slot(s) unavailable, the system likely will crash or enter an unhealthy/failed state. The user is generally prevented from performing the online card operation if an identified affected resource is assigned the critical level, so that the system keeps running and the desired system availability level is maintained. If the identified affected resources are not essential for system operation, they are assigned the warning level. This indicates that if the slot(s) are powered down, making the functionality of the card(s) coupled to the slot(s) unavailable, the system likely will not crash or enter an unhealthy/failed state. The determination of whether an identified affected resource is “essential for system operation” may vary among different systems.
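The conventional two-level classification described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Resource` record, its `essential` flag (standing in for whatever system-specific policy decides that a resource is essential for system operation), and the function `critical_resource_analysis` are hypothetical names, not an actual CRA implementation.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    WARNING = "warning"    # low severity: system likely keeps running
    CRITICAL = "critical"  # high severity: system likely crashes or fails


@dataclass(frozen=True)
class Resource:
    name: str
    slot: str        # slot whose card provides this resource
    essential: bool  # assumption: set by a system-specific policy


def critical_resource_analysis(resources, slots_to_power_down):
    """Report a severity for each affected resource and whether the
    online card operation may proceed."""
    # A resource is affected if the slot it depends on will lose power.
    affected = [r for r in resources if r.slot in slots_to_power_down]
    report = {
        r.name: Severity.CRITICAL if r.essential else Severity.WARNING
        for r in affected
    }
    # The operation is blocked as soon as one affected resource is critical.
    allowed = all(level is not Severity.CRITICAL for level in report.values())
    return report, allowed


resources = [
    Resource("boot_disk", "slot1", essential=True),
    Resource("backup_tape", "slot2", essential=False),
]
report, allowed = critical_resource_analysis(resources, {"slot2"})
print(report, allowed)
```

In this sketch, powering down only `slot2` yields a single warning and the operation is allowed, whereas including `slot1` would mark `boot_disk` as critical and block the operation.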
In general, the CRA performs a series of checks to determine whether the card and/or slot, and the resources/devices associated with them, are essential to system operation. The CRA functionality is intended to keep the system running, to avoid inadvertent reboots, and to prevent the system from getting into an unhealthy state.