1. Field of the Invention
This invention relates to distributed systems, and more particularly to apparatus and methods for alerting system administrators to failure conditions within distributed systems.
2. Background of the Invention
In an effort to improve scalability and fault tolerance, increasing numbers of businesses and enterprises are turning to distributed systems to implement their computing infrastructures. As an example, small and medium businesses are turning to modular solutions such as IBM BladeCenters or similar products of IBM's competitors. As distributed systems become more prevalent, one need that arises is the ability to service these systems in a consistent fashion.
A distributed system typically contains more than one processing or storage device, and is often capable of running multiple processes or operations simultaneously. Distributed systems may house multiple components within a chassis, such as the IBM BladeCenter-S chassis. Distributed systems are often used to perform complex tasks such as computationally expensive research or managing large web services.
There is currently a need for systems that provide a concise and consistent alert protocol across multiple systems and hardware vendors. In some implementations, a single LED or a small number of LEDs may aggregate many alert conditions. This can be disadvantageous in that the failure alerts generated can be confusing or may inadequately describe the failure condition. In addition, certain failure conditions may require removal of a device, whereas other failure conditions may require servicing the device without removal. Even when a device has failed, removing it may be detrimental. For example, it may be detrimental to remove a network device that has encountered a failure and is in the process of performing a shutdown sequence.
This problem may be further complicated in distributed systems where there are many devices, each of which may be configured to generate alerts to the system administrator. For example, many distributed systems have built-in redundancy, such as dual disk controllers. A system administrator may undermine the integrity of the system by removing a failed device if that removal reduces the redundancy in the system.
Current systems often have a set of predefined alerts that may be sent to a system administrator. As hardware is constantly changing, this predefined set of alerts may soon become obsolete, and hardware or software developers may have to choose among predefined alerts that do not accurately describe the error or that are not sufficient to identify the error at all. In addition, developers may not know all possible error conditions when a device is first developed, and may wish to be able to represent new error messages when the device's software or firmware is updated.
In some distributed systems, each device (processing unit, storage unit, power unit, etc.) has its own set of LEDs to provide information to the system administrator. These devices may be connected to a management module that also has its own set of LEDs, such as its own fault LED. Each of these devices may be mounted in a chassis, which may also have its own set of LEDs. Whenever a fault is generated by any of the devices mounted in the chassis, a chassis LED may be lit. This may be undesirable because the chassis LED may indicate that an error has occurred even though no hardware needs to be removed, making it difficult for a system administrator to determine which components in a potentially large distributed system have encountered an error condition and need to be replaced.
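The loss of information described above can be illustrated with a minimal sketch. It assumes a simplified model in which the chassis LED is simply lit whenever any mounted device reports a fault. The names Device, chassis_fault_led, and devices_to_replace, as well as the device names, are hypothetical and chosen only for illustration; they do not correspond to any particular product or firmware interface.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical model of a device mounted in a chassis."""
    name: str
    fault: bool = False          # device has raised some fault
    needs_removal: bool = False  # fault can only be fixed by replacing hardware


def chassis_fault_led(devices):
    """Assumed aggregation: the chassis LED is lit if any device reports a fault."""
    return any(d.fault for d in devices)


def devices_to_replace(devices):
    """Names of devices that are faulted and actually need to be removed."""
    return [d.name for d in devices if d.fault and d.needs_removal]


devices = [
    Device("blade-1"),
    Device("storage-1", fault=True, needs_removal=False),  # serviceable in place
    Device("power-1"),
]

led_on = chassis_fault_led(devices)
to_replace = devices_to_replace(devices)

# The chassis LED is lit although no hardware needs to be removed.
print(led_on, to_replace)
```

In this sketch the single aggregated indicator reports a fault, yet it conveys neither which device is affected nor whether any device must be removed.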
In view of the foregoing, what are needed are apparatus and methods to provide meaningful alerts and visual indicators (e.g., LEDs) to enable system administrators to identify and handle failure conditions within a distributed system.