1. Field of the Invention
This invention relates in general to computer systems, and in particular to systems management of computer systems.
2. Description of Related Art
Large enterprise computer systems are a very difficult environment to manage. System management must not only handle a wide range of events, such as power failures, fan failures, disk failures, and complex status changes (such as rebooting), but must also handle them in parallel across a large number of computer systems and cabinets, some of which may be geographically remote.
System management for large enterprise computer systems is complex because it must not only detect failures, but also quickly notify every part of the system that may be impacted by a failure. In a large system, it is extremely difficult to determine the impact of an event without an intimate and detailed knowledge of the system. When the system is large, the logistics involved in the distribution of events (even when the system management knows to "whom" to send these events) are no longer simple, straightforward, or low cost.
Even processing a single event in a large system can become quite complex. That single event may need to be processed concurrently by several different processes in order to meet the reliability and serviceability goals required by the "glass house" computing market. For example, one process may communicate the data in the event to the user via a system console. Yet another process may use the same event to build a knowledge base for predicting specific component failure in a system as a method for improving system availability.
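By way of illustration only, the concurrent handling of a single event by several independent processes, as described above, may be sketched as follows. The sketch is a minimal one under stated assumptions: the event fields, the handler names (console_handler, FailureHistory) and the dispatch function are hypothetical and are not taken from any particular system, and threads stand in for what may in practice be separate processes.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def make_event(source, kind, detail):
    # Hypothetical event record; the field names are illustrative only.
    return {"source": source, "kind": kind, "detail": detail}


def console_handler(event):
    # Formats the event as a line that an operator console could display.
    return f"[{event['source']}] {event['kind']}: {event['detail']}"


class FailureHistory:
    """Counts events per (source, kind) as raw material for failure prediction."""

    def __init__(self):
        self.counts = {}
        self._lock = threading.Lock()

    def record(self, event):
        key = (event["source"], event["kind"])
        # The lock keeps the counter consistent when events arrive concurrently.
        with self._lock:
            self.counts[key] = self.counts.get(key, 0) + 1
            return self.counts[key]


def dispatch(event, handlers):
    # Hands the same event to every handler concurrently and collects
    # the results in the order the handlers were listed.
    with ThreadPoolExecutor(max_workers=len(handlers)) as pool:
        futures = [pool.submit(handler, event) for handler in handlers]
        return [future.result() for future in futures]


history = FailureHistory()
event = make_event("cabinet-3", "fan_failure", "fan 2 stopped")
results = dispatch(event, [console_handler, history.record])
```

In this sketch the console handler and the history handler each receive the same event and neither depends on the other, so further consumers could be added to the handler list without changing the existing ones.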
The management of large enterprise computer systems has traditionally been based on a centralized, monolithic design which uses point-to-point communication to connect a single administration console to the set of managed computer systems. However, this centralized approach imposes scalability and connectivity limits that in turn limit how large the computer systems can grow, and it is vulnerable to a single point of failure, since a backup console is not possible. In addition, the monolithic nature of the centralized approach does not easily adapt to change.
Further, since the system console is the centralized collection point for events, applications which extract data from the events in real time tend to be located on the system console for performance reasons. As event processing becomes more complex (to extract more information from each event and perform more processing on it) and the number of events increases with larger and faster systems, the resources of the central collection point, the console, are very quickly consumed. The end result is clearly visible to the customer as a severe performance impact on console management and display functions.
One problem with this centralized model is that it couples the performance of system management to a component (the console) whose performance does not scale automatically whenever the system is expanded (with additional or more powerful systems). Another problem is that a centralized event distribution system may create a single point of failure that could require significant software and hardware expenditures to eliminate.
Thus, there is a need in the art for an infrastructure or architecture that provides efficient distribution of events across every computer system and cabinet.