1. Field of the Invention
The present invention generally relates to the field of clusters of computer systems. More particularly, the present invention relates to methods, systems, and media for correlating error events associated with clusters.
2. Description of the Related Art
Growing demands for expandability, reliability, and flexibility with regard to processing power for computer systems have rendered traditional networked environments outdated for many applications. As business use of computer systems grows, customers are becoming critically dependent on their information technology resources, demanding that these resources always be available. Outages can seriously impact business, causing lost revenue and lost business. For instance, depending upon the nature of the business, the cost of system downtime can range from thousands to millions of dollars.
Clusters offer the continuous availability required by many businesses. A cluster is a collection of one or more systems that work together to provide a unified computing environment. Clusters can be interconnected with high-speed loops such as a local area network (LAN), Opticonnect, or asynchronous transfer mode (ATM) to provide high-speed communications and switchover for data and application resiliency. From the customer's perspective, clusters can operate as a single system while data and applications are actually distributed across multiple systems. Distribution of data and applications from system to system within the cluster is performed in a relatively transparent manner so that planned and unplanned outages will not disrupt services provided to the customer.
Maintenance of clusters demands expeditious identification of errors. Accordingly, cluster management utilities on each system of the cluster monitor systems and loops for errors. In particular, systems have “heartbeat” monitors that watch for software and hardware errors, generate error events to describe the errors, and forward the error events to the customer and, in some cases, to a maintenance provider such as IBM.
The independent generation of error events by multiple systems within clusters has created a new problem for the maintenance of clusters. More specifically, when more than one system identifies an error, multiple error events are generated by the systems and reported to the maintenance provider. Moreover, an error can affect systems of the cluster in different ways, so each system reports the error based upon the effect the error has on the reporting system, creating a multitude of error events that appear to be independent. For example, an error that opens the communication loop between systems of a cluster may be reported by each system connected to the loop. This problem is exacerbated when the maintenance provider is not an administrator of the cluster and, thus, may not be intimately aware of the topology or, at the extreme, may not be aware that the systems are connected to a cluster. Further, in the event of a catastrophic error or site error, systems at the site of the error or in the immediate vicinity of the error may be unable to forward error events to the maintenance provider.
Receipt of error events that appear to be independent complicates repair actions. The different symptoms reported can lead to the dispatch of multiple potential replacement parts and the performance of complicated tasks by service technicians. Current solutions involve drafting service procedures that instruct service technicians to look at the errors reported on all the systems of the same cluster. The service procedures conservatively attempt to identify the actual source of an error without eliminating independent errors, based upon a generic model of a cluster that fails to fully account for differences associated with specific cluster configurations designed by or for different customers. Thus, to avoid elimination of independent errors, maintenance providers may have to address multiple error events that result from the same error.