1. Technical Field
The present invention relates in general to cluster multiprocessing systems and in particular to event handling within cluster multiprocessing systems. Still more particularly, the present invention relates to rollup of queued events within a high availability cluster multi-processing system.
2. Description of the Related Art
High availability (HA) is gaining widespread commercial acceptance as an alternative to fault tolerance for mission-critical computing platforms. Fault tolerant data processing systems rely on specialized hardware to detect hardware faults and switch to a redundant hardware component, regardless of whether the component is a processor, memory board, hard disk drive, adapter, power supply, etc. While providing seamless cutover and uninterrupted performance, fault tolerant systems are expensive due to the redundant hardware requirement. Additionally, fault tolerant systems do not address software errors, a more common source of data processing system failure.
High availability utilizes standard hardware, but provides software allowing resources to be shared system wide. When a node, component, or application fails, an alternative path to the desired resource is quickly established. The brief interruption required to re-establish availability of the resource is acceptable in many situations. The hardware costs are significantly less than fault tolerant systems, and backup facilities may be utilized during normal operation.
Highly available systems are often implemented as clustered multiprocessor (CMP) systems. A cluster includes a plurality of nodes or processors connected to shared resources, such as shared external hard disks. Typically, each node runs a server or "back end" application permitting access to the shared resources. A node may "own" a set of resources--disks, volume groups, file systems, networks, network addresses and/or applications--as long as that node is available. When that node goes down, access to the resources is provided through a different node.
Within clustered multiprocessing systems, it is advantageous to provide an event rollup function. In highly available clusters, various events may occur, including node failure, adapter failure, application failure, etc. Processing these events typically requires coordinated multiphase actions across the cluster, with barriers in between phases. However, certain events may be "rolled up" or subsumed within other events. That is, a first event may require only actions which are a subset of the actions required by a second event. Thus, occurrence of the second event while a response to the first event is pending obviates the need for specifically responding to the first event, since responding to the second event achieves the desired result.
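The subset relationship described above can be sketched in a few lines of Python. This is a minimal illustration only, not the mechanism of any particular cluster product: the event kinds, the action names, the `Event` type, and the functions `subsumes` and `enqueue_with_rollup` are all hypothetical, and the sketch additionally assumes that an event can only be rolled up into another event on the same node.

```python
from collections import deque
from typing import NamedTuple

# Hypothetical mapping from event kind to the recovery actions it requires.
# The names are assumptions chosen for illustration only.
ACTIONS = {
    "adapter_failure": frozenset({"release_addresses"}),
    "application_failure": frozenset({"restart_applications"}),
    "node_failure": frozenset(
        {"release_addresses", "release_disks", "restart_applications"}
    ),
}


class Event(NamedTuple):
    kind: str  # key into ACTIONS
    node: str  # node on which the event occurred


def subsumes(second: Event, first: Event) -> bool:
    """Return True if responding to `second` also covers `first`.

    Assumption: rollup applies only to events on the same node, and only
    when the actions required by `first` are a subset of those required
    by `second`.
    """
    return (
        first.node == second.node
        and ACTIONS[first.kind] <= ACTIONS[second.kind]
    )


def enqueue_with_rollup(queue: deque, new_event: Event) -> deque:
    """Queue `new_event`, dropping pending events that it subsumes."""
    kept = deque(e for e in queue if not subsumes(new_event, e))
    kept.append(new_event)
    return kept


# Usage: a pending adapter failure on node A is rolled up into a later
# node failure on node A; the unrelated event on node B stays queued.
pending = deque()
for event in (
    Event("adapter_failure", "A"),
    Event("application_failure", "B"),
    Event("node_failure", "A"),
):
    pending = enqueue_with_rollup(pending, event)
```

In this sketch the rollup relationships live in a plain data table rather than in the control flow, so changing which events roll up into which would mean editing `ACTIONS` rather than the queueing logic.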
Rolling up events may substantially reduce the overhead of event processing, particularly in large clusters. At least one prior art cluster software package--HACMP for AIX®, available from International Business Machines Corporation of Armonk, N.Y.--provides some event rollup capabilities. However, only a limited set of events is rolled up: adapter failures are rolled up into node failures. Moreover, the rollup function is hardcoded into the cluster software, so a user cannot specify or alter the event rollup information in any manner, much less dynamically (that is, without having to stop and then restart cluster services).
It would be desirable, therefore, to provide cluster software with facilities permitting dynamic specification or alteration of event rollup information by a user.