1. Field of the Invention
The present invention relates to management of a cluster in the event of a failure.
2. Description of Related Art
In certain computing environments, multiple host systems may communicate with a control unit, such as an IBM Enterprise Storage Server (ESS)®, to request data in a storage device managed by the ESS. The control unit provides access to storage devices, such as interconnected hard disk drives, through one or more logical paths. (IBM and ESS are registered trademarks of IBM.) The interconnected drives may be configured as a Direct Access Storage Device (DASD), Redundant Array of Independent Disks (RAID), Just a Bunch of Disks (JBOD), etc. The control unit, also known as a cluster, may include duplicate and redundant processing nodes, also known as processing complexes, to allow for failover to a surviving processing complex in case one fails. The processing complexes may access shared resources such as input/output (I/O) adapters, storage adapters and storage devices.
In the event of a hardware or software failure in one processing complex, the surviving processing complex detects the failure and takes control of all shared resources of the cluster. The surviving processing complex also takes over the processing duties that the failed processing complex was performing.
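The failover behavior described above can be sketched as follows. This is a minimal Python model under stated assumptions, not the control unit's actual implementation: the class names `ProcessingComplex` and `Cluster`, the `alive` flag (standing in for whatever heartbeat or health check a real cluster uses), and all resource and duty names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ProcessingComplex:
    """One processing complex (node) of the cluster; names are hypothetical."""
    name: str
    alive: bool = True  # hypothetical stand-in for a real heartbeat or health check
    duties: list[str] = field(default_factory=list)


@dataclass
class Cluster:
    complexes: dict[str, ProcessingComplex]
    # Shared resource (I/O adapter, storage adapter, device) -> owning complex name.
    owner: dict[str, str]

    def detect_failure(self) -> str | None:
        """Return the name of the first complex marked as failed, if any."""
        for name, pc in self.complexes.items():
            if not pc.alive:
                return name
        return None

    def fail_over(self, failed: str) -> str:
        """Give every shared resource and the failed complex's duties to a survivor."""
        survivor = next(
            (n for n, pc in self.complexes.items() if n != failed and pc.alive), None
        )
        if survivor is None:
            raise RuntimeError("no surviving processing complex")
        # The survivor takes control of all shared resources of the cluster.
        for resource in self.owner:
            self.owner[resource] = survivor
        # The survivor also takes over the processing duties of the failed complex.
        self.complexes[survivor].duties.extend(self.complexes[failed].duties)
        self.complexes[failed].duties.clear()
        return survivor


# Example: complex "A" fails and complex "B" survives.
cluster = Cluster(
    complexes={
        "A": ProcessingComplex("A", duties=["serve host I/O"]),
        "B": ProcessingComplex("B", duties=["destage cache"]),
    },
    owner={"io_adapter_0": "A", "storage_adapter_0": "B", "raid_array_0": "A"},
)
cluster.complexes["A"].alive = False
failed = cluster.detect_failure()
survivor = cluster.fail_over(failed) if failed is not None else None
```

After the failover, complex "B" owns every shared resource, including the adapter it already owned, and carries both sets of duties.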
The resources of each processing complex may be divided into a number of logical partitions (LPAR), in which a computer's processors, memory, and hardware resources are divided into multiple environments. Each environment can be operated independently, with its own operating system and applications. Logical partitioning of a processing complex adds flexibility to workload management on a single server, because the single machine can be partitioned into many logical servers, each with its own set of system resources. The resources assigned to each partition may vary in amount and combination. Also, the number of logical hardware partitions that can be created depends on the hardware system.
Dynamic Logical Partitioning (DLPAR) extends the capability of LPAR by providing the ability to logically attach and detach the resources of a processing complex to and from the operating system of a logical partition without rebooting. This resource allocation can occur not only when activating a logical partition, but also while the partitions are running. Processor, memory, I/O adapter and other partition resources can be released into a “free pool,” acquired from that free pool, or moved directly from one partition to another within a processing complex, in various amounts or combinations. However, each partition generally has at least one processor, memory, an I/O adapter associated with a boot device, and a network adapter.
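The release, acquire, and direct-move operations described above can be illustrated with a small bookkeeping model. This is a minimal sketch under stated assumptions, not an actual DLPAR interface: the class `DlparComplex`, the resource-kind names, and the `MINIMUM` table (derived loosely from the statement that each partition generally keeps a processor, memory, a boot I/O adapter, and a network adapter) are all hypothetical.

```python
from __future__ import annotations

from collections import Counter

# Hypothetical per-partition minimums (illustrative only), following the text: each
# partition generally keeps a processor, memory, a boot I/O adapter and a network adapter.
MINIMUM = {"processor": 1, "memory_gb": 1, "boot_io_adapter": 1, "network_adapter": 1}


class DlparComplex:
    """Hypothetical model of DLPAR resource movement within one processing complex."""

    def __init__(self, free_pool: dict[str, int]) -> None:
        self.free_pool = Counter(free_pool)
        self.partitions: dict[str, Counter] = {}

    def create_partition(self, name: str, alloc: dict[str, int]) -> None:
        """Activate a partition, drawing its initial resources from the free pool."""
        if any(alloc.get(kind, 0) < need for kind, need in MINIMUM.items()):
            raise ValueError(f"{name} does not meet the minimum resources")
        if any(self.free_pool[kind] < amount for kind, amount in alloc.items()):
            raise ValueError("free pool cannot satisfy the allocation")
        self.partitions[name] = Counter()
        for kind, amount in alloc.items():
            self.acquire(name, kind, amount)

    def _check_release(self, name: str, kind: str, amount: int) -> None:
        # A running partition may not drop below its minimum (or below zero).
        if amount <= 0 or self.partitions[name][kind] - amount < MINIMUM.get(kind, 0):
            raise ValueError(f"{name} cannot give up {amount} {kind}")

    def acquire(self, name: str, kind: str, amount: int) -> None:
        """Attach resources from the free pool to a running partition."""
        if amount <= 0 or self.free_pool[kind] < amount:
            raise ValueError(f"free pool cannot supply {amount} {kind}")
        self.free_pool[kind] -= amount
        self.partitions[name][kind] += amount

    def release(self, name: str, kind: str, amount: int) -> None:
        """Detach resources from a running partition back into the free pool."""
        self._check_release(name, kind, amount)
        self.partitions[name][kind] -= amount
        self.free_pool[kind] += amount

    def move(self, src: str, dst: str, kind: str, amount: int) -> None:
        """Move resources directly between partitions without using the free pool."""
        target = self.partitions[dst]  # fail early if the target does not exist
        self._check_release(src, kind, amount)
        self.partitions[src][kind] -= amount
        target[kind] += amount


cec = DlparComplex({"processor": 8, "memory_gb": 64, "boot_io_adapter": 2, "network_adapter": 2})
base = {"processor": 1, "memory_gb": 8, "boot_io_adapter": 1, "network_adapter": 1}
cec.create_partition("lpar1", {**base, "processor": 4})
cec.create_partition("lpar2", base)
cec.release("lpar1", "processor", 2)        # lpar1 drops to 2 processors; pool has 5
cec.acquire("lpar2", "processor", 3)        # lpar2 grows to 4 processors; pool has 2
cec.move("lpar1", "lpar2", "processor", 1)  # lpar1 keeps its minimum of 1
```

Validating the minimum before any counter changes keeps a rejected request from leaving the model half-updated; a further `move` of a processor out of `lpar1` would raise `ValueError` because it would leave the partition with none.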
The movement of an LPAR resource from one logical partition to another within a processing complex may be managed by a supervisor module. To transfer a partition resource, the supervisor module can send a network request to the logical partition that “owns” the partition resource, asking that source logical partition to release the partition resource and put it into a quiesced state. In this manner, the partition resource is stopped and placed under the control of a hypervisor module. The supervisor module can then send a command to the hypervisor module, instructing it to reallocate the partition resource from the source logical partition to a target logical partition. Finally, the supervisor module can send a network request to the target logical partition, instructing it to acquire the partition resource from the hypervisor module and configure it for use by the target logical partition.
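The three-step transfer (release and quiesce at the source, reallocate in the hypervisor, acquire and configure at the target) can be sketched as a sequence of messages. This is a minimal illustration under stated assumptions: the classes `Supervisor`, `Hypervisor`, `LogicalPartition`, and `Resource`, the state strings, and the resource name are hypothetical, and the network requests are modeled as plain method calls rather than any real hypervisor or management-console interface.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    state: str = "in_use"  # "in_use" while owned by a partition, "quiesced" in transit


class Hypervisor:
    """Holds quiesced resources and records which partition each is allocated to."""

    def __init__(self) -> None:
        self.allocation: dict[str, str] = {}  # resource name -> partition name
        self.held: dict[str, Resource] = {}   # quiesced resources under its control

    def take_control(self, resource: Resource) -> None:
        self.held[resource.name] = resource

    def reallocate(self, resource_name: str, source: str, target: str) -> None:
        if resource_name not in self.held or self.allocation.get(resource_name) != source:
            raise ValueError(f"{resource_name} is not a quiesced resource of {source}")
        self.allocation[resource_name] = target

    def hand_over(self, resource_name: str, requester: str) -> Resource:
        if self.allocation.get(resource_name) != requester:
            raise ValueError(f"{resource_name} is not allocated to {requester}")
        return self.held.pop(resource_name)


@dataclass
class LogicalPartition:
    name: str
    resources: dict[str, Resource] = field(default_factory=dict)

    def handle_release(self, resource_name: str, hv: Hypervisor) -> None:
        # Stop (quiesce) the resource and place it under hypervisor control.
        resource = self.resources.pop(resource_name)
        resource.state = "quiesced"
        hv.take_control(resource)

    def handle_acquire(self, resource_name: str, hv: Hypervisor) -> None:
        # Obtain the resource from the hypervisor and configure it for local use.
        resource = hv.hand_over(resource_name, self.name)
        resource.state = "in_use"
        self.resources[resource_name] = resource


class Supervisor:
    """Drives the three-step transfer; 'network requests' are plain method calls here."""

    def __init__(self, hv: Hypervisor) -> None:
        self.hv = hv
        self.log: list[str] = []

    def transfer(self, resource_name: str, source: LogicalPartition,
                 target: LogicalPartition) -> None:
        self.log.append(f"request {source.name}: release {resource_name}")
        source.handle_release(resource_name, self.hv)
        self.log.append(f"command hypervisor: reallocate {resource_name} to {target.name}")
        self.hv.reallocate(resource_name, source.name, target.name)
        self.log.append(f"request {target.name}: acquire {resource_name}")
        target.handle_acquire(resource_name, self.hv)


# Example: move a hypothetical I/O adapter from lpar1 to lpar2.
hv = Hypervisor()
hv.allocation["io_adapter_3"] = "lpar1"
lpar1 = LogicalPartition("lpar1", {"io_adapter_3": Resource("io_adapter_3")})
lpar2 = LogicalPartition("lpar2")
supervisor = Supervisor(hv)
supervisor.transfer("io_adapter_3", lpar1, lpar2)
```

Between the release and acquire steps the resource belongs to neither partition and sits only in the hypervisor's `held` table, which mirrors the quiesced, hypervisor-controlled state described in the text.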