The present techniques relate to computer systems. More specifically, the techniques relate to failure detection for central electronics complex (CEC) group management.
Business continuity requires that a user, such as a company, have a highly available information technology (IT) infrastructure. Cluster-based high availability solutions may provide high availability management. However, with the sprawl of IT infrastructure, cluster-based high availability management may be relatively complex to deploy and manage. An underlying physical infrastructure may be provisioned as a central electronics complex (CEC) group. CEC group management may provide high availability management for a group of computer servers that host a number of logical partitions (LPARs). A CEC is a building block comprising interconnected central processing units (CPUs), physical memory, and a peripheral component interconnect (PCI) backplane. At the basic level, a CEC may be one or more physical servers. A CEC group may provide and monitor a relatively large number of CECs and LPARs (e.g., on the order of hundreds) in an IT infrastructure, and relocate individual LPARs, or all of the LPARs within a CEC, in the physical IT infrastructure as needed.
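As an illustration only, the grouping and relocation described above may be sketched as a small data model. The class names (Cec, CecGroup) and method names (add_cec, relocate_lpar, relocate_all) are hypothetical and are not taken from any actual CEC group management interface; the sketch merely assumes that each LPAR is identified by a unique name and is hosted on exactly one CEC at a time.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Cec:
    """Hypothetical model of one CEC and the names of the LPARs it hosts."""
    name: str
    lpars: set[str] = field(default_factory=set)


@dataclass
class CecGroup:
    """Hypothetical model of a CEC group: a set of CECs managed together."""
    cecs: dict[str, Cec] = field(default_factory=dict)

    def add_cec(self, name: str, lpars=()) -> None:
        """Register a CEC in the group, optionally with initial LPARs."""
        self.cecs[name] = Cec(name, set(lpars))

    def relocate_lpar(self, lpar: str, source: str, target: str) -> None:
        """Relocate a single LPAR from the source CEC to the target CEC."""
        if lpar not in self.cecs[source].lpars:
            raise ValueError(f"{lpar} is not hosted on {source}")
        self.cecs[source].lpars.remove(lpar)
        self.cecs[target].lpars.add(lpar)

    def relocate_all(self, source: str, target: str) -> None:
        """Relocate every LPAR hosted on the source CEC to the target CEC."""
        # sorted() returns a new list, so the set may be modified safely
        # while the loop is running.
        for lpar in sorted(self.cecs[source].lpars):
            self.relocate_lpar(lpar, source, target)
```

In such a sketch, relocate_lpar corresponds to relocating an individual LPAR, and relocate_all corresponds to relocating all of the LPARs within a CEC, for example when the CEC as a whole is to be vacated.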
LPARs are virtualized via virtual input/output servers (VIOSes). The physical computer resources of the CEC, including but not limited to memory and network adapters, are not dedicated to individual LPARs, but rather are shared among the LPARs via the VIOSes, which own the physical computer resources. Each VIOS may run within its own LPAR. A VIOS provides virtualized storage for its associated LPARs. Therefore, each VIOS needs enough storage space for its associated LPARs. The storage space may be provided by a disk storage system, for example, in a storage area network (SAN) environment; however, any other appropriate storage system or group of local disks may be supported and managed by the VIOS.
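The storage requirement described above may be illustrated by a minimal sketch in which a VIOS accepts an LPAR only if sufficient storage space remains. The names Lpar, Vios, capacity_gb, storage_gb, can_host, and attach are hypothetical, and the simple additive accounting of storage in gigabytes is an assumption made for illustration rather than a description of how any actual VIOS allocates storage.

```python
from dataclasses import dataclass, field


@dataclass
class Lpar:
    """Hypothetical LPAR with the virtualized storage it requires."""
    name: str
    storage_gb: int


@dataclass
class Vios:
    """Hypothetical VIOS that owns storage and shares it among its LPARs."""
    name: str
    capacity_gb: int
    lpars: list = field(default_factory=list)

    def used_gb(self) -> int:
        """Total storage already committed to the associated LPARs."""
        return sum(lpar.storage_gb for lpar in self.lpars)

    def can_host(self, lpar: Lpar) -> bool:
        """Return True if the VIOS has enough free storage for the LPAR."""
        return self.used_gb() + lpar.storage_gb <= self.capacity_gb

    def attach(self, lpar: Lpar) -> None:
        """Associate the LPAR with this VIOS, or raise if storage is short."""
        if not self.can_host(lpar):
            raise ValueError(f"{self.name} lacks storage for {lpar.name}")
        self.lpars.append(lpar)
```

For example, a VIOS with a capacity of 100 GB that already serves an LPAR requiring 60 GB could accept a further LPAR requiring 40 GB, but not one requiring 50 GB.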