As customers consolidate more workloads on servers, they rely on those servers more than ever, and their expectations for server reliability, availability, and serviceability increase. At the same time, smaller semiconductor features, high-speed serial links, and stringent power requirements are driving error rates up, so that meeting user expectations without significantly inflating the cost of the server requires significant effort.
Computer systems can be divided into two basic types: partitionable systems, in which the hardware and firmware allow more than one operating system (OS) to run at the same time in different partitions without any special features being present in the OS, and non-partitionable systems, in which they do not. Partitionable computer systems may have one or more cells or blades, with each cell containing a processor, memory, and I/O connections. Multiple cells may be housed in an enclosure and interconnected by high-speed communication links. A data processing system may include several interconnected enclosures acting as one or more computer systems of one complex data processing system. Any set of cells may form a computer system, referred to as a partition, running an OS.
In complex systems, a manageability processor typically exists in each cell. The manageability processors are connected to each other and to a common system administrator via an internal connection, such as an Ethernet or USB connection, and cooperate to manage the complex system. System firmware (FW) also communicates with the manageability processors and helps manage partitions. System firmware runs on the host processors and is similar to the BIOS on PCs, although it is broken into separate components with standard interfaces, such as the Processor Abstraction Layer (PAL) provided by Intel and the System Abstraction Layer (SAL) provided by Hewlett-Packard.
Computer systems conventionally monitor hardware errors on the system using OS-based agents. It is important that error monitoring software observe the hardware continuously over time, because a single corrected error is typically not meaningful on its own. Today's integrated circuits (ICs) have been miniaturized to the point where cosmic radiation is expected to cause an occasional bit-flip in large silicon structures. ICs have therefore been designed to detect and correct these errors, such as by using extra memory bits to implement an error correction code. A single error, or even a few errors happening at the same time, is not unexpected and does not indicate that the integrated circuit is faulty. The situation is similar with high-speed serial links: an occasional error can occur due to natural circumstances, can be detected and corrected, and does not imply a hardware problem. Hardware monitoring therefore relies on the analysis of errors that occur on an integrated circuit or high-speed serial link over time.
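The over-time analysis described above can be illustrated with a minimal sketch. It assumes a simple sliding-window threshold, which is only one possible policy; the class name `CorrectedErrorMonitor`, the window length, and the threshold are hypothetical values chosen for illustration and are not taken from any particular monitoring agent. Timestamps are assumed to arrive in non-decreasing order for each component.

```python
from collections import defaultdict, deque


class CorrectedErrorMonitor:
    """Flags a component only when corrected errors cluster in time."""

    def __init__(self, window_seconds=86400.0, threshold=10):
        # Both defaults are illustrative assumptions, not vendor figures.
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = defaultdict(deque)  # component id -> error timestamps

    def record(self, component, timestamp):
        """Record one corrected error; return True if the component looks suspect."""
        events = self._events[component]
        events.append(timestamp)
        # Discard errors that have aged out of the observation window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        return len(events) >= self.threshold


monitor = CorrectedErrorMonitor(window_seconds=3600.0, threshold=3)

# Two corrected errors on the same part, hours apart: not suspicious.
isolated = monitor.record("dimm0", 0.0)
spread_out = monitor.record("dimm0", 10_000.0)

# Three corrected errors within two minutes: the third one trips the threshold.
burst = [monitor.record("dimm1", t) for t in (0.0, 60.0, 120.0)]
```

In this sketch a lone bit-flip never raises a flag, which mirrors the point that a single corrected error carries no diagnostic weight; only the accumulated history for a given component does.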
Virtualization allows partitioned servers to make more effective use of their resources by allowing resources to move dynamically between partitions as needed. Virtualization decouples an OS instance from the hardware, allowing a server to context swap between OS instances. Resources, such as CPU cores, can be shifted between partitions depending on system load and various other criteria, and this shifting can take place while the OSes are running. Monitors running under an OS in a partition see only errors that occur on hardware currently associated with that OS. Because resources may be moved between partitions, an OS may have access to only an incomplete history of the errors occurring in a particular hardware resource over time, or multiple OSes may duplicate error reporting for a shared hardware resource. In addition, on-line diagnostics that run in the operating systems of the individual partitions require field service personnel to log into each partition to discover the health of the system.
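The fragmented error history described above can be made concrete with a toy simulation. Everything in it is hypothetical: the resource name `core7`, the partitions `A` and `B`, and the event timeline are invented for illustration, and the per-resource tally exists only as a point of comparison, not as a description of any actual monitoring mechanism.

```python
from collections import Counter

# Hypothetical timeline of (time, event, payload). The CPU core "core7"
# starts in partition "A" and is moved to partition "B" partway through.
timeline = [
    (0, "assign", ("core7", "A")),
    (1, "error", "core7"),
    (2, "error", "core7"),
    (3, "assign", ("core7", "B")),
    (4, "error", "core7"),
    (5, "error", "core7"),
    (6, "error", "core7"),
]

owner = {}                     # resource -> partition that currently owns it
seen_by_partition = Counter()  # (partition, resource) -> errors that OS observed
seen_by_resource = Counter()   # resource -> errors over its whole history

for _time, event, payload in timeline:
    if event == "assign":
        resource, partition = payload
        owner[resource] = partition
    else:
        resource = payload
        # An OS-based monitor observes an error only while its own
        # partition owns the resource on which the error occurred.
        seen_by_partition[(owner[resource], resource)] += 1
        seen_by_resource[resource] += 1

for (partition, resource), count in sorted(seen_by_partition.items()):
    print(f"partition {partition} saw {count} errors on {resource}")
print(f"full history for core7: {seen_by_resource['core7']} errors")
```

Neither partition's monitor sees all five errors on the core: one sees two and the other sees three, so a threshold that the complete history would cross may never be reached from inside any single OS.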