The present invention is directed to the field of computer systems, and more specifically to a unified, workload-optimized, adaptive Reliability, Availability, and Serviceability (RAS) for hybrid systems.
Hybrid systems are heterogeneous computing environments and may include a combination of servers with different architectures or instruction sets. Some of these servers may be highly reliable, such as the IBM System z and other mainframe systems. Other components of a hybrid system may include commodity attachments such as appliances, x86 blades, and accelerators such as graphics processing units (GPUs) and field-programmable gate arrays (FPGAs). These commodity attachments may have lower Reliability, Availability, and Serviceability (RAS) than high-end mainframe systems.
Assume a system of systems comprising System A and System B. Let R denote a reliability function giving the probability that an entity will not have failed by time t, given that it had not failed at time t=0. Assume R(A)>R(B). For a workload that depends on both systems, and assuming the two systems fail independently, the effective reliability is the product R(A)×R(B). If R(A) is 0.8 and R(B) is 0.1, then the effective reliability is R(A)×R(B), or 0.08. This value is lower than R(B). Thus, the effective reliability is lower than that of the weakest link in the chain.
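The product rule in the preceding paragraph can be checked with a short calculation. The following is a minimal sketch only, assuming a series model in which the workload needs every system to survive and the systems fail independently. The function name `series_reliability` is hypothetical and chosen for illustration; it is not part of the described invention.

```python
from math import prod


def series_reliability(reliabilities):
    """Effective reliability when every component must survive.

    Assumes independent failures (an illustrative assumption), so the
    effective reliability is the product of the individual values.
    """
    values = list(reliabilities)
    for r in values:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"reliability must lie in [0, 1], got {r}")
    return prod(values)


# Values taken from the example in the text: R(A) = 0.8, R(B) = 0.1.
r_a, r_b = 0.8, 0.1
effective = series_reliability([r_a, r_b])

print(f"effective reliability: {effective:.2f}")
print("lower than weakest link:", effective < min(r_a, r_b))
```

Because every factor is at most 1, the product can never exceed the smallest individual reliability, and it is strictly smaller whenever the other factors are below 1 and the smallest is above 0.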