During the past decade, there has been tremendous growth in the usage of so-called “cloud-hosted” services. Examples of such services include e-mail services provided by Microsoft (Hotmail/Outlook online), Google (Gmail), and Yahoo (Yahoo Mail), productivity applications such as Microsoft Office 365 and Google Docs, and Web service platforms such as Amazon Web Services (AWS), including its Elastic Compute Cloud (EC2), and Microsoft Azure. Cloud-hosted services are typically implemented using data centers that have a very large number of compute resources, implemented in racks of various types of servers, such as blade servers filled with server blades and/or modules and other types of server configurations (e.g., 1U, 2U, and 4U servers).
A key measure for both server OEMs (Original Equipment Manufacturers) and CSPs (Cloud Service Providers) is Reliability, Availability, and Serviceability (RAS). While RAS was first used by IBM to define specifications for its mainframes and originally applied to hardware, in today's data centers RAS also applies to software and networks.
Server OEMs and CSPs have repeatedly indicated that hardware RAS handling needs to keep improving system resilience. In today's virtualized environments, multiple virtual machines (VMs) run on host processors with multiple cores, with each VM hosting a set of one or more services (or applications/processes associated with such services). In the event of a processor fault requiring a reset, the resulting loss of host availability might affect dozens of services, representing a significant cost. Specifically, when the silicon (i.e., processor hardware) detects an uncorrectable error, it signals a machine check and typically logs an error with PCC=1 (processor context corrupt). Such errors require that the silicon be reset in order to function properly. Under current approaches, this necessitates taking down any software running on the processor, including all VMs and thus all software-based services hosted by the VMs.
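By way of illustration only, the PCC=1 condition described above can be sketched as a check on a machine-check bank status value. The sketch assumes the x86 machine-check architecture layout of the IA32_MCi_STATUS register (VAL at bit 63, UC at bit 61, PCC at bit 57); the function name needs_processor_reset is hypothetical and is not part of any existing interface.

```python
# Assumed bit positions in an x86 IA32_MCi_STATUS register value.
MCI_STATUS_VAL = 1 << 63  # the bank holds a valid logged error
MCI_STATUS_UC = 1 << 61   # the error was not corrected by hardware
MCI_STATUS_PCC = 1 << 57  # processor context corrupt


def needs_processor_reset(status: int) -> bool:
    """Hypothetical helper: True when a valid, uncorrected error has PCC=1."""
    if not status & MCI_STATUS_VAL:
        # Nothing valid is logged in this bank, so there is nothing to act on.
        return False
    return bool(status & MCI_STATUS_UC) and bool(status & MCI_STATUS_PCC)
```

Under this sketch, a status value with VAL, UC, and PCC all set corresponds to the case in which the silicon must be reset, while an uncorrected error with PCC=0 leaves open the possibility of recovery without a reset.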
Hardware advancements, meanwhile, include support for various resiliency features, such as enhanced Machine Check, firmware-first handling of errors, and memory poisoning, to ensure higher levels of availability (system uptime) and serviceability. While these advancements are noteworthy, OEMs and CSPs would like to take fuller advantage of these capabilities than is currently possible.