The present application relates generally to an improved data processing apparatus and method and, more specifically, to non-volatile memory based reliability and availability mechanisms for computing devices.
System-level reliability is a key design constraint for many computing devices, such as server computing devices. Redundant computational units, reliability engines, and other dedicated reliability functionality are common practice in current high-end server designs. While such functionality generally improves reliability, it does not improve recovery time, since the reliability functions fail along with the rest of the server in a serious failure condition.
Most processors on the market today contain functionality whose sole purpose is to improve reliability. While such functionality is effective in enhancing reliability, it is of limited use when a serious failure causes the computing device, e.g., the server, to power down. Most of the data stored in the specialized controllers, table data structures, and other reliability engines is lost at power down.
In theory, the data in these reliability structures can be stored in external software logs that remain available after the server powers down. However, this involves data-center-level server logs, which require specialized software to sort through a significant amount of data in order to analyze the source of the failure. Furthermore, if the failure is caused by software, such information can still be lost at power down, since the state of the software is not maintained even in these external server logs, leaving little meaningful data for diagnostics.