Users of computing systems expect extremely high reliability from the underlying computing infrastructure, and demand zero loss of data even if there is a failure event in a component of the computing infrastructure.
Some computing system component manufacturers have approached the need for extremely high reliability by designing redundancy into individual components (e.g., network components, computing components, storage components, software components, etc.) of the computing system. In particular, nearly every non-volatile storage component manufacturer supplies combinations of hardware and software that are intended to address device failure detection. Manufacturers address failure detection to varying degrees. Some manufacturers design in only minimal capabilities, such as merely reporting the occurrence and nature of a detected failure. Some manufacturers address failover and failback in various ways as well.
Unfortunately, the aforementioned manufacturers cannot predict what might be appropriate remediation (e.g., to carry out failover or to handle failback) in the event of a device failure. More specifically, manufacturers cannot predict what actions might be appropriate to take in all systems and/or under all failure conditions. For example, although some approaches involve hardware timers to assist with device locking and “cutoff” in the event of a failure (e.g., a failure of a controller), in many systems and under many circumstances, it might not be appropriate to “cutoff” or “shutoff” a component merely on the basis of a hardware timeout. Rather, various system conditions should be examined so as to determine whether or not synchronization and other operations need to be performed before a “cutoff” or “shutoff” or “cutover”.
Indeed, in some cases (e.g., in some systems and/or under some conditions), a hardware-assisted “cutoff” and/or “cutover” to a standby device is the wrong action to take. As an example, it would be inappropriate to “cutover” from one device to another device if there is I/O (input/output or IO) pending for the soon-to-be “cutoff” device. Furthermore, even though most non-volatile storage device manufacturers do supply software drivers to coordinate with the supplied hardware devices, the approaches and the delivered support for the features needed for high availability vary by vendor. In certain cases, the manufacturer-supplied driver implementation of high-availability features might be buggy, might be missing needed features, and/or might be deficient in other ways. What is needed is a way to achieve zero errors and zero downtime in a non-volatile storage appliance that does not rely on the presence or quality of the device manufacturer's supplied high-availability features.
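Strictly as an illustration of the pending-I/O example above, the following minimal sketch shows a decision routine that examines system conditions before permitting a “cutover”, rather than acting on a hardware timeout alone. The names `DeviceState`, `cutover_decision`, and the returned action strings are hypothetical and are not drawn from any manufacturer's driver or from any particular system; the ordering of the checks is likewise an assumption.

```python
from dataclasses import dataclass


@dataclass
class DeviceState:
    """Hypothetical snapshot of conditions for an active storage device."""
    timer_expired: bool   # a hardware timeout has fired for the device
    pending_io: int       # count of I/O operations still in flight
    synchronized: bool    # standby device holds an up-to-date copy


def cutover_decision(state: DeviceState) -> str:
    """Return an illustrative action name based on examined conditions."""
    if not state.timer_expired:
        # No timeout has been reported, so no remediation is considered.
        return "none"
    if state.pending_io > 0:
        # Cutting over with I/O pending would be inappropriate.
        return "drain_io"
    if not state.synchronized:
        # Synchronization must be performed before any cutover.
        return "synchronize"
    # Timeout reported, no pending I/O, standby is synchronized.
    return "cutover"
```

Under these assumptions, a timeout that occurs while I/O is still pending yields a request to drain the I/O rather than an immediate “cutover”.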
Some of the approaches described in this background section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.