A virtualized cluster is a cluster of different storage nodes that together expose a single storage device. Input/Output operations (“I/Os”) sent to the cluster are internally re-routed to read and write data to the appropriate locations. In this regard, a virtualized cluster of storage nodes can be considered analogous to a collection of disks in a Redundant Array of Inexpensive Disks (“RAID”) configuration, since a virtualized cluster hides the internal details of the cluster's operation from initiators and presents a unified device instead.
In a virtualized cluster, data may also be mirrored between nodes such that copies of the data are stored in two or more locations. In a mirrored system, the data may still be available at a second node should a first node become unavailable because of hardware failure, network congestion, link failure, or otherwise. In a mirrored system, the data on each node is duplicated to other storage units. Duplication can be made at the same time as an initial write I/O or it can be done later, in a background operation. When the duplication is done at the same time as an initial write, it is called a synchronous duplication. In contrast, a later duplication performed in the background may be called an asynchronous duplication. In either synchronous or asynchronous mirroring systems, one of the main requirements of operation is to maintain the consistency of data across all of the mirror nodes. This results in predictable data retrieval irrespective of the mirrored storage node from which the data is accessed.
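By way of illustration only, the distinction between synchronous and asynchronous duplication may be sketched as follows. The sketch assumes a single primary and a single secondary modeled as in-memory dictionaries; the names `MirroredPair`, `background_sync`, and `consistent` are hypothetical and are used only for this example.

```python
from collections import deque


class MirroredPair:
    """Toy model (hypothetical) of a primary mirroring writes to one secondary."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = {}       # block address -> data held by the primary
        self.secondary = {}     # block address -> data held by the secondary
        self.pending = deque()  # writes not yet duplicated (asynchronous mode)

    def write(self, address, data):
        """Handle a write I/O from an initiator."""
        self.primary[address] = data
        if self.synchronous:
            # Synchronous duplication: copy at the same time as the initial write.
            self.secondary[address] = data
        else:
            # Asynchronous duplication: defer the copy to a background operation.
            self.pending.append((address, data))

    def background_sync(self):
        """Duplicate deferred writes to the secondary in arrival order."""
        while self.pending:
            address, data = self.pending.popleft()
            self.secondary[address] = data

    def consistent(self):
        """Return True when both mirrors hold identical data."""
        return self.primary == self.secondary
```

In the asynchronous case, the two mirrors may differ between the initial write and the next background pass, which is one reason consistency must be actively maintained.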
Data can be written to a storage node by issuing an I/O request to the node. The I/O request is issued by an initiator. The initiator may be another node, a computer, an application on a computer, or a user of a computer. When data is written to a storage node, that node may be referred to as a primary node. The primary node may then mirror the data to one or more other nodes that can be referred to as secondary nodes. Again, it is an important operational requirement that data between mirrored nodes be consistent. Because the data writes at the respective mirrored volumes may not be instantaneous, or atomic, data inconsistencies may occur due to any one of various pathological scenarios.
One pathological scenario occurs when the primary node stores new data and then attempts to mirror the data to a secondary node, but the attempt fails. This failure may be due to a network link failure, a hardware failure at the secondary node, or several other factors. Another pathological scenario occurs when the primary node stores data and then mirrors the data to a secondary node, but the secondary node suffers a power failure before or during the write of the new data to disk. In all of these scenarios, and in other mirroring failure scenarios, the nodes may eventually come back online with inconsistent data on mirrored nodes. This is highly undesirable since an initiator may now retrieve different data depending upon the mirrored node to which the request is issued.
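The first pathological scenario described above may be sketched, under simplifying assumptions, as follows. The `Node` class and the `mirrored_write` function are hypothetical names introduced only for illustration, and an unreachable secondary is modeled by raising `ConnectionError`.

```python
class Node:
    """Hypothetical storage node holding blocks in a dictionary."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.blocks = {}

    def store(self, address, data):
        if not self.online:
            # Models a link failure or hardware failure at this node.
            raise ConnectionError(f"{self.name} is unreachable")
        self.blocks[address] = data

    def read(self, address):
        return self.blocks.get(address)


def mirrored_write(primary, secondary, address, data):
    """Store on the primary, then attempt to mirror. Return True if mirrored."""
    primary.store(address, data)
    try:
        secondary.store(address, data)
        return True
    except ConnectionError:
        # The primary already holds the new data; the secondary does not.
        return False


primary, secondary = Node("primary"), Node("secondary")
mirrored_write(primary, secondary, 7, b"old")   # both nodes hold b"old"

secondary.online = False                         # the mirror attempt will fail
mirrored = mirrored_write(primary, secondary, 7, b"new")

secondary.online = True                          # the secondary comes back online
same = primary.read(7) == secondary.read(7)      # False: the mirrors now disagree
```

After the secondary returns, an initiator reading address 7 would receive different data depending upon the node to which the read is issued.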
A drive cache is generally data stored in memory that duplicates data stored on the associated disk drive. Since memory is typically much faster than a drive, the drive data is slow to fetch relative to the speed of reading the cache. In other words, a cache is a temporary, fast storage area where data can be stored for rapid access. Once data is stored in a cache, subsequent accesses can be served from the cache instead of from the slower drive. In a write-through cache system, every write is written to both the cache and the drive. In contrast, a write-back cache system stores every write into the cache but may not immediately store the write into the drive. Instead, the write-back cache system tracks which cache memory locations have been modified by marking those cache entries as “dirty”. The data in the dirty cache locations is written back to the drive when triggered at a later time. Writing back of the dirty cache entries upon such a trigger is referred to as “flushing the cache” or “flushing the cache to disk”. Example triggers to flush the cache include eviction of the cache entry, shutting down the drive, or periodic cache flushing timers. A write-back cache system is also referred to as a write-behind cache system.
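The difference between write-through and write-back caching may be sketched as follows. This is a minimal model that assumes the drive and the cache are plain dictionaries and that flushing is triggered explicitly; the class name `CachedDrive` and its methods are hypothetical.

```python
class CachedDrive:
    """Hypothetical drive with either a write-through or a write-back cache."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.drive = {}     # persistent contents of the disk drive
        self.cache = {}     # fast in-memory copies of drive data
        self.dirty = set()  # cached addresses not yet written to the drive

    def write(self, address, data):
        self.cache[address] = data
        if self.write_back:
            # Write-back: mark the entry dirty and defer the drive write.
            self.dirty.add(address)
        else:
            # Write-through: every write goes to both the cache and the drive.
            self.drive[address] = data

    def read(self, address):
        if address in self.cache:
            return self.cache[address]      # served from the fast cache
        data = self.drive.get(address)      # slower fetch from the drive
        if data is not None:
            self.cache[address] = data
        return data

    def flush(self):
        """Flush the cache to disk: write back every dirty entry."""
        for address in self.dirty:
            self.drive[address] = self.cache[address]
        self.dirty.clear()
```

With `write_back=True`, a write is visible to readers immediately but reaches the drive only after `flush()` is called, for example by an eviction, a shutdown, or a periodic timer.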
Additional complications to the pathological scenarios described above occur when a write-back cache is used in a primary and/or secondary storage node. For example, both a primary and a secondary storage node may have received the same data to be mirrored, but the data is cached and has not yet been flushed to disk when one of the nodes suffers a power failure. In this instance, the write I/O was received by the failed node but was never made persistent on its disk drive. Thus, the data will be inconsistent between the two storage nodes after the failed node recovers from the power failure.
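The write-back complication described above may be illustrated with the following sketch, which assumes that a power failure discards all cached data that has not been flushed. The class `WriteBackNode` and its `power_fail` method are hypothetical and model only this one behavior.

```python
class WriteBackNode:
    """Hypothetical storage node with a volatile write-back cache."""

    def __init__(self):
        self.disk = {}      # persistent storage, survives a power failure
        self.cache = {}     # volatile memory, lost on a power failure
        self.dirty = set()  # cached addresses not yet flushed to disk

    def write(self, address, data):
        self.cache[address] = data
        self.dirty.add(address)

    def flush(self):
        for address in self.dirty:
            self.disk[address] = self.cache[address]
        self.dirty.clear()

    def power_fail(self):
        # Assumption: unflushed cache contents are lost entirely.
        self.cache.clear()
        self.dirty.clear()

    def read(self, address):
        if address in self.cache:
            return self.cache[address]
        return self.disk.get(address)


primary, secondary = WriteBackNode(), WriteBackNode()
for node in (primary, secondary):
    node.write(3, b"old")
    node.flush()                  # both disks now hold b"old"

for node in (primary, secondary):
    node.write(3, b"new")         # both nodes received the mirrored write

primary.flush()                   # the primary persists b"new"
secondary.power_fail()            # the secondary loses its cached b"new"

diverged = primary.read(3) != secondary.read(3)   # True
```

Although both nodes acknowledged the same write, only the primary made it persistent, so the mirrors disagree once the secondary recovers.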
It is with respect to these considerations and others that the disclosure made herein is presented.