A virtualized cluster is a cluster of different storage nodes that together expose a single storage device. Input/output operations (“I/Os”) sent to the cluster are internally re-routed to read and write data to the appropriate locations. In this regard, a virtualized cluster of storage nodes can be considered analogous to a collection of disks in a Redundant Array of Inexpensive Disks (“RAID”) configuration, since a virtualized cluster hides the internal details of the cluster's operation from initiators and presents a unified device instead.
In a virtualized cluster, data may also be mirrored between nodes such that copies of the data are stored in two or more locations. In a mirrored system, the data on each node is duplicated to other storage units, so the data may still be available at a second node should a first node become unavailable because of hardware failure, network congestion, link failure, or otherwise. Duplication can be made at the same time as an initial write I/O or it can be done later, in a background operation. When the duplication is done at the same time as an initial write, it is called synchronous duplication. Synchronous duplication is a form of in-line replication: every I/O to the primary server is replicated to the secondary server in-line before the write is acknowledged to the application server. In contrast, a later duplication performed in the background may be called asynchronous duplication. In either synchronous or asynchronous mirroring systems, one of the main requirements of operation is to maintain the consistency of data across all of the mirror nodes. This results in predictable data retrieval irrespective of the mirrored storage node from which the data is accessed.
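The difference between synchronous and asynchronous duplication can be illustrated with a minimal sketch. The `Node` and `Mirror` classes, and the use of an in-memory dictionary in place of real storage, are assumptions made for illustration only and do not describe any particular product.

```python
from collections import deque


class Node:
    """Hypothetical storage node: a dict mapping block number to data."""

    def __init__(self):
        self.blocks = {}

    def write(self, block, data):
        self.blocks[block] = data


class Mirror:
    """Hypothetical primary/secondary pair with a selectable duplication mode."""

    def __init__(self, synchronous):
        self.primary = Node()
        self.secondary = Node()
        self.synchronous = synchronous
        self.pending = deque()  # writes not yet duplicated (asynchronous mode)

    def write(self, block, data):
        self.primary.write(block, data)
        if self.synchronous:
            # Duplicate in-line, before the initiator is acknowledged.
            self.secondary.write(block, data)
        else:
            # Acknowledge now; duplicate later in the background.
            self.pending.append((block, data))
        return "ack"

    def background_sync(self):
        """Drain queued writes to the secondary (asynchronous mode)."""
        while self.pending:
            self.secondary.write(*self.pending.popleft())

    def consistent(self):
        return self.primary.blocks == self.secondary.blocks


sync_mirror = Mirror(synchronous=True)
sync_mirror.write(0, b"A")
sync_consistent_at_ack = sync_mirror.consistent()

async_mirror = Mirror(synchronous=False)
async_mirror.write(0, b"A")
async_consistent_at_ack = async_mirror.consistent()
async_mirror.background_sync()
async_consistent_after_sync = async_mirror.consistent()
```

In this sketch the synchronous mirror is consistent at the moment the write is acknowledged, whereas the asynchronous mirror becomes consistent only after the background operation has run.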
Data can be written to a storage node by issuing an I/O request to the node. The I/O request is issued by an initiator. The initiator may be another node, a computer, an application on a computer, or a user of a computer. When data is written to a storage node, that node may be referred to as a primary node. The primary node may then mirror the data to one or more other nodes that can be referred to as secondary nodes. It is an important operational requirement that data between mirrored nodes be consistent. Because the data writes at the respective mirrored volumes may not all be instantaneous, or atomic, data inconsistencies may occur due to any one of various pathological scenarios.
One pathological scenario occurs when the primary node stores new data and then attempts to mirror the data to a secondary node, but the attempt fails. This failure may be due to a network link failure, a hardware failure at the secondary node, or other factors. Another pathological scenario occurs when the primary node stores data and then mirrors the data to a secondary node, but the secondary node suffers a power failure before or during the write of the new data to disk. In these scenarios, and in other mirroring failure scenarios, the nodes may eventually come back online with inconsistent data on the mirrored nodes. This is undesirable since an initiator may now retrieve different data depending upon the mirrored node to which the request is issued.
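The first scenario above can be sketched as follows. The `Node` class, the `reachable` flag used to simulate a link failure, and the `mirrored_write` helper are hypothetical names introduced only for this illustration.

```python
class Node:
    """Hypothetical storage node whose link can be marked as down."""

    def __init__(self, reachable=True):
        self.blocks = {}
        self.reachable = reachable

    def write(self, block, data):
        if not self.reachable:
            raise ConnectionError("node unreachable")
        self.blocks[block] = data

    def read(self, block):
        return self.blocks.get(block)


def mirrored_write(primary, secondary, block, data):
    """Store on the primary, then try to mirror. Returns True if mirrored."""
    primary.write(block, data)
    try:
        secondary.write(block, data)
        return True
    except ConnectionError:
        return False


primary, secondary = Node(), Node()
mirrored_write(primary, secondary, 7, b"old")

secondary.reachable = False  # simulated link failure
mirrored = mirrored_write(primary, secondary, 7, b"new")
secondary.reachable = True   # the secondary comes back online

# The same block now returns different data depending on the node queried.
reads = (primary.read(7), secondary.read(7))
```

After the simulated failure, the primary holds the new data while the secondary still holds the old data, which is the inconsistency described above.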
A drive cache is generally data stored in memory that duplicates data stored on the associated disk drive. Since memory is typically much faster than a drive, the drive data is slow to fetch relative to the speed of reading the cache. In other words, a cache is a temporary, fast storage area where data can be stored for rapid access. Once data is stored in a cache, future requests can be served from the cache instead of from the slower drive. In a write-through cache system, every write is written to both the cache and the drive. In contrast, a write-back cache system stores every write into the cache but may not immediately store the write into the drive. Instead, the write-back cache system tracks which cache memory locations have been modified by marking those cache entries as “dirty”. The data in the dirty cache locations is written back to the drive when triggered at a later time. Writing back of the dirty cache entries upon such a trigger is referred to as “flushing the cache” or “flushing the cache to disk”. Example triggers to flush the cache include eviction of the cache entry, shutting down the drive, or periodic cache flushing timers. A write-back cache system is also referred to as a write-behind cache system.
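The two cache policies can be contrasted in a short sketch. The `CachedDrive` class and its dictionary-backed "drive" are assumptions made for illustration; a real drive cache also bounds its size and evicts entries, which is omitted here.

```python
class CachedDrive:
    """Hypothetical drive with either a write-through or a write-back cache."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.drive = {}     # stands in for persistent storage
        self.cache = {}     # stands in for fast, volatile memory
        self.dirty = set()  # blocks modified in cache but not yet on the drive

    def write(self, block, data):
        self.cache[block] = data
        if self.write_back:
            self.dirty.add(block)     # defer the drive write
        else:
            self.drive[block] = data  # write-through: update the drive now

    def read(self, block):
        if block in self.cache:
            return self.cache[block]  # fast path
        data = self.drive.get(block)
        if data is not None:
            self.cache[block] = data  # populate the cache for later reads
        return data

    def flush(self):
        """Write all dirty cache entries back to the drive."""
        for block in sorted(self.dirty):
            self.drive[block] = self.cache[block]
        self.dirty.clear()


wt = CachedDrive(write_back=False)
wt.write(1, b"x")

wb = CachedDrive(write_back=True)
wb.write(1, b"x")
drive_before_flush = dict(wb.drive)
wb.flush()
drive_after_flush = dict(wb.drive)
```

With write-through, the drive is updated as part of the write. With write-back, the drive is updated only when `flush` is called, which here stands in for any of the flush triggers listed above.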
Additional complications to the pathological scenarios described above occur when a write-back cache is used in a primary storage node and/or a secondary storage node. For example, both a primary storage node and a secondary storage node may have received the same data to be mirrored, but the data is cached and has not yet been flushed to disk when one of the nodes suffers a power failure. In this instance, one of the data write I/Os was received but not made persistent on the disk drive. Thus, the data will be inconsistent between the two storage nodes after the failed node recovers from the power failure.
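This write-back complication can be sketched as follows. The `WriteBackNode` class and its `power_failure` method, which simply discards unflushed cache contents, are hypothetical and serve only to illustrate the example in the preceding paragraph.

```python
class WriteBackNode:
    """Hypothetical storage node with a volatile write-back cache."""

    def __init__(self):
        self.disk = {}   # persistent
        self.cache = {}  # dirty entries awaiting a flush (volatile)

    def write(self, block, data):
        self.cache[block] = data  # received, but not yet persistent

    def flush(self):
        self.disk.update(self.cache)
        self.cache.clear()

    def power_failure(self):
        self.cache.clear()  # unflushed writes are lost

    def read(self, block):
        return self.cache.get(block, self.disk.get(block))


primary, secondary = WriteBackNode(), WriteBackNode()

# Both nodes receive and persist an initial version of block 3.
for node in (primary, secondary):
    node.write(3, b"v1")
    node.flush()

# Both nodes receive the mirrored update, but only the primary flushes
# before the secondary loses power.
primary.write(3, b"v2")
secondary.write(3, b"v2")
primary.flush()
secondary.power_failure()

reads = (primary.read(3), secondary.read(3))
```

Both nodes received the update, yet after the power failure the secondary returns the earlier version, because its copy of the update existed only in cache.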
In such cases, resynchronization is needed to return the replication solution to an optimal state. A known solution for resynchronization uses write-intent logging, known as gating. Gating tracks every I/O that could cause a difference between the mirrored nodes. Although gating solves some issues with respect to link failures and write-back cache phenomena in primary and secondary nodes due to abrupt power failures, gating adds the overhead of maintaining gate tables and bitmaps in the primary and secondary nodes and persisting these bitmaps across reboots. Moreover, tracking and persisting every block that receives an I/O, and serializing this operation before the actual I/O, adds write latency to application server I/Os.
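A generic write-intent bitmap can be sketched as follows to show where the overhead arises. This is not the gating implementation referred to above, whose details are not given here. The `GatedMirror` class, the `CHUNK_BLOCKS` granularity, and the `persist_count` counter that stands in for persisting the bitmap are all assumptions made for illustration.

```python
CHUNK_BLOCKS = 4  # hypothetical number of blocks covered by one bitmap bit


class GatedMirror:
    """Hypothetical mirror that records write intent before each write."""

    def __init__(self):
        self.primary = {}
        self.secondary = {}
        self.gate_bitmap = set()  # chunks that may differ between the nodes
        self.persist_count = 0    # times the bitmap had to be persisted

    def _persist_bitmap(self):
        # Stands in for a synchronous write of the bitmap to stable storage.
        self.persist_count += 1

    def write(self, block, data, secondary_up=True):
        chunk = block // CHUNK_BLOCKS
        opened = chunk not in self.gate_bitmap
        if opened:
            # Record intent first; this is serialized ahead of the data write.
            self.gate_bitmap.add(chunk)
            self._persist_bitmap()
        self.primary[block] = data
        if secondary_up:
            self.secondary[block] = data
            if opened:
                # Both copies match again, so the gate can be closed.
                self.gate_bitmap.discard(chunk)
                self._persist_bitmap()
        # Otherwise the bit stays set and marks the chunk for resynchronization.

    def resync(self):
        """Copy only the chunks whose bits are set. Returns blocks copied."""
        copied = 0
        for chunk in sorted(self.gate_bitmap):
            start = chunk * CHUNK_BLOCKS
            for block in range(start, start + CHUNK_BLOCKS):
                if block in self.primary:
                    self.secondary[block] = self.primary[block]
                    copied += 1
        self.gate_bitmap.clear()
        self._persist_bitmap()
        return copied


m = GatedMirror()
m.write(0, b"a")
m.write(5, b"b", secondary_up=False)  # mirroring fails for this write
m.write(9, b"c")
chunks_to_resync = set(m.gate_bitmap)
blocks_copied = m.resync()
```

In this sketch only the chunk containing the failed write needs to be copied during resynchronization, but every write that opens a gate pays for a bitmap persist before its data is written, which is the latency cost noted above.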
It is with respect to these considerations and others that the disclosure made herein is presented.