In data center environments, host-based caching is becoming an increasingly popular technique for optimizing virtual machine (VM) access to data residing on shared storage. Generally speaking, a host system that supports this technique takes data that its VMs access from a shared storage system (e.g., a networked storage array) and caches that data in a portion of a local, high-speed storage device (e.g., a solid-state disk (SSD)) known as a “host cache.” When the host system detects a VM read request for data that is already available in the host cache, the host system retrieves the data directly from the local storage device rather than performing a round trip to/from the shared storage system, thereby improving VM read performance.
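The read path described above can be sketched roughly as follows. This is a minimal illustration only: the `SharedStorage` and `HostCache` classes and their method names are hypothetical, and in-memory dictionaries stand in for the networked storage array and the local SSD.

```python
class SharedStorage:
    """Hypothetical stand-in for a networked storage array."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.reads = 0  # counts round trips to shared storage

    def read(self, block_id):
        self.reads += 1
        return self.blocks[block_id]


class HostCache:
    """Hypothetical host cache; a dict stands in for the local SSD."""

    def __init__(self, storage):
        self.storage = storage
        self.cache = {}

    def read(self, block_id):
        # Cache hit: serve the data locally, with no round trip.
        if block_id in self.cache:
            return self.cache[block_id]
        # Cache miss: fetch from shared storage, then keep a local copy.
        data = self.storage.read(block_id)
        self.cache[block_id] = data
        return data


storage = SharedStorage({"b1": b"alpha"})
host = HostCache(storage)
first = host.read("b1")   # miss: goes to shared storage
second = host.read("b1")  # hit: served from the host cache
```

In this sketch only the first read reaches the shared storage system; the second is satisfied from the host cache.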
One aspect of managing host-based caching involves determining how to handle VM write requests. With a “write-through” approach, the host system saves data for a write request synchronously in both the host cache of the local storage device and the shared storage system. Once the data is committed in both locations, the host system returns an acknowledgment to the originating VM indicating write completion. This approach has the benefit of maintaining the data in the host cache if a subsequent read request is made (and thus avoids network and storage disk latency when servicing the read request), but does not leverage the speed/locality of the local storage device to improve write performance.
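As a rough sketch of the write-through behavior just described, the following assumes a hypothetical `WriteThroughHost` class in which dictionaries stand in for the host cache and the shared storage system. The acknowledgment is returned only after both copies are saved.

```python
class WriteThroughHost:
    """Hypothetical host that applies a write-through approach."""

    def __init__(self):
        self.host_cache = {}      # stand-in for the local storage device
        self.shared_storage = {}  # stand-in for the shared storage system

    def write(self, block_id, data):
        # Save synchronously in both locations before acknowledging.
        self.host_cache[block_id] = data
        self.shared_storage[block_id] = data
        return "ack"

    def read(self, block_id):
        # A later read of the same block is served from the host cache.
        if block_id in self.host_cache:
            return self.host_cache[block_id]
        data = self.shared_storage[block_id]
        self.host_cache[block_id] = data
        return data


wt_host = WriteThroughHost()
wt_ack = wt_host.write("b1", "v1")
```

Because the write to `shared_storage` sits on the acknowledgment path, the write latency seen by the VM in this model includes the shared storage round trip.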
With a “write-back” approach, the host system initially saves data for a write request solely in the host cache of the local storage device, without performing a synchronous save in the shared storage system. Once the data is committed in the host cache (referred to in this context as a “write-back cache”), the host system immediately returns an acknowledgment to the originating VM. At a later point in time, the host system flushes the data (considered “dirty data”) from the write-back cache to the shared storage system, thereby completing the actual write process. The timing and manner in which this flushing occurs depend on the particular write-back policy that the host system uses (e.g., storage system-optimized, cache-optimized, in-order commit, etc.). Since the VM can proceed with its processing as soon as the host system completes its write to the cache, write-back caching offers significantly lower write latency than write-through caching. Thus, write-back caching is generally preferable to write-through caching for write-intensive or mixed read/write workloads.
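The write-back behavior can be sketched in a similar way. The `WriteBackHost` class below is hypothetical, and its `flush` method assumes one possible policy (an in-order commit that replays writes in their original order); other write-back policies would flush differently.

```python
from collections import deque


class WriteBackHost:
    """Hypothetical host that applies a write-back approach."""

    def __init__(self):
        self.write_back_cache = {}  # stand-in for the local storage device
        self.dirty_log = deque()    # unflushed writes, in original order
        self.shared_storage = {}    # stand-in for the shared storage system

    def write(self, block_id, data):
        # Commit only to the write-back cache, then acknowledge at once.
        self.write_back_cache[block_id] = data
        self.dirty_log.append((block_id, data))
        return "ack"

    def flush(self):
        # Assumed in-order commit policy: replay dirty writes oldest first.
        flushed = 0
        while self.dirty_log:
            block_id, data = self.dirty_log.popleft()
            self.shared_storage[block_id] = data
            flushed += 1
        return flushed


wb_host = WriteBackHost()
wb_ack = wb_host.write("b1", "v1")
storage_before_flush = dict(wb_host.shared_storage)
flushed_count = wb_host.flush()
```

Between the acknowledgment and the call to `flush`, the shared storage system in this model holds none of the written data.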
Unfortunately, write-back caching suffers from its own set of disadvantages and pitfalls. For instance, in some situations write-back caching can result in data loss, which is a condition where a portion of the data written by a VM (and cached in the write-back cache) is not propagated to the shared storage system. This can occur if, e.g., the host system crashes or otherwise fails before all of the dirty data in the write-back cache can be flushed. In these cases, it generally will not be possible to access the lost (i.e., unflushed) data until the host system is restarted. If the failure occurred at the host system's local storage device (and there is no redundant backup), the unflushed data can be lost forever.
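The data loss condition can be illustrated with a small simulation. The function name `run_until_crash` and its parameters are hypothetical; the sketch assumes that dirty data is flushed in write order and that the host system fails after a given number of flush operations.

```python
def run_until_crash(writes, flushed_before_crash):
    """Simulate a host failure partway through flushing dirty data.

    writes: list of (block_id, data) pairs acknowledged to the VM.
    flushed_before_crash: number of writes flushed before the failure.
    Returns the shared storage contents and the unflushed (lost) writes.
    """
    shared_storage = {}
    for block_id, data in writes[:flushed_before_crash]:
        shared_storage[block_id] = data
    # Everything still in the write-back cache never reaches shared storage.
    lost = writes[flushed_before_crash:]
    return shared_storage, lost


acknowledged = [("b1", "x"), ("b2", "y"), ("b3", "z")]
surviving, lost = run_until_crash(acknowledged, flushed_before_crash=1)
```

Here the VM received acknowledgments for all three writes, yet only the first is present on the shared storage system after the failure.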
In other (or the same) situations, write-back caching can result in data corruption, which is a condition where the data on the shared storage system does not correspond to a valid storage state at any time during VM execution (in other words, the stored data is “inconsistent”). Data corruption can occur for a number of different reasons. For example, data corruption can occur on a recurring, but temporary, basis if the host system uses a write-back policy that flushes dirty data to the shared storage system in an order that is different from the order in which the data was originally written by the originating VM(s). To illustrate this, assume that the host system receives sequential VM write requests to blocks b1, b2, b3, and b4, but flushes these blocks in four separate flush operations in the alternative order b2, b4, b3, and b1 (for, e.g., storage optimization purposes). In this example, the data on the shared storage system will be temporarily corrupt between the completion of the first and fourth flush operations, since the states of the data between these flush operations will reflect storage states that would never occur if the writes were flushed in the original order.
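The b1 through b4 example above can be checked mechanically. The following sketch assumes that each block is written exactly once and that a storage state is "valid" only if it contains a prefix of the original write order; the helper names are hypothetical.

```python
write_order = ["b1", "b2", "b3", "b4"]
flush_order = ["b2", "b4", "b3", "b1"]

# Valid storage states: the blocks present after each prefix of the
# original write sequence (including the empty starting state).
valid_states = [frozenset(write_order[:n]) for n in range(len(write_order) + 1)]


def states_after_each_flush(order):
    """Return the set of flushed blocks after each flush operation."""
    flushed = set()
    states = []
    for block_id in order:
        flushed.add(block_id)
        states.append(frozenset(flushed))
    return states


def consistency_report(order):
    """True for each flush that leaves shared storage in a valid state."""
    return [state in valid_states for state in states_after_each_flush(order)]


out_of_order = consistency_report(flush_order)  # [False, False, False, True]
in_order = consistency_report(write_order)      # [True, True, True, True]
```

Under these assumptions the reordered flush leaves the shared storage system in an invalid state after each of the first three flush operations, and consistency returns only once the fourth completes.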
Data corruption can also occur in a more permanent manner if another consumer attempts to access the data on the shared storage system without knowledge that unflushed data still exists in the write-back cache. For instance, a VM executing on another host system may attempt to write to the data, or the shared storage system itself may attempt to back up or replicate the data, before the write-back cache is fully flushed. This, in turn, may cause certain writes to be incorrectly overwritten, or the data to be captured in an inconsistent state. The latter situation is particularly problematic for high-end storage systems such as enterprise-class storage arrays, since a large part of their value proposition over lower-cost storage devices is their ability to independently perform storage management functions. If such an array cannot tell whether its stored data is consistent or inconsistent when a connected host system is performing write-back caching, the array cannot create backups, perform replications, or the like without potentially introducing data corruption, which negates a significant portion of its utility/value.
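One way such a conflict might unfold is sketched below. The scenario, the function name `late_flush_scenario`, and the labels for the two hosts are hypothetical; dictionaries stand in for the shared storage system and for the write-back cache of the first host.

```python
def late_flush_scenario():
    """Hypothetical timeline with two hosts and one shared block."""
    shared_storage = {"b1": "v0"}

    # Time 1: host A acknowledges a write to b1 but has not flushed it.
    host_a_write_back_cache = {"b1": "A1"}

    # Time 2: host B, unaware of the unflushed data, writes b1 directly.
    shared_storage["b1"] = "B1"

    # Time 3: the array takes a backup; host A's write is missing from it.
    snapshot = dict(shared_storage)

    # Time 4: host A finally flushes, overwriting the newer write from B.
    shared_storage.update(host_a_write_back_cache)

    return snapshot, shared_storage


snapshot, final_state = late_flush_scenario()
```

In this timeline the backup omits a write that was already acknowledged to a VM, and the later flush replaces the more recent write from host B with older data from host A.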