In data center environments, host-side flash caching (HFC) is becoming an increasingly popular technique for optimizing virtual machine (VM) access to data residing on shared storage. Generally speaking, an HFC-enabled host system caches, in a portion of a local flash storage device referred to as a “flash cache,” data that its VMs access from a shared storage device (e.g., a networked storage array). When the host system detects a VM read request for data that is already available in the flash cache, the host system retrieves the data directly from the local flash storage device rather than performing a roundtrip to/from the shared storage device, thereby improving VM read performance.
One aspect of managing a host-side flash cache involves determining how to handle VM write requests. Some HFC implementations employ a write-through approach in which the host system saves data for a write request synchronously in both the local flash storage device and the shared storage device. Once the data is committed in both locations, the host system returns an acknowledgement to the originating VM indicating write completion. This approach ensures data consistency and durability in the face of host system crashes, but does not leverage the speed/locality of the flash storage device to improve VM write performance.
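The write-through behavior described above can be sketched as follows. This is a minimal illustration only, not an actual HFC implementation: the class name `WriteThroughCache`, its methods, and the use of in-memory dictionaries to stand in for the local flash storage device and the shared storage device are all hypothetical.

```python
class WriteThroughCache:
    """Toy model of write-through host-side flash caching (hypothetical names)."""

    def __init__(self):
        self.flash = {}   # stands in for the local flash storage device
        self.shared = {}  # stands in for the shared storage device

    def write(self, block, data):
        # Save synchronously in both locations before acknowledging.
        self.flash[block] = data
        self.shared[block] = data
        return "ack"

    def read(self, block):
        # Serve from the flash cache when possible; otherwise fetch the
        # block from shared storage and cache it for later reads.
        if block in self.flash:
            return self.flash[block]
        data = self.shared[block]
        self.flash[block] = data
        return data


cache = WriteThroughCache()
cache.write(7, b"payload")
# The data is already in shared storage when the VM receives the acknowledgement.
print(cache.shared[7] == cache.flash[7])
```

Because the acknowledgement is returned only after both saves complete, a host crash after the acknowledgement cannot lose the write, but the VM always waits for the slower shared storage save.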
Other HFC implementations employ a write-back approach in which the host system initially saves data for a write request solely in the local flash storage device; the host system does not perform a synchronous save in the shared storage device. Once the data is committed in the local flash storage device, the host system immediately returns an acknowledgement to the originating VM. At a later point in time, the host system flushes the data (considered “dirty data”) from the local flash storage device to the shared storage device, thereby completing the actual write process. This approach offers significantly lower VM write latency than write-through flash caching, since the VM can proceed with its processing as soon as the host system completes its save in the local flash storage device. However, write-back flash caching may result in data corruption and/or data loss in the shared storage device if, e.g., the host system unexpectedly fails before all of the dirty data in the flash cache can be flushed.
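A minimal sketch of the write-back behavior, under the same kind of simplifying assumptions: the class `WriteBackCache`, the `dirty` set, and the `flush` and `unflushed_blocks` methods are hypothetical names, and dictionaries stand in for the two storage devices.

```python
class WriteBackCache:
    """Toy model of write-back host-side flash caching (hypothetical names)."""

    def __init__(self):
        self.flash = {}     # stands in for the local flash storage device
        self.shared = {}    # stands in for the shared storage device
        self.dirty = set()  # blocks saved in flash but not yet flushed

    def write(self, block, data):
        # Save only in local flash, mark the block dirty, acknowledge at once.
        self.flash[block] = data
        self.dirty.add(block)
        return "ack"

    def flush(self):
        # Later, copy all dirty blocks to shared storage.
        flushed = len(self.dirty)
        for block in sorted(self.dirty):
            self.shared[block] = self.flash[block]
        self.dirty.clear()
        return flushed

    def unflushed_blocks(self):
        # Blocks that shared storage would be missing if the host failed now.
        return sorted(self.dirty)


cache = WriteBackCache()
cache.write(1, b"a")
cache.write(2, b"b")
print(cache.unflushed_blocks())  # acknowledged writes not yet in shared storage
cache.flush()
print(cache.unflushed_blocks())
```

The window between `write` and `flush` is where the exposure lies: every block reported by `unflushed_blocks` has been acknowledged to the VM but exists only on the host.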
To address some of the issues with write-back flash caching, it is possible to implement a RAID (Redundant Array of Inexpensive Disks) mirroring scheme such as RAID 1 in software. Such a scheme replicates the contents of the flash cache on one host system (i.e., the “primary” host system) to another host system (i.e., the “secondary” host system). Since the secondary host system maintains a backup copy of the flash cache, the likelihood of data corruption and/or data loss when the primary host system fails can be reduced or eliminated.
However, implementing RAID for write-back flash cache replication has its own inefficiencies and disadvantages. In particular, RAID generally requires that the primary host system perform the following sequential steps upon receiving a VM write request: (1) the primary host system saves the data for the VM write request in its local flash storage device; (2) the primary host system transmits the data to the secondary host system for replication; (3) the secondary host system saves a copy of the data in its local flash storage device and transmits a write completion message to the primary host system; (4) the primary host system confirms that the data has been committed on the secondary host side by receiving the write completion message transmitted at step (3); and (5) the primary host system returns an acknowledgement to the VM that originated the write request. Thus, the total latency for processing the write request (from the perspective of the VM) is the sum of the I/O latency for saving the data in the primary host-side flash storage device, the I/O latency for saving the data in the secondary host-side flash storage device, and the network latency for at least one roundtrip between the primary host system and the secondary host system. Although steps (1) and (2) can be done in parallel if 2-phase commit is used, steps (2) through (4) typically take at least an order of magnitude longer than step (1) when flash storage devices are used, and 2-phase commit requires an extra network roundtrip.
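The latency sum for the sequential case can be illustrated with a small model. This is a sketch under stated assumptions: the function names are hypothetical, the microsecond figures are made-up placeholders rather than measurements, the network roundtrip is split into two halves purely for illustration, and the cost of the final acknowledgement to the VM in step (5) is treated as negligible.

```python
from typing import List, Tuple


def mirrored_write_timeline(
    primary_flash_us: int, network_roundtrip_us: int, secondary_flash_us: int
) -> List[Tuple[str, int]]:
    """Per-step durations for a sequential mirrored write (hypothetical model)."""
    # Assumption: the roundtrip is divided into an outbound and a return leg.
    outbound_us = network_roundtrip_us // 2
    return_us = network_roundtrip_us - outbound_us
    return [
        ("(1) save on primary flash", primary_flash_us),
        ("(2) transmit data to secondary", outbound_us),
        ("(3) save on secondary flash", secondary_flash_us),
        ("(3)-(4) completion message back to primary", return_us),
    ]


def total_latency_us(timeline: List[Tuple[str, int]]) -> int:
    # Steps run one after another, so the VM-visible latency is their sum.
    return sum(duration for _, duration in timeline)


# Placeholder values only: 100 us per flash save, 1000 us network roundtrip.
timeline = mirrored_write_timeline(100, 1000, 100)
for step, duration in timeline:
    print(f"{step}: {duration} us")
print(f"total: {total_latency_us(timeline)} us")
```

With these placeholder numbers, everything after step (1) accounts for 1100 of the 1200 microseconds, which is the kind of imbalance the paragraph above describes.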
Contrast the above with a write-through implementation in which the shared storage device is an enterprise-class storage array. In this case, the host system saves the data for a VM write request synchronously in both its local flash storage device and the enterprise-class storage array. As part of the latter step, the enterprise-class storage array will typically perform storage-level caching to eliminate any disk I/O latency (by, e.g., caching the write data in NVRAM or a storage-side solid-state disk (SSD)) and immediately return a write completion message to the host system. Upon receiving the write completion message from the storage array (and verifying completion of the save to the local flash storage device), the host system returns an acknowledgement to the originating VM. Thus, the total latency for processing the write request in this scenario is the sum of the I/O latency for saving the data in the local flash storage device and the network latency for one round trip between the host system and the storage array, which is potentially less than the total write latency in the “write-back+RAID” scenario. This means that combining write-back flash caching with software RAID mirroring may not provide any better performance (and in some situations, may provide worse performance) than write-through flash caching with enterprise-class storage.
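The comparison between the two scenarios can be made concrete with a short calculation. The following is only a sketch: the function names are hypothetical, the numbers are illustrative placeholders, and it assumes that the roundtrip to the storage array costs the same as the roundtrip between the two host systems and that the array's storage-level caching removes disk I/O latency entirely.

```python
def write_through_latency_us(local_flash_us: int, array_roundtrip_us: int) -> int:
    # Sum described in the text: local flash save plus one roundtrip
    # to the storage array.
    return local_flash_us + array_roundtrip_us


def mirrored_write_back_latency_us(
    primary_flash_us: int, secondary_flash_us: int, host_roundtrip_us: int
) -> int:
    # Sum for the "write-back+RAID" scenario: both flash saves plus one
    # roundtrip between the primary and secondary host systems.
    return primary_flash_us + secondary_flash_us + host_roundtrip_us


# Placeholder values only, not measurements.
FLASH_US = 100
ROUNDTRIP_US = 1000

write_through = write_through_latency_us(FLASH_US, ROUNDTRIP_US)
mirrored = mirrored_write_back_latency_us(FLASH_US, FLASH_US, ROUNDTRIP_US)
print(f"write-through: {write_through} us, write-back+RAID: {mirrored} us")
```

Under these assumptions the mirrored write-back path is slower by exactly the secondary flash save, since both paths pay one flash save and one network roundtrip. Different network or array characteristics would shift the result.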