The present invention relates generally to systems and methods for providing recovery of the contents of a data storage system after a failure or other potential source of data loss or corruption, and more specifically to systems and methods for providing write order fidelity for a storage system having data writers operating concurrently in multiple locations across a distributed data storage network.
In current storage networks, and in particular storage networks including geographically separated access nodes and storage resources interconnected by a network, write performance can be severely hampered as distance between nodes increases if writes must be replicated or transmitted synchronously. Additionally, minimizing required bandwidth between locations is highly desirable. Thus, methods of asynchronously transmitting data are used where the write is acknowledged before the data is transferred to nodes at remote sites.
It is also desirable that data access be localized, in part to improve access speed to blocks of data requested by host devices. Caching blocks at access nodes provides localization; however, the cached data must be kept coherent with respect to modifications at other access nodes that may be caching the same data.
Further, such complex storage applications need to withstand the failure of their backing storage systems, of local storage networks, of the network interconnecting nodes, and of the access nodes. Should a failure occur, asynchronous data transmission implies the potential for the loss of data held at the failed site. A consistent data image, from the perspective of the application, needs to be constructed from the surviving storage contents. An application must make some assumptions about which writes, or pieces of data to be written, to the storage system have survived the storage system failure; specifically, that for all writes acknowledged by the storage system as having been completed, the ordering of writes is maintained such that if a modification due to a write to a given block is lost, then all subsequent writes to blocks in the volume or related volumes of blocks are also lost.
The term write order fidelity (“WOF”) as used herein refers to a group of related properties, each of which describes the contents of a storage system after recovery from some type of failure; that is, properties that the application can assume about the contents of the storage system after the storage system recovers from a failure. WOF introduces a guarantee that, after recovery from a failure, surviving data will be consistent. Complex applications such as file systems or databases rely on this consistency property to recover after a failure of the storage system. Even simpler applications that are not explicitly written to recover from their own failure or the failure of backend storage should benefit from these post-failure guarantees.
WOF in a strict sense can be described as follows. An application generates a stream of writes {Wi|i≧1} to the storage system supporting that application. The underlying storage system exhibits strict write order fidelity if, after any failure of the storage system, the state of the storage system upon recovery reflects some prefix of the write sequence from the application. In other words, there exists some i≧0 such that all of writes {Wj|j≦i} have been committed to storage, and none of writes {Wj|j>i} have been committed to storage.
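The prefix property above can be illustrated with a minimal sketch. It is not drawn from the invention itself and rests on several assumptions made only for illustration: each write is modeled as a hypothetical (block address, data) pair, the storage state is modeled as a mapping from block address to contents, and the function name strict_wof_prefix is invented here. The function replays the write stream and reports the largest i for which the recovered state equals the state after writes W1 through Wi, or None if no such prefix exists.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical model: a write is (block address, data written to that block).
Write = Tuple[int, bytes]


def strict_wof_prefix(
    writes: List[Write],
    recovered: Dict[int, bytes],
    initial: Optional[Dict[int, bytes]] = None,
) -> Optional[int]:
    """Return the largest i such that `recovered` equals the state obtained by
    applying W1..Wi in order to `initial`, or None if no prefix matches.

    i == 0 means that no write survived, which is still a valid prefix.
    """
    state = dict(initial or {})
    match = 0 if state == recovered else None
    for i, (block, data) in enumerate(writes, start=1):
        state[block] = data          # apply Wi on top of W1..W(i-1)
        if state == recovered:
            match = i                # keep the longest matching prefix
    return match


# Illustrative write stream: W1 and W3 both target block 0.
stream: List[Write] = [(0, b"A"), (1, b"B"), (0, b"C")]

prefix_ok = strict_wof_prefix(stream, {0: b"A", 1: b"B"})   # W1, W2 survived
nothing_survived = strict_wof_prefix(stream, {})            # empty prefix
violation = strict_wof_prefix(stream, {0: b"C"})            # W3 without W2
```

In this sketch the last case is rejected because the recovered state reflects W3 while the earlier write W2 is missing, so no prefix of the stream produces it. Comparing whole states cannot distinguish two prefixes that happen to yield identical contents; the sketch simply reports the longest one.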
Strict WOF assumes that writes can be totally ordered, which is straightforward for a single controller or for a set of tightly-coupled storage controllers communicating through shared memory. The costs of generating such a total order on writes, however, become significant for controllers communicating via message passing, even within a site. The ordering costs become unacceptable as inter-controller latencies reach even a few milliseconds.
Traditionally, an “active-passive” approach is used for asynchronous transmission of data between sites such that only one writer, or host processor, has read-write access to a given volume of blocks, and other processors have only read access. An environment which is “totally-active”, where reads and writes to a given volume of blocks can occur randomly from any node, is highly desirable, but requires changes in the approach to WOF and in how WOF interacts with caching at all access nodes in the system.