1. Field of the Invention
The present invention relates to a shared-nothing distributed storage system that consists of a cluster of independent computer systems (nodes) connected by a network.
2. Description of the Related Art
Data stored in a storage system must be reliably maintained, and input/output (I/O) processing must not be interrupted while data is migrated within the storage system. For example, when a write operation occurs during migration, the storage system must reliably track the state of the data object.
One known method employs write marking, in which a region of data that is to be modified is marked, for example with a “dirty” flag, on a common/shared “scoreboard” before the data is written. This approach requires several steps: logging a request to write data, sending a message to each target storing the data, waiting for a write and a response, and then sending the actual write operation. These additional message exchanges increase network latency for write operations.
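The sequence of steps in write marking can be illustrated with a minimal sketch. The class names (Scoreboard, Target), the method names and the in-memory structures are hypothetical and stand in for networked components; the final step of clearing the mark is an assumption that is not stated in the description above. The sketch is not the implementation of any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Scoreboard:
    """Hypothetical shared scoreboard recording which regions are marked dirty."""
    dirty: set = field(default_factory=set)


@dataclass
class Target:
    """Hypothetical storage target holding one copy of each region."""
    name: str
    regions: dict = field(default_factory=dict)

    def notify(self, region):
        # Stand-in for the pre-write message; returns the target's response.
        return True

    def write(self, region, payload):
        self.regions[region] = payload


def marked_write(scoreboard, targets, region, payload):
    """Perform one write using write marking; return the ordered steps taken."""
    steps = []
    scoreboard.dirty.add(region)
    steps.append("log write request and mark region dirty")
    acks = [t.notify(region) for t in targets]
    steps.append("message each target")
    if not all(acks):
        raise RuntimeError("a target did not respond")
    steps.append("wait for responses")
    for t in targets:
        t.write(region, payload)
    steps.append("send actual write")
    # Assumption: the mark is cleared once every copy has been written.
    scoreboard.dirty.discard(region)
    steps.append("clear mark")
    return steps


board = Scoreboard()
replicas = [Target("a"), Target("b")]
steps = marked_write(board, replicas, region=7, payload=b"new data")
```

Each entry in the returned list corresponds to at least one exchange with the scoreboard or the targets, which is where the added write latency comes from.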
Another known storage method marks an entire high-level data storage area as dirty. However, such an approach is not viable for large amounts of data because it requires recovery of the entire aggregation of data. Known storage systems may also mark a file as dirty at the file system level to indicate a modification. However, marking at the file system level is too coarse-grained to be effective for very large data files, and the resulting recoveries take too long to complete. Still further, marking a chunk of data as dirty in a centralized database is also known in the art, such as in Parascale Inc.'s scale-out storage platform software.
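The effect of marking granularity on recovery cost can be shown with a short calculation. The function name and all sizes below (a 1 TiB file, 64 MiB chunks, a single interrupted write) are hypothetical values chosen for illustration and do not describe any particular system.

```python
KiB = 1024
MiB = 1024 * KiB
GiB = 1024 * MiB
TiB = 1024 * GiB


def bytes_to_recover(dirty_offsets, object_size, granule_size):
    """Return how many bytes must be recovered when dirty state is tracked
    per granule of `granule_size` bytes within an object of `object_size` bytes."""
    dirty_granules = {offset // granule_size for offset in dirty_offsets}
    total = 0
    for g in dirty_granules:
        start = g * granule_size
        # The last granule of an object may be shorter than granule_size.
        total += min(granule_size, object_size - start)
    return total


# One interrupted write at byte offset 5 MiB of a 1 TiB file.
file_level = bytes_to_recover([5 * MiB], TiB, granule_size=TiB)
chunk_level = bytes_to_recover([5 * MiB], TiB, granule_size=64 * MiB)
```

Under these assumed sizes, file-level marking forces recovery of the whole 1 TiB file, whereas per-chunk marking limits recovery to the single 64 MiB chunk containing the write.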
Similar functions in known storage systems further include the Fast Mirror Resync (FMR) feature of VERITAS Volume Manager (VxVM), which is described, for example, in U.S. Pat. Nos. 6,907,507, 6,910,111, 7,089,385 and 6,978,354, which are incorporated herein by reference. These patents use multi-column bitmaps, accumulator maps and per-mirror maps. With respect to recovery from I/O errors, storage systems of the prior art (volume managers and multi-copy file systems) require either a central manager that performs recovery by directly reading or writing data, or a coordinator that manages the recovery process. A drawback of such a configuration is that centrally managed recoveries stall when the coordinator fails, which further complicates the recovery process. Additionally, to account for the possibility of coordinator failure, large amounts of metadata must be reliably maintained in shared storage.
In cases of recovery of partially written data, the prior art includes the mirror reconnection and mirror “resilvering” approaches taken by many volume manager implementations, which use a central database or a volume-level bitmap of some sort. Other implementations use a central recovery manager that performs direct reads and writes from one central location (all volume managers) or have a central coordinator to drive the recovery (as in Parascale Inc.'s scale-out storage platform software, for example).
In cases involving the migration of data, where a node or a disk thereof is added to or removed from a storage system, the prior art includes the CEPH file system relayout feature, which is based on reliable hashes and map generations. Both the PICASSO and CEPH systems use a placement algorithm commonly known as the “CRUSH” algorithm to deterministically calculate the proper placement of data chunks based on version information of the storage configuration across an entire storage cluster. See Sage A. Weil; Scott A. Brandt; Ethan L. Miller; Carlos Maltzahn; “CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data,” Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, p. 31, 2006, which is incorporated herein by reference. In the CEPH system, relayout is performed by a central metadata engine. Further, in the Parascale system, data relayout is driven by a central database and placement is done in an ad-hoc, per-chunk manner. When relayout in the Parascale system is interrupted, the data layout is left in a transitional but consistent state, and when the relayout process resumes, data placement is recalculated.
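The idea of deterministic, version-based placement can be illustrated with a minimal sketch. The sketch does not implement the CRUSH algorithm; it substitutes a simpler technique, highest-random-weight (rendezvous) hashing, to show how any node can compute the same placement from the same configuration version without consulting a central table. The node names, the map versions and the function names are hypothetical.

```python
import hashlib


def place(chunk_id, nodes, replicas):
    """Return the nodes that should hold `chunk_id`, highest hash score first."""
    def score(node):
        return hashlib.sha256(f"{chunk_id}/{node}".encode()).digest()
    return sorted(nodes, key=score, reverse=True)[:replicas]


# Hypothetical configuration history: map version -> nodes in the cluster.
cluster_maps = {
    1: ["node-a", "node-b", "node-c"],
    2: ["node-a", "node-b", "node-c", "node-d"],
}


def chunks_to_move(chunk_ids, old_version, new_version, replicas=2):
    """Return {chunk_id: (old_nodes, new_nodes)} for chunks whose placement
    differs between two versions of the cluster map."""
    moves = {}
    for cid in chunk_ids:
        old = place(cid, cluster_maps[old_version], replicas)
        new = place(cid, cluster_maps[new_version], replicas)
        if old != new:
            moves[cid] = (old, new)
    return moves


chunk_ids = [f"chunk-{i}" for i in range(100)]
moves = chunks_to_move(chunk_ids, old_version=1, new_version=2)
```

Because the placement depends only on the chunk identifier and the node list for a given map version, every node that holds the same map computes the same result. With this particular hashing scheme, a chunk is relocated after a node is added only if the new node is among its selected locations.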
In cases where a policy change is made to data redundancy, data movement is centrally administered and is performed from a central management node. Parascale Inc.'s system, for example, administers migration centrally: locations determined to be new data locations are required to “pull” data from existing storage locations to satisfy the change in redundancy.