1. Technical Field
The present invention relates generally to techniques for highly available, reliable, and persistent data storage in a distributed computer network.
2. Description of the Related Art
A need has developed for the archival storage of “fixed content” in a highly available, reliable and persistent manner that replaces or supplements traditional tape and optical storage solutions. The term “fixed content” typically refers to any type of digital information that is expected to be retained without change for reference or other purposes. Examples of such fixed content include, among many others, e-mail, documents, diagnostic images, check images, voice recordings, film and video, and the like. The Redundant Array of Independent Nodes (RAIN) storage approach has emerged as the architecture of choice for creating large online archives for the storage of such fixed content information assets. By allowing nodes to join and exit from a cluster as needed, RAIN architectures insulate a storage cluster from the failure of one or more nodes. By replicating data on multiple nodes, RAIN-type archives can automatically compensate for node failure or removal. RAIN systems are typically delivered as hardware appliances designed from identical components within a closed system.
Known prior art archival storage systems typically store metadata for each file as well as its content. Metadata is a component of data that describes the data. Metadata typically describes the content, quality, condition, and other characteristics of the actual data being stored in the system. In the context of distributed storage, metadata about a file includes, for example, the name of the file, where pieces of the file are stored, the file's creation date, retention data, and the like. While reliable file storage is necessary to achieve storage system reliability and availability of files, the integrity of metadata also is an important part of the system. In the prior art, however, it has not been possible to distribute metadata across a distributed system of potentially unreliable nodes. The present invention addresses this need in the art.
An improved archival storage system is described in U.S. Pat. Nos. 7,155,466, 7,657,581 and 7,657,586, which are commonly owned. This system provides a distributed object store across a distributed set of nodes. According to U.S. Pat. No. 7,657,581, an archival storage cluster of symmetric nodes includes a “metadata management” system that organizes and provides access to metadata, preferably in the form of metadata objects. Each metadata object has a unique name, and metadata objects are organized into regions. In one embodiment, a region is selected by hashing one or more object attributes (e.g., the object's name) and extracting a given number of bits of the resulting hash value. The number of bits may be controlled by a configuration parameter. In this scheme, each region is stored redundantly, and a region comprises a set of region copies. In particular, there is one authoritative copy of the region, and zero or more backup copies. As described, the number of copies may be controlled by a configuration parameter, sometimes referred to as a number of metadata protection levels (“MDPL”). Thus, for example, in one embodiment of this scheme, a region comprises an authoritative region copy and its MDPL-1 (that is, MDPL minus one) backup copies. Region copies are distributed across the nodes of the cluster so as to balance the number of authoritative region copies per node, as well as the number of total region copies per node.
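By way of illustration only, the region selection step described above (hash an object attribute, then extract a configured number of bits) might be sketched as follows. The function name `region_for`, the parameter name `region_bits`, the use of SHA-256, and the choice of the low-order bits are all assumptions made for this sketch; the text above does not specify the hash function or which bits are extracted.

```python
import hashlib


def region_for(name: str, region_bits: int) -> int:
    """Map an object name to a region number (hypothetical helper).

    Assumptions for this sketch: SHA-256 as the hash function and the
    low-order bits of the digest as the extracted bits.
    """
    if not 0 <= region_bits <= 256:
        raise ValueError("region_bits must be between 0 and 256")
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big")
    # Keep only the lowest `region_bits` bits of the hash value.
    return value & ((1 << region_bits) - 1)


if __name__ == "__main__":
    # With 6 bits there are 2**6 = 64 possible regions.
    print(region_for("patient-scan-0001.dcm", 6))
```

Because the region number depends only on the name and the configured bit count, any node can compute it locally without consulting another node.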
Another aspect of the above-described metadata management system is a region “map” that identifies the node responsible for each copy of each region. The region map is accessible by the processes that comprise the metadata management system. A region in the region map represents a set of hash values, and the set of all regions covers all possible hash values. Each region is identified by a number, which is derived by extracting a number of bits from a hash value. A namespace partitioning scheme is used to define the regions in the region map and to control ownership of a given region. This partitioning scheme is implemented in a database. In the scheme, a region copy has one of three states: “authoritative,” “backup” and “incomplete.” If the region copy is authoritative, all requests to the region go to this copy, and there is one authoritative copy for each region. If the region copy is a backup (or is incomplete), the copy receives update requests (from an authoritative region manager process). A region copy is incomplete if metadata is being loaded but the copy is not yet synchronized (typically, with respect to the authoritative region copy). An incomplete region copy is not eligible for promotion to another state until synchronization is complete, at which point the copy becomes a backup copy.
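The region map and the three copy states described above can be pictured with a small in-memory sketch. All names here (`CopyState`, `RegionCopy`, `build_region_map`, `authoritative_node`, `mark_synchronized`) are hypothetical, and the round-robin placement is only one simple way to balance copies across nodes; the referenced system implements its partitioning scheme in a database, which this sketch does not attempt to model.

```python
from dataclasses import dataclass
from enum import Enum


class CopyState(Enum):
    AUTHORITATIVE = "authoritative"
    BACKUP = "backup"
    INCOMPLETE = "incomplete"


@dataclass
class RegionCopy:
    node: str
    state: CopyState


def build_region_map(nodes, region_bits, mdpl):
    """Assign MDPL copies of every region to nodes (assumed round-robin)."""
    if not 1 <= mdpl <= len(nodes):
        raise ValueError("mdpl must be between 1 and the number of nodes")
    region_map = {}
    for region in range(1 << region_bits):
        copies = []
        for i in range(mdpl):
            node = nodes[(region + i) % len(nodes)]
            # The first copy is authoritative; the other MDPL-1 are backups.
            state = CopyState.AUTHORITATIVE if i == 0 else CopyState.BACKUP
            copies.append(RegionCopy(node, state))
        region_map[region] = copies
    return region_map


def authoritative_node(region_map, region):
    """Return the node that all requests for `region` should go to."""
    owners = [c.node for c in region_map[region]
              if c.state is CopyState.AUTHORITATIVE]
    if len(owners) != 1:
        raise RuntimeError("expected one authoritative copy per region")
    return owners[0]


def mark_synchronized(copy):
    """An incomplete copy becomes a backup once synchronization completes."""
    if copy.state is CopyState.INCOMPLETE:
        copy.state = CopyState.BACKUP


if __name__ == "__main__":
    rmap = build_region_map(["n0", "n1", "n2", "n3"], region_bits=2, mdpl=2)
    for region, copies in rmap.items():
        print(region, [(c.node, c.state.value) for c in copies])
```

With four nodes, four regions, and an MDPL of two, this placement gives each node one authoritative copy and two copies in total, which is the kind of balance the scheme aims for.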
Another aspect of the above-described metadata management scheme is that the backup region copy is kept synchronized with the authoritative region copy. Synchronization is guaranteed by enforcing a protocol or “contract” between an authoritative region copy and its MDPL-1 backup copies when an update request is being processed. For example, after committing an update locally, the authoritative region manager process issues an update request to each of its MDPL-1 backup copies (which, typically, are located on other nodes). Upon receipt of the update request, in the usual course, a region manager process associated with a given backup copy issues, or attempts to issue, an acknowledgement. The authoritative region manager process waits for acknowledgements from all of the MDPL-1 backup copies before providing an indication that the update has been successful. There are several ways, however, in which this update process can fail: the authoritative region manager process (while waiting for the acknowledgement) may encounter an exception indicating that the backup region manager process has died; the backup region manager process may fail to process the update request locally even though it has issued the acknowledgement; the backup region manager process, while issuing the acknowledgement, may encounter an exception indicating that the authoritative region manager process has died; and so on. If the backup region manager process cannot process the update, it removes itself from service. If either the backup region manager process or the authoritative region manager process dies, a new region map is issued. By ensuring synchronization in this manner, each backup copy is a “hot standby” for the authoritative copy. Such a backup copy is eligible for promotion to being the authoritative copy, which may be needed if the authoritative region copy is lost, or because load balancing requirements dictate that the current authoritative region copy should be demoted (and some backup region copy promoted).
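The update “contract” described above can be sketched in a single process as follows. This is a simplified illustration, not the referenced system's implementation: the class and attribute names (`BackupCopy`, `AuthoritativeCopy`, `BackupUnavailable`, `map_reissue_needed`) are hypothetical, network calls are replaced by direct method calls, and only one failure mode (a backup that cannot process the update) is modeled. What the caller sees when a backup fails is not specified in the text above; this sketch simply reports the update as unsuccessful and flags that a new region map is needed.

```python
class BackupUnavailable(Exception):
    """Raised when a backup copy cannot process an update (assumed signal)."""


class BackupCopy:
    def __init__(self, node, healthy=True):
        self.node = node
        self.healthy = healthy
        self.in_service = True
        self.store = {}

    def apply_update(self, key, value):
        if not self.healthy:
            # A backup that cannot process the update removes itself
            # from service.
            self.in_service = False
            raise BackupUnavailable(self.node)
        self.store[key] = value
        return True  # stands in for the acknowledgement


class AuthoritativeCopy:
    def __init__(self, backups):
        self.store = {}
        self.backups = list(backups)  # the MDPL-1 backup copies
        self.map_reissue_needed = False

    def update(self, key, value):
        # Commit locally first, then forward to every backup copy.
        self.store[key] = value
        acks = 0
        for backup in self.backups:
            try:
                if backup.apply_update(key, value):
                    acks += 1
            except BackupUnavailable:
                # A failed backup means a new region map must be issued.
                self.map_reissue_needed = True
        # Report success only when every backup has acknowledged.
        return acks == len(self.backups)


if __name__ == "__main__":
    auth = AuthoritativeCopy([BackupCopy("n1"), BackupCopy("n2")])
    print(auth.update("obj-1", {"retention": "7y"}))
```

Waiting for every acknowledgement before reporting success is what lets each in-service backup act as a hot standby: any update the caller has seen succeed is already present on all of them.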
The above-described design is advantageous in that it ensures high availability of the metadata even upon a number of simultaneous node failures. When there is a node failure, one or more regions are lost, which the system then needs to recover. The recovery process involves creating region copies for the lost regions. The repair consumes hardware and database resources and thus has a performance cost. On large clusters, a repair can take many hours, and during that time the cluster may be below MDPL.