1. Technical Field
The present invention relates generally to techniques for highly available, reliable, and persistent data storage in a distributed computer network.
2. Description of the Related Art
A need has developed for the archival storage of “fixed content” in a highly available, reliable and persistent manner that replaces or supplements traditional tape and optical storage solutions. The term “fixed content” typically refers to any type of digital information that is expected to be retained without change for reference or other purposes. Examples of such fixed content include, among many others, e-mail, documents, diagnostic images, check images, voice recordings, film and video, and the like. The traditional Redundant Array of Independent Nodes (RAIN) storage approach has emerged as the architecture of choice for creating large online archives for the storage of such fixed content information assets. By allowing nodes to join and exit from a cluster as needed, RAIN architectures insulate a storage cluster from the failure of one or more nodes. By replicating data on multiple nodes, RAIN-type archives can automatically compensate for node failure or removal. Typically, RAIN systems are largely delivered as hardware appliances designed from identical components within a closed system.
FIG. 1 illustrates one such scalable disk-based archival storage management system. The nodes may comprise different hardware and thus may be considered “heterogeneous.” A node typically has access to one or more storage disks, which may be actual physical storage disks, or virtual storage disks, as in a storage area network (SAN). The archive cluster application (and, optionally, the underlying operating system on which that application executes) that is supported on each node may be the same or substantially the same. The software stack (which may include the operating system) on each node is symmetric, whereas the hardware may be heterogeneous. Using the system, as illustrated in FIG. 1, enterprises can create permanent storage for many different types of fixed content information such as documents, e-mail, satellite images, diagnostic images, check images, voice recordings, video, and the like, among others. These types are merely illustrative, of course. High levels of reliability are achieved by replicating data on independent servers, or so-called storage nodes. Preferably, each node is symmetric with its peers. Thus, because preferably any given node can perform all functions, the failure of any one node has little impact on the archive's availability.
As described in commonly-owned U.S. Pat. No. 7,155,466, it is known in a RAIN-based archival system to incorporate a distributed software application executed on each node that captures, preserves, manages, and retrieves digital assets. FIG. 2 illustrates one such system. A physical boundary of an individual archive is referred to as a cluster. Typically, a cluster is not a single device, but rather a collection of devices. Devices may be homogeneous or heterogeneous. A typical device is a computer or machine running an operating system such as Linux. Clusters of Linux-based systems hosted on commodity hardware provide an archive that can be scaled from a few storage node servers to many nodes that store thousands of terabytes of data. This architecture ensures that storage capacity can always keep pace with an organization's increasing archive requirements.
In storage systems such as described above, data typically is distributed across the cluster randomly so that the archive is always protected from device failure. If a disk or node fails, the cluster automatically fails over to other nodes in the cluster that maintain replicas of the same data. While this approach works well from a data protection standpoint, a calculated mean time to data loss (MTDL) for the cluster may not be as high as desired. In particular, MTDL typically represents a calculated amount of time before the archive will lose data. In a digital archive, any data loss is undesirable, but due to the nature of hardware and software components, there is always a possibility (however remote) of such an occurrence. Because of the random distribution of objects and their copies within an archive cluster. MTDL may end being lower than required because, for example, a needed copy of an object may be unavailable it a given disk (on which a mirror copy is stored) within a given node fails unexpectedly.