Computer-based structured storage systems, such as computer file systems and database systems, have been remarkably successful at providing users with quick and facile access to enormous amounts of data. The importance of these structured storage systems in today's commerce is difficult to exaggerate. For example, structured storage systems have allowed businesses to generate and maintain enormous stores of persistent data that they can modify and update over the course of years. For many companies, this persistent data is a valuable capital asset that is employed each day to perform the company's core operations. The data can be, for example, computer files (e.g., source code, word processing documents, etc.), database records and information (e.g., information on employees, customers, and/or products), and/or Web pages. Such data must be "highly available," i.e., the data must be available despite system hardware or software failures, because it is often used for day-to-day decision making processes.
Previous efforts to provide high availability or fault tolerance have included both hardware techniques, such as providing redundant systems, and software approaches, such as redundant array of independent disks (RAID) technology and clustering. Each one of these efforts has its own unique drawbacks.
Redundant systems are typified by double or triple redundancy. These types of systems provide more than one complete machine to accomplish the task of one machine. Each machine performs the same operations in parallel. If one machine fails or encounters an error, the additional machines provide the correct result. Such systems, while highly tolerant of system faults, are extremely expensive. In effect, multiple networks of machines must be provided to implement each network.
A similar fault-tolerant approach for storage is RAID. RAID technology may be implemented as disk mirroring (so-called RAID 1) or disk striping with parity (so-called RAID 5). Disk mirroring provides highly fault tolerant storage but is expensive, since multiple disks, usually two, must be provided to store the data of one disk. Disk striping with parity has poor performance for write-intensive applications, since each time data is written to the array of disks a parity block must be calculated. Disk striping with parity provides rigid N+1 redundancy and suffers additional performance degradation after the first error, since the missing block (or blocks) must be recalculated each time a read operation is performed. Finally, such rigid N+1 redundancy schemes have no way of "healing" themselves; that is, after one error the system is no longer N+1 redundant.
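The parity costs described above can be illustrated with a minimal sketch. It assumes byte-wise XOR parity over equal-length blocks, which is a common way to realize N+1 striping; the function names (`xor_blocks`, `write_stripe`, `read_block`) are hypothetical and are not taken from any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_stripe(data_blocks):
    """Store N data blocks plus one parity block (N+1 redundancy).

    The parity block is computed on every write, which is the write
    overhead attributed to striping with parity.
    """
    return list(data_blocks) + [xor_blocks(data_blocks)]


def read_block(stripe, index, failed=None):
    """Read one block; if its disk has failed, rebuild it from the survivors.

    The rebuild touches every surviving block in the stripe, which is the
    degraded-read penalty incurred after the first error.
    """
    if index != failed:
        return stripe[index]
    survivors = [block for i, block in enumerate(stripe) if i != failed]
    return xor_blocks(survivors)


stripe = write_stripe([b"abcd", b"efgh", b"ijkl"])
rebuilt = read_block(stripe, 1, failed=1)  # reconstructs b"efgh"
```

In this sketch a single lost block can always be rebuilt, but a second loss in the same stripe cannot, which corresponds to the observation that the scheme does not restore its own N+1 redundancy after an error.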
Other software approaches to improve the reliability and operation of centralized structured storage network systems have generally involved: (1) static mapping of the data to one or more servers and associated disks (sometimes referred to as "shared nothing" clustering); (2) storing the data in a shared data repository, such as a shared disk (sometimes referred to as "shared everything" clustering); and (3) database replication.
Systems using the first method distribute portions of the data store across a plurality of servers and associated disks. Each of the servers maintains a portion of the structured store of data, as well as optionally maintaining an associated portion of a directory structure that describes the portions of the data stored within that particular server. These systems guard against a loss of data by distributing the storage of data statically across a plurality of servers such that the failure of any one server will result in a loss of only a portion of the overall data. However, although known clustered database technology can provide more fault tolerant operation in that it guards against data loss and provides support for dual-path disks, the known systems still rely on static allocation of the data across various servers. Since data is not dynamically allocated between servers: (1) system resources are not allocated based on system usage, which results in underutilization of those resources; (2) scalable performance is limited because new servers must be provided whenever the dataset grows or whenever one particular server cannot service requests made to its portion of the dataset; and (3) such static allocation still requires at least one of the servers storing the information to survive in order to preserve the data. Also, failure of one server requires a second server to serve the data previously served by the down server, which degrades system performance.
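A rough sketch of static allocation may help make these drawbacks concrete. It assumes, purely for illustration, that each key is assigned to a server by a fixed hash; the server names and the functions `owner` and `reachable_keys` are hypothetical, and actual shared-nothing systems may partition their data differently.

```python
import zlib

# Hypothetical fixed set of servers; the mapping never changes at run time.
SERVERS = ["server-a", "server-b", "server-c"]


def owner(key, servers=SERVERS):
    """Statically map a key to exactly one server, independent of load."""
    return servers[zlib.crc32(key.encode("utf-8")) % len(servers)]


def reachable_keys(keys, failed_server):
    """Return the keys still served when one server is down.

    Every key owned by the failed server is unavailable until another
    server takes over that portion of the data.
    """
    return [key for key in keys if owner(key) != failed_server]


keys = ["employee:%d" % i for i in range(12)]
still_served = reachable_keys(keys, "server-b")
```

Because `owner` ignores how often a key is requested, a heavily used portion of the data stays on one server while the others may sit idle, which is the underutilization noted above.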
Systems using the second method store the data in a shared data repository, such as a shared disk. The shared disks may be shared between a subset of system nodes or between all nodes of the system. Each node in the system continually updates the central data repository with its portion of the structured store. For example, in a database system, each node exports tables it is currently using to the data store. While this method exports the problems of load balancing to the central data repository, it suffers from two main drawbacks. First, throughput is lowered because of increased overhead associated with ensuring coherency of the centralized data store. Second, locking is inefficient because entire pages are locked when a node accesses any portion of a page. As a result, nodes may experience contention for memory even when no true conflict exists.
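The page-locking drawback can be sketched as follows. The example assumes a simple exclusive lock per page and a fixed number of records per page; `PAGE_SIZE`, `PageLockTable`, and the node names are hypothetical and serve only to illustrate contention without a true conflict.

```python
PAGE_SIZE = 4  # hypothetical number of records stored on each page


class PageLockTable:
    """Exclusive locks held at page granularity rather than per record."""

    def __init__(self):
        self.holders = {}  # page number -> node currently holding the lock

    def acquire(self, node, record_id):
        """Lock the whole page containing record_id; return True on success."""
        page = record_id // PAGE_SIZE
        holder = self.holders.get(page)
        if holder is not None and holder != node:
            return False  # blocked, even if a different record is wanted
        self.holders[page] = node
        return True

    def release(self, node, record_id):
        """Release the page lock if this node holds it."""
        page = record_id // PAGE_SIZE
        if self.holders.get(page) == node:
            del self.holders[page]


locks = PageLockTable()
locks.acquire("node-1", 0)            # node-1 locks page 0 to use record 0
blocked = not locks.acquire("node-2", 1)  # record 1 is also on page 0
```

Here `node-2` is refused access to record 1 although `node-1` is only using record 0, so the two nodes contend for the page even though their accesses do not overlap.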
Similar to disk mirroring, but at a higher level, are techniques based on database replication. These systems may provide replication of the data stores or of the transactions performed on the data stores. Accordingly, these systems go further in guarding against the loss of data by providing static redundancy within the structured storage system. However, such systems suffer from the same drawbacks as other static techniques described above. Additionally, so-called "transaction-safe" replication techniques suffer from scalability problems as the number of tables served increases.