1. Field of the Invention
The present invention relates to a system for generating parity files in a distributed data structure and, in particular, for providing a highly scalable and highly available data structure.
2. Description of the Related Art
The trend in data and file storage is toward distributed data structure systems. In network systems that include client and server machines, files may be distributed over many machines in the system. In this way, a pool of computer sites attached to the network provides added power and resources. One recent development is to use the RAM of computers throughout the network system instead of local hard disk drive space. A network system comprising numerous sites (processors), each with megabytes (MB) of RAM, can provide a distributed RAM space capable of storing files gigabytes (GB) in size. In network systems, client and server machines function as nodes of the network. Each server (or client) provides storage space within its local RAM or local hard disk drive space to store the objects that make up a file. The storage space each machine provides for such a network file is called a "bucket." The number of interconnected servers in a system can range from 10 to 100,000. The file consists of records, i.e., objects, that are identified by primary keys c. A record R with key c is denoted R(c), where c refers to the key value.
One problem with such distributed data structure systems is determining the optimal number of sites over which to store the distributed file. Using too many sites may degrade system performance. Moreover, the optimal number of sites is often unknown in advance and can change as the file size changes. The goals of a distributed data structure include: (1) flexibility, such that new servers can be added to the distributed system as a file expands; (2) no centralized site that must process and manage all computations; and (3) no primitive operation, e.g., search, insertion, or split, requiring updates to multiple nodes in the system. A distributed system that satisfies these three constraints is known as a Scalable Distributed Data Structure (SDDS).
One such prior art SDDS is LH*, a distributed form of Linear Hashing, described in detail in "LH*--A Scalable, Distributed Data Structure," by Witold Litwin, Marie-Anne Neimat, and Donovan A. Schneider, published in ACM Transactions on Database Systems, Vol. 21, No. 4, December 1996, pp. 480-525, which is incorporated herein by reference in its entirety. An LH* file is stored across multiple buckets comprising local disk drive space or RAM. LH* uses a hashing function that assigns a bucket address to each key c added to the file by applying the function to the value of c.
Each bucket has a predetermined record limit b. When a bucket reaches this limit, a new bucket is added to the file, and a split moves about half of the records of an existing bucket to the new bucket at a new server (site). The bucket to be split is identified by a pointer n, referred to as the split pointer, and is not necessarily the bucket that overflowed. Buckets are split sequentially, and the split pointer advances to the next bucket after each split. The file level q indicates the number of complete splitting rounds that have occurred. At the start of level q the file has $2^q$ buckets, numbered 0 through $2^q - 1$. For instance, when there is only one bucket, q=0. When bucket 0 is split, a new bucket is created and q is incremented by one, in this case to q=1. The pointer n is then set back to bucket 0. Bucket 0 is split, then bucket 1 is split. When bucket $2^q - 1$, which in this case is bucket 1, is split, there are four buckets (0, 1, 2, 3) and q is incremented to two. This process of cycling through the current buckets to split them is described in "LH*--A Scalable, Distributed Data Structure," by Witold Litwin et al., incorporated by reference above.
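Assuming, as in the example above, that the file starts with a single bucket 0, the pair (q, n) fixes the bucket count N and the address of the bucket created by each split. The relations below are a sketch consistent with that example, not a quotation of the cited paper:

```latex
\begin{aligned}
N &= 2^{q} + n, \qquad 0 \le n < 2^{q},\\
\text{split of bucket } n &\;\Rightarrow\; \text{new bucket } n + 2^{q},\quad n \leftarrow n + 1,\\
n = 2^{q} &\;\Rightarrow\; n \leftarrow 0,\quad q \leftarrow q + 1.
\end{aligned}
```

Applied to the example, the states (q, n, N) run (0, 0, 1), (1, 0, 2), (1, 1, 3), (2, 0, 4), creating buckets 1, 2, and 3 in turn.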
When a bucket overflows, the client or server maintaining the bucket reports the overflow to a dedicated node called a coordinator. The coordinator applies a bucket load control policy to determine whether a split should occur.
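A coordinator's split bookkeeping might look like the sketch below. The record limit `capacity`, the `load_threshold` parameter, and the rule of splitting only when an estimated load factor exceeds that threshold are hypothetical stand-ins for the unspecified bucket load control policy; only the split-pointer and file-level updates follow the sequence described above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Coordinator:
    """Hypothetical LH*-style coordinator holding the file level q and split pointer n."""
    q: int = 0                    # file level
    n: int = 0                    # split pointer: next bucket to split
    capacity: int = 4             # record limit b per bucket (assumed value)
    load_threshold: float = 0.8   # hypothetical load-control parameter

    @property
    def bucket_count(self) -> int:
        return 2 ** self.q + self.n

    def should_split(self, estimated_records: int) -> bool:
        # Hypothetical policy: split only if the estimated load factor exceeds the threshold.
        load = estimated_records / (self.bucket_count * self.capacity)
        return load > self.load_threshold

    def on_overflow(self, estimated_records: int) -> Optional[Tuple[int, int]]:
        """Handle an overflow report; return (bucket split, bucket created) or None."""
        if not self.should_split(estimated_records):
            return None
        split, created = self.n, self.n + 2 ** self.q
        self.n += 1
        if self.n == 2 ** self.q:  # round finished: reset the pointer, raise the level
            self.n, self.q = 0, self.q + 1
        return split, created
```

With `load_threshold=0.0`, every report with a positive record estimate triggers a split, so three reports yield the splits (0, 1), (0, 2), (1, 3) and leave q = 2, n = 0, matching the four-bucket example.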
When a record with key c is added to the file F, a directoryless pair of hashing functions $h_q$ and $h_{q+1}$, where q = 0, 1, 2, ..., is applied to the key c to determine the address of the bucket where the record will be maintained. The function $h_q$ hashes the key c of the data record, which is typically the primary key. The value of the split pointer n determines which hashing function, $h_q$ or $h_{q+1}$, is applied to the key c to compute the bucket address for the record. If the coordinator determines that a split should occur, the coordinator signals a client in the system to perform the split calculations. The client uses the hash functions to hash a key c into a bucket address.
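The choice between $h_q$ and $h_{q+1}$ can be sketched as follows, assuming the common linear-hashing family $h_q(c) = c \bmod 2^q$ for a file that starts with one bucket; the function names `h` and `bucket_address` are hypothetical.

```python
def h(level: int, key: int) -> int:
    """Assumed linear-hashing family h_level(c) = c mod 2**level (file starts with one bucket)."""
    return key % (2 ** level)


def bucket_address(key: int, q: int, n: int) -> int:
    """Address of key c for file level q and split pointer n."""
    a = h(q, key)
    if a < n:               # bucket a was already split in this round,
        a = h(q + 1, key)   # so the finer function h_{q+1} decides
    return a
```

For a file with q = 1 and n = 1 (buckets 0, 1, 2), key 6 maps to bucket 2 and key 7 to bucket 1. A client still holding the older image q' = 1, n' = 0 computes bucket 0 for key 6, which is how an out-of-date image leads to a misaddressed record.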
Traditional LH approaches assume that all address computations use the correct values of q and n. This assumption cannot be satisfied when multiple clients are used unless a master site updates every client with the correct values of q and n. LH* algorithms do not require that all clients have a consistent view of q and n. In LH*, each client has its own image of q and n, denoted q' and n'. These values are updated only after the client performs an operation. Because each client may hold a different image, different clients may calculate different addresses for the same record c. In LH*, the client sends the record c to the server whose address results from applying the hashing functions with the client's values of q' and n'.
The server that receives the record c from the client, i.e., the server where the addressed bucket resides, applies the hashing functions using its own values of q' and n' to determine whether the key c belongs in its bucket. If the client sent the key c to the correct bucket, the record is stored there. If not, the server calculates a new address and forwards the key c to that server (bucket). The recipient of the key checks the address again and may forward the key c to yet another bucket if the initial target server calculated the wrong address using its values of n' and q'. Under current LH* algorithms, a record reaches its correct bucket after at most two forwardings. When the correct server receives the record, it sends an image adjustment message (IAM) to the client, and to any intermediary servers that used incorrect values of n and q, allowing the client (or server) to bring its values of n' and q' closer to the actual values of q and n.
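The forwarding and image-adjustment steps can be simulated with the sketch below. It is modeled on the client, server, and IAM algorithms of the LH* paper cited above, under assumptions that are not stated in this text: $h_j(c) = c \bmod 2^j$, each server decides using the level j of its own bucket, and the IAM carries the address and level of the first bucket the client contacted. All function names are hypothetical.

```python
def h(level: int, key: int) -> int:
    # Assumed hash family for a file that starts with one bucket.
    return key % (2 ** level)


def client_address(key: int, q_img: int, n_img: int) -> int:
    """Address computed from a client's image (q', n')."""
    a = h(q_img, key)
    if a < n_img:
        a = h(q_img + 1, key)
    return a


def bucket_level(a: int, q: int, n: int) -> int:
    # Buckets split in this round (a < n) or created in it (a >= 2**q) have level q + 1.
    return q + 1 if a < n or a >= 2 ** q else q


def server_step(a: int, j: int, key: int) -> int:
    """Check at bucket a (level j): return a if the key belongs here, else the forward address."""
    a1 = h(j, key)
    if a1 == a:
        return a
    a2 = h(j - 1, key)
    if a < a2 < a1:
        a1 = a2
    return a1


def adjust_image(q_img: int, n_img: int, a: int, j: int):
    """Apply an IAM carrying the first contacted bucket's address a and level j."""
    if j > q_img:
        q_img, n_img = j - 1, a + 1
        if n_img >= 2 ** q_img:
            q_img, n_img = q_img + 1, 0
    return q_img, n_img


def insert(key: int, q: int, n: int, q_img: int, n_img: int):
    """Deliver key from a client image to the file (q, n).
    Returns (final bucket, forwardings, adjusted client image)."""
    first = a = client_address(key, q_img, n_img)
    hops = 0
    while True:
        nxt = server_step(a, bucket_level(a, q, n), key)
        if nxt == a:
            break
        a, hops = nxt, hops + 1
    if hops:  # an IAM is sent only when the key was forwarded
        q_img, n_img = adjust_image(q_img, n_img, first, bucket_level(first, q, n))
    return a, hops, (q_img, n_img)
```

For example, a client still holding the initial image (0, 0) that inserts key 3 into a four-bucket file (q = 2, n = 0) needs two forwardings, 0 to 1 to 3, and its image becomes (1, 1); inserting the same key again from that image needs one forwarding and leaves the image at (2, 0), the file's actual state.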
Such LH* schemes typically guarantee that all data remain available, i.e., recoverable, as long as no more than s sites (buckets) fail. The value of s is a parameter chosen at file creation. Such s-availability schemes suffice for static files. However, such prior art static schemes do not work well for dynamic files, whose size scales. For instance, for a given value of s, i.e., the file remains available as long as no more than s buckets fail, the file reliability, i.e., the probability that no data is lost, decreases as the file grows.
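The effect of a fixed s can be made concrete under an assumed failure model that this text does not specify: each of M buckets fails independently with probability p, and the file survives if at most s buckets fail. The reliability is then a binomial tail, sketched below with a hypothetical function name.

```python
from math import comb


def file_reliability(buckets: int, s: int, p: float) -> float:
    """P(at most s of `buckets` fail), assuming independent failures with probability p each."""
    return sum(
        comb(buckets, k) * p ** k * (1 - p) ** (buckets - k)
        for k in range(s + 1)
    )
```

With p = 0.001 and s = 1, the computed reliability falls steadily as M grows from 10 to 10,000, so a scheme whose s is fixed at file creation cannot hold reliability constant while an LH* file scales.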