Databases are used to store information and the relations between pieces of information. Some databases are stored in a distributed file system on multiple computers connected by a network or by multiple interconnected networks. Typically, the database is divided into partitions of equal, similar or dissimilar sizes, and one or more partitions of informational data are stored in memory devices associated with multiple servers connected by a system of networks.
In some existing distributed file systems, such as a typical Hadoop distributed file system (HDFS) or a typical peer-to-peer (P2P) distributed file system, individual database files are divided into data blocks, which are distributed to multiple computer nodes in order to maintain data redundancy and achieve workload balance. Some existing distributed file systems utilize a relatively high number of nodes, in some cases thousands, and that number varies over time as nodes are dynamically added to and dropped from the cluster. The computing nodes in these systems typically consist of commodity hardware, such as personal computers (PCs), and may include individual user systems. The existing approaches can achieve significant data redundancy and relatively high availability of data for throughput-oriented, large-scale, and loosely coupled distributed systems, such as HDFS and P2P distributed file systems.
Other existing distributed file systems, including massively parallel processing (MPP) database management systems, scale out by distributing data to a cluster or multiple clusters of relatively high-performance servers, such as data warehouse appliances, as well as by running individual computing transactions in parallel on multiple servers. MPP database systems typically include fewer nodes than HDFS or P2P distributed file systems, and the number of nodes in an MPP database system generally is relatively stable compared to HDFS or P2P. Thus, at times MPP database performance is dominated by slower individual nodes in the cluster of parallel processing computers. As a result, workload balance sometimes becomes the key factor in achieving both high throughput and low latency. In high-performance, server-based distributed file systems, even distribution of the data among the distributed server nodes generally can balance the workload, assuming the data on the individual nodes receives similar access to computing resources.
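As a rough illustration of why the slowest node can dominate, the following minimal sketch models a query that scans every node in parallel, so that the query completes only when the most heavily loaded node finishes. The function name, the row counts, and the per-node scan rate are hypothetical values chosen for illustration and do not describe any particular MPP product.

```python
def parallel_query_time(rows_per_node, rows_per_second):
    """Completion time of a scan that runs on all nodes in parallel.

    Every node works at the same (assumed) rate, so the query finishes
    when the node holding the most rows finishes.
    """
    return max(rows / rows_per_second for rows in rows_per_node)


SCAN_RATE = 100_000  # hypothetical rows scanned per second on each node

# Both layouts hold the same 1,000,000 rows on four nodes.
even_layout = [250_000, 250_000, 250_000, 250_000]
skewed_layout = [400_000, 200_000, 200_000, 200_000]

even_time = parallel_query_time(even_layout, SCAN_RATE)      # 2.5 seconds
skewed_time = parallel_query_time(skewed_layout, SCAN_RATE)  # 4.0 seconds
```

Under these assumptions the skewed layout takes 4.0 seconds instead of 2.5 seconds, even though the total amount of data and the total computing capacity are unchanged.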
A typical MPP database cluster uses a “shared-nothing” storage architecture, wherein storage hardware, such as disk drives, is not shared between individual nodes in the cluster. In an MPP database system, data location information is unavailable at runtime. In addition, MPP database clusters increasingly utilize commodity hardware resources, such as PCs. Thus, if an individual server in the cluster should fail, the data stored on the failed server can become unavailable to the cluster. The design of high-availability MPP database systems requires that the cluster be capable of tolerating individual server failures while supporting continuous service without data loss. Data replication in multiple servers has been combined with failover procedures that can be conducted upon an individual server failure in order to achieve a general level of tolerance for server failures.
Furthermore, particularly with regard to node failure and during redundancy restoration, workload balance generally has not been given high design priority. For example, consistent hashing algorithms are often used to locate distributed and duplicated data in peer-to-peer distributed file systems. In consistent hashing algorithms, the output range of a hash function is treated as a fixed circular space or ring (for example, the largest hash value wraps around to the smallest hash value). Each system node is assigned a value within this space, representing the node's position on the ring. Each data item is assigned to one of the nodes in the ring by hashing the identification or key associated with the data item to yield a unique position on the ring, and then “walking” or moving clockwise around the ring to find the first server with a position larger than that of the item.
That is to say, each node becomes responsible for the region of the ring between the node and its predecessor node on the ring. Then, when a node fails, the corresponding workload is forwarded to the next server on the ring. Similarly, when a new node is added to the ring, the new node shares only the workload of the neighboring server in the ring, resulting in an unbalanced workload. While this sort of unbalanced workload may be reasonably well tolerated in a cluster that includes thousands of nodes, the same level of workload imbalance may have a considerably negative effect on performance in an MPP database. As a result, some existing file distribution methodologies can have drawbacks when used in MPP database systems, where both throughput and latency are of relatively high importance.
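The ring lookup and the resulting imbalance can be sketched as follows. This is a minimal illustration under stated assumptions, not an implementation taken from any particular system: the ring size of 360, the fixed node positions, the choice of MD5, and the names `ring_position`, `owner`, and `load_per_node` are all hypothetical. Nodes are placed at fixed positions rather than hashed so that the arithmetic is reproducible.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 360  # small hypothetical ring so the numbers are easy to follow


def ring_position(key: str) -> int:
    """Hash a key onto the ring (MD5 is an arbitrary, illustrative choice)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % RING_SIZE


def owner(sorted_node_positions, item_position):
    """Walk clockwise from the item to the first node whose position is
    strictly larger, wrapping to the smallest position at the end of the ring."""
    idx = bisect.bisect_right(sorted_node_positions, item_position)
    return sorted_node_positions[idx % len(sorted_node_positions)]


def load_per_node(node_positions, item_positions):
    """Count how many items each node is responsible for."""
    ordered = sorted(node_positions)
    return dict(Counter(owner(ordered, p) for p in item_positions))


# Assumed layout: four evenly spaced nodes and one item at every ring position.
nodes = [0, 90, 180, 270]
items = range(RING_SIZE)

before = load_per_node(nodes, items)
# The node at position 90 fails: its region is forwarded to the next node (180).
after_failure = load_per_node([n for n in nodes if n != 90], items)
# A new node joins at position 45: it relieves only its neighbor at 90.
after_join = load_per_node(nodes + [45], items)

# Locating a single keyed item uses the same walk.
example_owner = owner(sorted(nodes), ring_position("customer:42"))
```

With this assumed layout, each node is responsible for 90 items before any change. After the node at position 90 fails, the node at position 180 is responsible for 180 items while the others remain at 90. After a node joins at position 45, the new node and the node at position 90 hold 45 items each, and the remaining nodes are unaffected.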