Distributed computing and distributed algorithms have become prevalent in a wide variety of contexts, for reasons of increased performance and load capacity, high availability and failover, and faster access to data. Distributed computing typically involves a number of autonomous computers (also called nodes) that communicate with one another to accomplish a task, such as executing an application, solving complex computational problems or providing users with access to various services. Each of the computer nodes typically includes its own processor(s), memory and a communication link to other nodes. The computers can be located within a particular location (e.g. a cluster network) or can be connected over a wide area network (WAN) such as the Internet. In most cases, distributed computers use messages to communicate with one another and to coordinate task processing and data management.
Data management is a significant issue in distributed computing. In the context of a cluster network, large data sets can be partitioned among the various nodes of the cluster. Each node usually stores a number of such partitions (subparts of the entire data set) and performs transactions on the partitions. In many cases, partitions include primary and backup copies of data that are distributed among the cluster members for purposes of failover. The distribution of data in this partitioned manner can improve manageability, performance and availability of information.
A number of constraints and concerns influence the ideal distribution of data within this context. For example, moving data from one server to another costs time and/or processor capacity. For high availability reasons, it is often preferable to locate the primary and backup copies of data on physically distinct machines. Additionally, for performance, scalability, and capacity reasons, it is often preferable to balance the distribution of data roughly equally among the available storage servers and to adjust that distribution when nodes are added to or removed from the cluster.
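Two of these constraints, keeping the primary and backup copies on physically distinct machines and balancing copies roughly equally across nodes, can be illustrated with a minimal sketch. The function name `assign_partitions`, the `(node_id, machine_id)` tuple layout and the greedy least-loaded heuristic are all assumptions made for illustration; they are not taken from any particular system and do not account for the cost of moving data.

```python
def assign_partitions(partition_count, nodes):
    """Greedy placement sketch (hypothetical, for illustration only).

    nodes: list of (node_id, machine_id) tuples.
    Returns {partition: (primary_node, backup_node)}.
    """
    machine_of = dict(nodes)
    if len(set(machine_of.values())) < 2:
        raise ValueError("backups need at least two distinct machines")

    primaries = {node_id: 0 for node_id in machine_of}  # primary copies per node
    total = {node_id: 0 for node_id in machine_of}      # all copies per node
    assignment = {}

    for partition in range(partition_count):
        # Primary: node with the fewest primaries, then the fewest copies overall.
        primary = min(machine_of, key=lambda n: (primaries[n], total[n]))
        primaries[primary] += 1
        total[primary] += 1

        # Backup: least-loaded node that sits on a different physical machine.
        candidates = [n for n in machine_of
                      if machine_of[n] != machine_of[primary]]
        backup = min(candidates, key=lambda n: total[n])
        total[backup] += 1

        assignment[partition] = (primary, backup)

    return assignment
```

Calling `assign_partitions(8, [("n1", "m1"), ("n2", "m1"), ("n3", "m2"), ("n4", "m2")])` spreads the eight partitions across the four nodes while never placing a partition's backup on the same machine as its primary. A real strategy would also need to handle nodes joining or leaving the cluster, which this sketch does by simply recomputing the assignment from scratch.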
In some use cases, further application-specific preferences may apply. For example, specifying that a particular set of data should be located on a specific server can be useful under certain circumstances. Furthermore, it may be desirable to specify that the distribution should use runtime feedback and inputs to place data among the node members. In light of the foregoing, what is needed is a simple and efficient way to address all or many of these data distribution concerns and to optimize the balancing of partitions among the distributed cluster members.