Data distribution, otherwise known as data deployment, data logistics, or data replication, includes the placement and maintenance of replicated data at multiple data sites across a network. Historically, data distribution has been either point-to-point, i.e., communication from one location to another, or multipoint, i.e., communication from one location to many. However, such data distribution has many drawbacks. For example, if multiple clients simultaneously request the same file from the same server, the server may become overloaded and no longer be able to respond efficiently to normal requests. This condition is commonly known as denial of service.
Clients and servers may be widely separated from one another. Therefore, communication between the clients and servers may consume valuable system resources, where system resources are the components that provide the network's inherent capabilities and contribute to its overall performance. System resources include routers, switches, dedicated digital circuits, bandwidth, memory, hard disk space, etc.
Still further, distributing data between widely dispersed data sites is often unreliable, as the greater the distance between data sites, the higher the probability of delays, packet loss, and system malfunction. Such data distribution between widely dispersed data sites is also typically slow due to the large distances that the data, and any acknowledgements of the receipt of such data, must travel.
The above-mentioned drawbacks are compounded when large volumes of data, such as terabytes, are to be transferred between dispersed data sites.
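The effect of distance on acknowledged transfers can be illustrated with a rough calculation. The following is a minimal sketch, assuming a window-based protocol in which only a fixed number of bytes may be outstanding before an acknowledgement returns; the function names, the window size, and the round-trip time are hypothetical values chosen for illustration and do not describe any particular system.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput, in bits per second, when at most
    window_bytes may be unacknowledged at any time (assumed model)."""
    if window_bytes <= 0 or rtt_seconds <= 0:
        raise ValueError("window_bytes and rtt_seconds must be positive")
    # One full window can be delivered per round trip, at best.
    return window_bytes * 8 / rtt_seconds


def transfer_time_seconds(total_bytes: int, window_bytes: int,
                          rtt_seconds: float) -> float:
    """Lower bound on the time needed to move total_bytes under the
    same assumed model, ignoring packet loss and retransmission."""
    return total_bytes * 8 / max_throughput_bps(window_bytes, rtt_seconds)


if __name__ == "__main__":
    # Hypothetical inputs: 1 terabyte, 64 KiB window, 100 ms round trip.
    seconds = transfer_time_seconds(10**12, 64 * 1024, 0.1)
    print(f"at least {seconds / 86400:.1f} days")
```

Under this simplified model, doubling the round-trip time doubles the minimum transfer time regardless of the raw capacity of the links, which is one way the distance between data sites can dominate the cost of moving terabytes of data.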
Additionally, as the number of machines and data sites increases within a network, scalability becomes an issue. For example, many current data distribution systems require some form of centralized control. As such networks grow, the centralized control must handle more and more requests. The centralized control unit can become overwhelmed with requests and may become a bottleneck for the entire network. Additionally, the network may become vulnerable to inoperability due to failure of the centralized control unit. As a result, centralized control becomes an increasing liability as the network grows.
Some mechanisms have been developed in an attempt to address the scalability issue, including various public domain peer-to-peer distribution systems. However, these systems are not optimal, as they do not account for global resource constraints when scheduling data transfer operations. Ignorance of global resource constraints can lead to decreased aggregate throughput, due to collisions and packet drops within the network. Ignorance of global resource constraints also makes prioritization of file transfers more difficult.
Accordingly, a system and method for reliably distributing large amounts of data between widely dispersed data sites would be highly desirable. Furthermore, it would also be highly desirable if such a system were easily scalable.