U.S. patent application Ser. No. 10/993,536, assigned to the assignee of the present invention, describes a distributed storage system for a large number of immutable objects, possibly on the order of billions of objects. Such systems may be implemented with replicated objects and replicated index servers. These replicas are maintained in a flat namespace, referenced by a globally unique identifier (GUID), and in general have no locking semantics. In general, there are at least two replicas for each object, but more typically there are three or four replicas for each object, depending on the owner's reliability policy.
To implement such a large storage system, the system combines many storage units, referred to as bricks (or nodes), where in general each brick has a processor (CPU), memory, and one or more disks for storage. In a large brick storage system, individual disk or brick failures are relatively frequent. To tolerate these failures, each object has multiple replicas placed among different bricks in the system, such that even if some replicas are not available due to disk or brick failures, others can still be accessed. Moreover, when a replica is lost, a new replica needs to be created on another brick that is different from the bricks that contain the remaining replicas. This is to keep the replication degree and maintain the reliability of the object. The process of copying replicas to newly selected bricks when a brick fails is called data repair. The brick from which the replica is copied is referred to as the repair source, and the new brick to which the replica is copied is referred to as the repair destination.
To facilitate data repair, it is desirable that data repair can be done in parallel by many bricks. For example, consider a brick that contains 200 GB of data. If that brick fails, and only one other brick acts as the repair source or destination during the copy of all replicas that were on the failed brick, it will take about 2.8 hours to complete the repair, given a disk bandwidth of around 20 MB per second. However, if 200 bricks are involved in repairing the 200 GB of data in parallel (with 100 bricks as the repair sources and 100 bricks as the repair destinations), data repair of the 200 GB of data can be done in 100 seconds. As can be readily appreciated, such fast parallel repair significantly reduces the window of data vulnerability, and thus fast repair is desirable to reduce system data loss and improve system reliability.
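The arithmetic behind these two estimates can be checked with a short sketch. The function name and parameters below are illustrative only, and the sketch assumes decimal units (1 GB = 1000 MB), an even split of the data across source/destination pairs, and disk bandwidth as the only bottleneck.

```python
def repair_time_seconds(data_gb, disk_mb_per_s, pairs):
    """Time to copy data_gb when the work is split evenly across
    `pairs` independent source/destination brick pairs.

    Assumes decimal units (1 GB = 1000 MB) and that each pair is
    limited only by its own disk bandwidth.
    """
    data_mb = data_gb * 1000
    return data_mb / (disk_mb_per_s * pairs)


# One source/destination pair handles the whole 200 GB.
serial = repair_time_seconds(200, 20, 1)

# 100 sources and 100 destinations, i.e. 100 pairs working in parallel.
parallel = repair_time_seconds(200, 20, 100)

print(f"serial:   {serial:.0f} s (about {serial / 3600:.1f} hours)")
print(f"parallel: {parallel:.0f} s")
```

Under these assumptions the serial case takes 10,000 seconds (about 2.8 hours) and the parallel case takes 100 seconds, matching the figures above.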
One way to achieve fast parallel repair is to place object replicas randomly among the bricks in the system, while ensuring that no one brick contains multiple replicas of the same object. In this scenario, when a brick fails, many other bricks in the system contain the remaining replicas of the objects that were hosted on the failed brick, so they can act as the repair sources, initiate repair by randomly selecting other bricks as the destinations, and start the data repair process mostly in parallel.
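A minimal sketch of this random placement and repair scheme follows. It is an illustration under stated assumptions, not the referenced application's implementation: the function names, the brick count of 50, the replication degree of 3, and the choice of a uniformly random surviving holder as the repair source are all hypothetical.

```python
import random


def place_replicas(num_bricks, degree, rng):
    """Pick `degree` distinct bricks uniformly at random for one object,
    so no brick holds two replicas of the same object."""
    return rng.sample(range(num_bricks), degree)


def plan_repair(placement, failed, num_bricks, rng):
    """For each object that lost a replica on `failed`, choose a surviving
    holder as the repair source and a random brick that does not already
    hold the object as the repair destination."""
    plan = []
    for obj, bricks in placement.items():
        if failed not in bricks:
            continue
        survivors = [b for b in bricks if b != failed]
        source = rng.choice(survivors)
        candidates = [b for b in range(num_bricks)
                      if b != failed and b not in survivors]
        dest = rng.choice(candidates)
        plan.append((obj, source, dest))
    return plan


rng = random.Random(0)          # fixed seed so the run is repeatable
NUM_BRICKS, DEGREE = 50, 3      # hypothetical system size and replication degree
placement = {obj: place_replicas(NUM_BRICKS, DEGREE, rng) for obj in range(2000)}

plan = plan_repair(placement, failed=7, num_bricks=NUM_BRICKS, rng=rng)
sources = {src for _, src, _ in plan}
dests = {dst for _, _, dst in plan}
print(len(plan), "replicas to repair,",
      len(sources), "distinct sources,", len(dests), "distinct destinations")
```

Because the surviving replicas are scattered across the system, the repair work in this sketch is spread over many source and destination bricks rather than a handful.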
However, a pure random placement policy to facilitate fast repair is in conflict with the concept of load balancing. More particularly, as the system evolves and old, failed bricks are replaced by new bricks, newly added bricks will be much less loaded than the bricks that have been running in the system for a long time. If the loads are imbalanced, low-load bricks are not fully utilized, while high-load bricks receive most access requests and thus the overall system performance is reduced.
To address the load balancing issue, a placement policy may prefer low-load bricks over high-load bricks when placing new object replicas in the system. However, if not carefully designed, such a load balancing policy may work against fast parallel repair. For example, if there are five bricks that have a very small load compared with the remaining bricks, and all or most new objects being checked in are placed among these five bricks for load balancing purposes, then when one of the five bricks fails, the remaining four bricks need to perform most of the data repair task, whereby data repair can take a very long time.
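The five-brick example can be illustrated with a small simulation. This is a sketch under assumed parameters (100 bricks, replication degree 3, 1000 new objects, bricks 0 through 4 as the low-load bricks), none of which come from the text; it simply counts how many bricks could act as repair sources after one brick fails.

```python
import random


def repair_sources(placement, failed):
    """Bricks that hold a surviving replica of any object lost on `failed`."""
    sources = set()
    for bricks in placement:
        if failed in bricks:
            sources.update(b for b in bricks if b != failed)
    return sources


rng = random.Random(1)                      # fixed seed for repeatability
NUM_BRICKS, DEGREE, NUM_OBJECTS = 100, 3, 1000   # hypothetical parameters
low_load = list(range(5))                   # the five lightly loaded bricks

# Naive load balancing: every new object goes only to the low-load bricks.
greedy = [rng.sample(low_load, DEGREE) for _ in range(NUM_OBJECTS)]

# Pure random placement across the whole system, for comparison.
spread = [rng.sample(range(NUM_BRICKS), DEGREE) for _ in range(NUM_OBJECTS)]

greedy_sources = repair_sources(greedy, failed=0)
spread_sources = repair_sources(spread, failed=0)
print("sources after naive load balancing:", len(greedy_sources))
print("sources after random placement:    ", len(spread_sources))
```

With the naive policy, the failed brick's objects can only be repaired from the other four low-load bricks, so at most four bricks can serve as sources no matter how large the system is.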