Distributed storage systems for a large number of immutable objects, perhaps on the order of billions of objects, have traditionally been implemented on fault-tolerant systems designed to provide a high level of system reliability and availability. Such systems may be implemented with fixed placement of replicated objects and replicated index servers. Upon detecting that a node or disk has been lost, the system may need to repair each affected object, for example, by striving to reestablish the desired number of replicas for that object in the system.
As a result, prior schemes limited where objects could be placed. Previous systems usually put object replicas into a few predetermined locations. There are several known problems with this approach. First, such an approach does not allow flexible use of available space on the storage nodes in the distributed system. The situation may arise where a particular object has been predetermined to be placed on a storage node that happens to be full. As a result, maintenance tasks may need to be performed in order to create space on the storage node, such as splitting up the node or using another node. Another problem with this approach is that it may prevent optimizations such as co-locating objects. For instance, it may be advantageous to reduce access latency by placing objects that may be frequently accessed together on the same storage node. However, this may not be allowed by a system implemented with fixed placement of replicated objects.
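The contrast between fixed and flexible placement can be sketched in a few lines of Python. This is an illustration only, not a description of any particular prior system: the function names `fixed_placement` and `flexible_placement`, the hash-based mapping, and the "most free space first" rule are all assumptions chosen for the example.

```python
import hashlib


def fixed_placement(object_id: str, node_names: list[str], replicas: int) -> list[str]:
    """Hypothetical fixed scheme: the object's hash predetermines its nodes.

    Free space is never consulted, so a chosen node may already be full.
    """
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(node_names)
    # Replicas go on consecutive nodes starting at the hashed position.
    return [node_names[(start + i) % len(node_names)] for i in range(replicas)]


def flexible_placement(free_bytes: dict[str, int], size: int, replicas: int) -> list[str]:
    """Hypothetical flexible scheme: any node with enough room is a candidate."""
    candidates = [name for name, free in free_bytes.items() if free >= size]
    # Prefer the nodes with the most free space (an assumed policy).
    candidates.sort(key=lambda name: free_bytes[name], reverse=True)
    if len(candidates) < replicas:
        raise RuntimeError("not enough nodes with free space")
    return candidates[:replicas]
```

In this sketch the fixed scheme returns the same nodes for a given object regardless of how full they are, whereas the flexible scheme simply skips a full node. A co-location policy could be expressed in the flexible scheme by changing only the sort key.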
Furthermore, there are other problems that may arise with schemes limiting placement of object replicas to a few predetermined locations, for instance, when repairing an object with a replica lost on a failed storage node. Because the scheme may force objects to be placed on particular nodes, and all but one of those storage nodes may have failed, there may be little choice of where to place the new copies that replace those lost in the node failures. If all the copies of objects must be placed on a single remaining node, the network bandwidth or the disk bandwidth of that node may limit how fast the new copies can be written.
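The bandwidth limitation described above can be made concrete with rough arithmetic. The following is a deliberately simplified model, assuming every node has the same usable bandwidth and ignoring protocol overhead and contention; the function name `repair_time_seconds` and the figures in the usage lines (1 TB to repair, 100 MB/s per node) are illustrative assumptions, not measurements.

```python
def repair_time_seconds(bytes_to_repair: int,
                        per_node_bandwidth: int,
                        source_nodes: int,
                        target_nodes: int) -> float:
    """Estimate repair time when throughput is bounded by the scarcer side.

    per_node_bandwidth is the usable bytes per second of one node, taken as
    the smaller of its disk and network bandwidth (assumed uniform).
    """
    if per_node_bandwidth <= 0 or min(source_nodes, target_nodes) < 1:
        raise ValueError("need positive bandwidth and at least one node per side")
    # Data can flow no faster than the smaller group of nodes can move it.
    aggregate = per_node_bandwidth * min(source_nodes, target_nodes)
    return bytes_to_repair / aggregate


ONE_TB = 10**12          # bytes lost on the failed node (assumed)
HUNDRED_MB_S = 10**8     # bytes per second per node (assumed)

# Fixed placement: every new copy must land on one remaining node.
single_target = repair_time_seconds(ONE_TB, HUNDRED_MB_S, source_nodes=1, target_nodes=1)

# Flexible placement: 50 nodes read and 50 nodes write in parallel.
many_targets = repair_time_seconds(ONE_TB, HUNDRED_MB_S, source_nodes=50, target_nodes=50)
```

Under these assumptions the single-node repair takes 10,000 seconds (close to three hours), while spreading the work over 50 nodes takes 200 seconds. The shorter the repair, the shorter the window in which a further failure could cause loss of objects.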
There are similar problems with a fixed placement scheme when a storage node fails. Only a few other storage nodes may participate in repairing the objects by copying the replicas to other storage nodes. As a consequence, object repair may take a long time due to disk or network bandwidth limitations, and further crashes during this window of vulnerability may cause a loss of objects. Moreover, a fixed placement scheme cannot support the more flexible placement policies that an application may require. For example, an application may have better knowledge of the semantics of the objects and may prefer to place certain object replicas together on certain storage nodes to reduce access latency.
Another limitation of previous schemes is that all or part of an index residing on a node may be lost when the node fails. Previous schemes may replicate the indexes so that when an index node fails, there may be a replica that may be used to continue to locate objects. Previous schemes have also tried to persist the indexes on disk so that after transient failures such as node reboots, an index may be restored from disk once the node is operable again. Because indexes may be concurrently read and written, the indexes are typically replicated as well. In general, such schemes may require complicated fault-tolerant approaches to maintain the indexes and may incur runtime overhead for maintaining replicated or persistent indexes.
What is needed is a way to provide a high level of availability and reliability for a distributed object store without resorting to fault-tolerant platforms with fixed placement of replicated objects and replicated index servers. Any such system and method should allow flexible placement of replicated objects and indexes while reducing the time and expense of maintaining the system.