A distributed storage system includes nodes coupled by network links. The nodes store data objects, which are accessed by clients. When replicas of the data objects are stored on a local node or a nearby node, a client can access the data objects in a relatively short time. An example of a distributed storage system is the Internet. According to one use, Internet users access web pages from web sites. By maintaining replicas on nodes near groups of the Internet users, access time for the Internet users is improved and network traffic is reduced.
Replicas of data objects are placed onto nodes of a distributed storage system using a data placement heuristic. The data placement heuristic attempts to find a near-optimal solution for placing the replicas onto the nodes but does so without an assurance that the near-optimal solution will be found. Broadly, data placement heuristics can be categorized as caching techniques or replication techniques. A node employing a caching technique keeps replicas of data objects accessed by the node. Variations of the caching technique include LRU (least recently used) caching and FIFO (first in first out) caching. A node employing LRU caching adds a new data object upon access by the node. To make room for the new data object, the node discards the data object whose most recent access is older than that of every other data object stored on the node. A node employing FIFO caching also adds a new data object upon access by the node, but it discards the data object that was loaded earliest, regardless of when that data object was last accessed.
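The difference between the two eviction rules can be illustrated with a minimal Python sketch. The class name NodeCache, the fetch callback, and the fixed capacity measured in number of objects are assumptions made for illustration only; they do not come from any particular system.

```python
from collections import OrderedDict


class NodeCache:
    """Fixed-capacity replica store for one node (hypothetical helper)."""

    def __init__(self, capacity, policy="LRU"):
        if policy not in ("LRU", "FIFO"):
            raise ValueError("policy must be 'LRU' or 'FIFO'")
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.policy = policy
        # Maps object id -> replica. The first entry is the next one to discard.
        self.store = OrderedDict()

    def access(self, obj_id, fetch):
        """Return the replica of obj_id, loading it with fetch() on a miss."""
        if obj_id in self.store:
            if self.policy == "LRU":
                # LRU orders entries by access time, so a hit refreshes the entry.
                self.store.move_to_end(obj_id)
            # FIFO orders entries by load time, so a hit changes nothing.
            return self.store[obj_id]
        if len(self.store) >= self.capacity:
            # Discard the least recently used (LRU) or earliest loaded (FIFO) entry.
            self.store.popitem(last=False)
        self.store[obj_id] = fetch(obj_id)
        return self.store[obj_id]


if __name__ == "__main__":
    for policy in ("LRU", "FIFO"):
        cache = NodeCache(capacity=2, policy=policy)
        for obj_id in ["a", "b", "a", "c"]:
            cache.access(obj_id, fetch=lambda key: "replica of " + key)
        print(policy, list(cache.store))
```

With the access sequence a, b, a, c and room for two objects, the LRU node keeps a and c, because the second access to a makes b the least recently used object. The FIFO node keeps b and c, because a was loaded first and the later access to a does not protect it.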
The replication techniques seek to make placement decisions about replicas of data objects typically in a more centralized manner than the caching techniques. For example, in a completely centralized replication technique, a single node of the distributed storage system decides where to place replicas of data objects for all data objects and nodes in the distributed storage system.
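As a rough illustration of the completely centralized case, the following Python sketch shows a single planner that computes a placement for every node from global information. The function name plan_placement, the per-node access counts, the per-node capacities, and the popularity-based rule are all assumptions chosen for illustration; an actual replication technique may use a different decision rule.

```python
def plan_placement(access_counts, capacity):
    """Decide, at one central node, which objects each node should replicate.

    access_counts: {node: {obj_id: number of accesses seen at that node}}
    capacity:      {node: maximum number of replicas the node can hold}
    Returns {node: set of obj_ids to place on that node}.

    Assumed rule (illustrative only): each node receives the objects
    that its own clients access most often, up to its capacity.
    """
    placement = {}
    for node, counts in access_counts.items():
        # Sort by descending access count; break ties by object id for determinism.
        ranked = sorted(counts, key=lambda obj_id: (-counts[obj_id], obj_id))
        placement[node] = set(ranked[: capacity.get(node, 0)])
    return placement


if __name__ == "__main__":
    counts = {
        "n1": {"x": 5, "y": 1, "z": 3},
        "n2": {"x": 1, "y": 4},
    }
    print(plan_placement(counts, {"n1": 2, "n2": 1}))
```

Unlike the caching techniques, where each node reacts only to its own accesses, the planner here sees the demand at every node before deciding, which is what makes the decision centralized.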
Currently, a system designer or system administrator seeking to place replicas of data objects within a distributed storage system will choose a data placement heuristic in an ad hoc manner. That is, the system designer or administrator will choose a particular data placement heuristic based upon intuition and past experience but without assurance that the data placement heuristic will perform adequately.
What is needed is a method of selecting a data placement heuristic with an expectation that the data placement heuristic will perform adequately and result in a low replication cost.