In a distributed data store, user data is kept in persistent storage (e.g., MySQL). In a social network with many millions of users, the user data is too large to store on a single database server and is therefore partitioned into shards. Each shard is stored in a logical database, and each database server manages one or more shards. Access (e.g., reads and writes) to the persistent storage can go through cache servers. Each database server maps to a cache server, and the two are tied together by a common database identifier (ID).
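The relationship between shards, database servers, and cache servers described above can be pictured with a small lookup table. The following is only an illustrative sketch: the shard numbers, the server names, and the `route` helper are hypothetical and are not taken from any particular system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A database server and its cache server, tied by a common database ID."""
    db_id: int
    db_server: str      # hypothetical host name
    cache_server: str   # hypothetical host name


# Hypothetical layout: a database server manages one or more shards,
# so several shards may share the same database ID.
SHARD_TO_DB_ID = {0: 1, 1: 1, 2: 2, 3: 3}

NODES = {
    1: Node(db_id=1, db_server="db-1", cache_server="cache-1"),
    2: Node(db_id=2, db_server="db-2", cache_server="cache-2"),
    3: Node(db_id=3, db_server="db-3", cache_server="cache-3"),
}


def route(shard: int) -> Node:
    """Return the database/cache pair responsible for a shard."""
    return NODES[SHARD_TO_DB_ID[shard]]


node = route(1)
print(f"shard 1 -> database ID {node.db_id}: {node.cache_server} in front of {node.db_server}")
```

In this sketch, shards 0 and 1 resolve to the same pair because both live on the database server with ID 1.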
Hashing is used to map a resource to a cache server (and a logical database) so that a request for that resource is routed to the cache server storing it. A hash function can distribute its input (e.g., the resources) “randomly” among the cache servers. Consequently, each cache server becomes responsible for some of the resources, which has a load balancing effect on the cache servers. However, this method of hashing has several disadvantages. For example, when a new cache server is added or one of the cache servers fails, the number of cache servers changes, and because the mapping depends on that number, the mapping itself changes. This in turn results in remapping of the resources to different cache servers. For example, suppose cache server 3 becomes unavailable in a cluster of cache servers 1-5. Then a resource “abc” that was mapped to cache server 3 gets mapped to a different cache server. In fact, because the number of cache servers is now 4, most of the resources get mapped to different cache servers, not only those that were on cache server 3. When a request for a remapped resource is received by its new cache server, the resource is not cached there, so the request results in a cache “miss.” A cache miss occurs when the requested resource is not in the cache. The new cache server will have to make database queries to fetch that resource. Moreover, a resource that is remapped between two cache servers that are both still running ends up cached at both the old cache server and the new cache server, resulting in data duplication. Both effects make the cache less efficient.
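The remapping effect can be reproduced with a few lines of code. The sketch below assumes the simplest form of this scheme, in which the hash of the resource is taken modulo the number of cache servers; the key names, the use of MD5 as a stable hash, and the helper names are illustrative choices only.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Deterministic hash; Python's built-in hash() of a str is salted per process."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def pick_server(key: str, servers: list) -> int:
    """Modulo placement: the number of servers is an input to the mapping."""
    return servers[stable_hash(key) % len(servers)]


keys = [f"resource-{i}" for i in range(10_000)]
before = [1, 2, 3, 4, 5]
after = [1, 2, 4, 5]  # cache server 3 has become unavailable

on_failed = sum(pick_server(k, before) == 3 for k in keys)
moved = sum(pick_server(k, before) != pick_server(k, after) for k in keys)

failed_fraction = on_failed / len(keys)
moved_fraction = moved / len(keys)
print(f"{failed_fraction:.0%} of keys were on server 3, "
      f"but {moved_fraction:.0%} of keys changed server")
```

With a uniform hash, only about one key in five lived on cache server 3, yet about four in five end up on a different server, because a key keeps its server only when its hash modulo 5 and its hash modulo 4 happen to select the same machine.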
To overcome these disadvantages, consistent hashing can be used to map each resource to a particular cache server in a way that stays largely stable when cache servers are added or removed. However, this type of mapping can lead to an imbalance of load across the cache servers. For example, suppose a new cache server is added to a cluster of 24 cache servers. If the new cache server has better hardware and more capacity, consistent hashing does not take that information into account. Similarly, although consistent hashing redistributes the resources more or less evenly among the cache servers, not all resources are equal. Some resources may be requested more often than others. Consequently, some of the cache servers will experience more load than others, leading to a problem of load imbalance (or skew) across the cache servers.
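Both the benefit and the limitation can be seen in a minimal hash ring. The sketch below is one common way to implement consistent hashing (a sorted ring with virtual nodes) and is not presented as the specific scheme any given system uses; the `HashRing` class, the server names, the choice of 100 virtual nodes, and the single "hot" resource are all assumptions made for illustration.

```python
import bisect
import hashlib
from collections import Counter


def stable_hash(key: str) -> int:
    """Deterministic hash so results do not depend on Python's hash salting."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


class HashRing:
    """Ring of (hash, server) points; a key belongs to the next point clockwise."""

    def __init__(self, servers, vnodes=100):
        # Note that only server names go in: capacity is not an input.
        self._points = sorted(
            (stable_hash(f"{s}#{v}"), s) for s in servers for v in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def lookup(self, key: str) -> str:
        i = bisect.bisect_right(self._hashes, stable_hash(key)) % len(self._points)
        return self._points[i][1]


keys = [f"resource-{i}" for i in range(10_000)]
old_ring = HashRing([f"cache-{n}" for n in range(1, 25)])  # 24 cache servers
new_ring = HashRing([f"cache-{n}" for n in range(1, 26)])  # a 25th is added

moved = [k for k in keys if old_ring.lookup(k) != new_ring.lookup(k)]
moved_fraction = len(moved) / len(keys)
all_moved_to_new = all(new_ring.lookup(k) == "cache-25" for k in moved)

# Skew: one frequently requested resource dominates its server's load.
requests = keys + ["resource-0"] * 5_000
load = Counter(new_ring.lookup(k) for k in requests)
hot_server = new_ring.lookup("resource-0")

print(f"{moved_fraction:.1%} of keys moved, all to the new server: {all_moved_to_new}")
print(f"{hot_server} handles {load[hot_server]} of {len(requests)} requests")
```

Only a small share of keys moves when the 25th server joins, and every moved key lands on the new server. The ring nonetheless gives the new server roughly the same share as every other server regardless of its hardware, and the server that happens to own the hot resource carries far more requests than its peers.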