Data centers include data storage nodes for storing data. For example, a data center for a web-based application such as email may employ a number of data storage nodes that are geographically distributed. Each location may include a number of storage servers for storing data relating to users' email accounts, such as the text of emails, preferences for displaying emails, and so forth. In such large, geographically distributed data centers, or even at a single location, which can store very large amounts of data, optimal data placement is an issue. Typically, some information in the data center is accessed more often than other information, or has a different access pattern. As a result, some parts of the distributed system are constantly under traffic or resource pressure while others are always underutilized. This creates an imbalance in the resource utilization of the data center.
Existing approaches to optimal data placement typically use ad hoc techniques that require intensive manual labor to distribute the data load across the nodes of the data center. These approaches lack automation and require extensive manual analysis of data access patterns, traffic behavior, and the like. Furthermore, the resulting data balancing strategy is static. Once a balancing strategy has been agreed upon based on some manual investigation, the data in the system is manually rebalanced by relocating specific data from one node to another. This is expensive in terms of the man-hours that must be spent analyzing data points to come up with a recommendation. Moreover, operators are required to be closely involved in the relocation process. Human intervention is error-prone and can fail in unexpected ways that are costly to repair. An additional drawback is that these approaches do not scale as the data center grows. In fact, the cost of data relocation can increase exponentially as the size of the data center grows.