Large-scale network-based services often require large-scale data storage. For example, Internet email services store large numbers of user inboxes, each of which itself contains a sizable quantity of data. This large-scale data storage is often implemented in datacenters composed of storage and computation devices. The storage devices are typically arranged in a cluster and hold redundant copies of the data. This redundancy is often achieved through use of a redundant array of inexpensive disks (RAID) configuration and helps minimize the risk of data loss. The computation devices are likewise typically arranged in a cluster.
Both kinds of cluster often suffer from a number of bandwidth bottlenecks that reduce datacenter efficiency. For instance, a number of storage devices or computation devices can be linked to a single network switch. Network switches are traditionally arranged in a hierarchy, with so-called “core switches” at the top, fed by “top of rack” switches, which are in turn attached to individual computation devices. The “top of rack” switches are typically provisioned with far more collective bandwidth to the devices below them in the hierarchy than to the core switches above them. This causes congestion and inefficient datacenter performance. The same is true within a storage device or computation device: a storage device is provisioned with disks having a collective bandwidth that is greater than the collective bandwidth of the network interface component(s) connecting them to the network. Likewise, computation devices are provisioned with an input/output bus having a bandwidth that is greater than the collective network interface bandwidth. In both cases, the scarcity of network bandwidth causes congestion and inefficiency.
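As a rough illustration of the imbalance described above, the following minimal sketch computes the ratio of collective bandwidth below a component to the bandwidth above it. The port counts and link speeds are hypothetical figures chosen only for the arithmetic, and the function name is an assumption, not something taken from any particular datacenter design.

```python
def oversubscription_ratio(down_links, down_gbps, up_links, up_gbps):
    """Collective bandwidth toward the devices divided by the
    collective bandwidth toward the rest of the network."""
    downstream = down_links * down_gbps
    upstream = up_links * up_gbps
    return downstream / upstream

# Hypothetical "top of rack" switch: 40 device ports at 1 Gb/s each,
# 2 uplinks to the core switches at 10 Gb/s each.
tor_ratio = oversubscription_ratio(40, 1, 2, 10)

# Hypothetical storage device: 12 disks at roughly 1 Gb/s each,
# behind a single 1 Gb/s network interface.
storage_ratio = oversubscription_ratio(12, 1, 1, 1)

print(f"top-of-rack oversubscription: {tor_ratio}:1")
print(f"storage device oversubscription: {storage_ratio}:1")
```

A ratio above 1 means the devices below the component can collectively offer more traffic than the links above it can carry, which is the congestion condition described in the preceding paragraph.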
To resolve these inefficiencies and bottlenecks, many datacenter applications are implemented according to the “MapReduce” model. In the MapReduce model, computation and storage devices are integrated such that the program reading and writing data is located on the same device as the data storage. However, the MapReduce model introduces new problems for programmers and operators, constraining how data is placed, stored, and moved in order to achieve adequate efficiency over the bandwidth-congested components. Often, this requires fragmenting a program into a series of smaller routines that run on separate systems.
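The co-location idea behind this model can be sketched in a few lines. The example below is a single-process toy, not a distributed implementation: the node names, the `shards` layout, and the functions `map_local` and `reduce_counts` are all hypothetical and stand in for routines that would run on separate devices. It shows the program being fragmented into a routine that runs where each shard of data is stored and a second routine that combines the small partial results.

```python
from collections import Counter

# Hypothetical data placement: each node stores its own shard of lines.
shards = {
    "node-a": ["alpha beta", "beta gamma"],
    "node-b": ["gamma gamma", "alpha"],
}

def map_local(lines):
    """Runs on the node that stores `lines`; only the resulting
    word counts would need to cross the network."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def reduce_counts(partials):
    """Combines the per-node partial counts into one total."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Each shard is processed by its own map step, then the results are merged.
partials = [map_local(lines) for lines in shards.values()]
result = reduce_counts(partials)
print(dict(result))
```

The sketch also hints at the constraint noted above: the computation has to be expressed as separate map and reduce routines, and its efficiency depends on how the data was sharded across nodes in the first place.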