Large-scale network-based services often require large-scale data storage. For example, Internet email services store large numbers of user inboxes, each of which itself includes a sizable quantity of data. This large-scale data storage is often implemented in datacenters composed of storage and computation devices. The storage devices are typically arranged in a cluster and include redundant copies. This redundancy is often achieved through use of a redundant array of inexpensive disks (RAID) configuration and helps minimize the risk of data loss. The computation devices are likewise typically arranged in a cluster.
Both types of cluster often suffer from a number of bandwidth bottlenecks that reduce datacenter efficiency. For instance, a number of storage devices or computation devices can be linked to a single network switch. Network switches are traditionally arranged in a hierarchy, with so-called “core switches” at the top, fed by “top of rack” switches, which are in turn attached to individual computation devices. The “top of rack” switches are typically provisioned with far more bandwidth to the devices below them in the hierarchy than to the core switches above them. This causes congestion and inefficient datacenter performance. The same is true within a storage device or computation device: a storage device is provisioned with disks having a collective bandwidth that is greater than the collective bandwidth of its network interface components. Likewise, computation devices are provisioned with an input/output bus having a bandwidth that is greater than the collective network interface bandwidth.
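The mismatch described above can be expressed as a simple ratio of device-facing bandwidth to core-facing bandwidth. The following is a minimal sketch only; the function name and the figures used (40 devices at 1 Gbps each sharing a 10 Gbps uplink) are hypothetical values chosen for illustration and do not come from any particular datacenter.

```python
def oversubscription_ratio(per_device_gbps, num_devices, uplink_gbps):
    """Return total downlink bandwidth divided by uplink bandwidth.

    A result greater than 1 means the switch cannot forward all device
    traffic upward at once, so the uplink becomes a bottleneck.
    """
    total_downlink = per_device_gbps * num_devices
    return total_downlink / uplink_gbps


# Hypothetical top-of-rack switch: 40 devices at 1 Gbps, one 10 Gbps uplink.
ratio = oversubscription_ratio(1.0, 40, 10.0)
print(ratio)
```

The same ratio can be applied inside a single device, for example by passing the collective disk bandwidth as the downlink side and the network interface bandwidth as the uplink side.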
To increase efficiency, many datacenter applications are implemented according to the Map-Reduce model. In the Map-Reduce model, computation and storage devices are integrated such that the program reading and writing data is located on the same device as the data storage. The Map-Reduce model introduces new problems for programmers and operators, constraining how data is placed, stored, and moved to achieve adequate efficiency over the bandwidth-congested components. Often, this may require fragmenting a program into a series of smaller routines to run on separate systems.
In addition to bottlenecks caused by network bandwidth, datacenters also experience delays when retrieving large files from storage devices. Because each file is usually stored contiguously, the entire file is retrieved from a single storage device. Thus, the full bandwidth of the single storage device is consumed in transmitting the file while other storage devices sit idle.
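The cost of serving a contiguous file from one device can be estimated with simple arithmetic. The sketch below is illustrative only; the function names, the 10 GB file size, the 100 MB/s device bandwidth, and the cluster size of 10 devices are all assumptions introduced here.

```python
def retrieval_seconds(file_bytes, device_bytes_per_sec):
    """Time to read a whole file when a single device serves all of it."""
    return file_bytes / device_bytes_per_sec


def idle_fraction(num_devices):
    """Fraction of devices contributing nothing while one device serves the file."""
    return (num_devices - 1) / num_devices


# Hypothetical figures: a 10 GB file on a device that sustains 100 MB/s.
seconds = retrieval_seconds(10 * 10**9, 100 * 10**6)
print(seconds)            # retrieval time in seconds
print(idle_fraction(10))  # share of a 10-device cluster left idle
```

Under these assumed figures the retrieval time is governed entirely by the one device holding the file, regardless of how many other devices the cluster contains.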
Also, datacenter efficiency is often affected by failures of storage devices. While the data on a failed storage device is usually backed up on another device, as mentioned above, it often takes a significant amount of time for the device storing the backed-up data to make an additional copy on an additional device. In making the copy, the datacenter is limited to the bandwidths of the device making the copy and the device receiving the copy. The bandwidths of the other devices of the datacenter are not used.
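Because only two devices take part in the copy, the time to restore redundancy is bounded by the slower of the two. The following is a rough sketch under stated assumptions; the function name and the figures (1 TB of data, a 100 MB/s source, an 80 MB/s destination) are hypothetical.

```python
def recopy_seconds(data_bytes, source_bytes_per_sec, dest_bytes_per_sec):
    """Estimate the time to re-replicate data from one device to another.

    The transfer can go no faster than the slower endpoint, so the
    bottleneck is the minimum of the two bandwidths.
    """
    bottleneck = min(source_bytes_per_sec, dest_bytes_per_sec)
    return data_bytes / bottleneck


# Hypothetical figures: 1 TB of data, 100 MB/s source, 80 MB/s destination.
seconds = recopy_seconds(10**12, 100 * 10**6, 80 * 10**6)
print(seconds / 3600)  # duration in hours
```

Adding more devices to the datacenter does not change this estimate, since their bandwidth is not used for the copy.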
Additionally, to efficiently restore a failed storage device, the storage device and its replica utilize a table identifying the files stored on the storage device and their locations. Without such a table, the entire storage device must be scanned to identify files and their locations. Use of tables also introduces inefficiencies, however. Since the table is often stored at a different location on the storage device than the location being written to or read from, the component performing the reading or writing and the table updating must move across the storage device. Such movements across the storage device are often relatively slow.