The present invention relates generally to data storage systems, and to systems and methods to improve storage efficiency, compactness, performance, reliability, and compatibility. In general, data storage systems receive and store all or portions of arbitrary sets or streams of data. Data storage systems also retrieve all or portions of arbitrary sets or streams of data. A data storage system provides data storage and retrieval to one or more storage clients, such as user and server computers. Stored data may be referenced by unique identifiers and/or addresses or indices. In some implementations, the data storage system uses a file system to organize data streams into files. Files may be identified and accessed by a file system path, which may include a file name and one or more hierarchical file system directories. In other implementations, data streams may be arbitrary sets of data that are not associated with any type of file system or other hierarchy.
Cloud storage services are one type of data storage available via a wide-area network (WAN). Cloud storage services provide storage to users in the form of a virtualized storage device available via a WAN, such as the Internet or a private WAN. In general, users access cloud storage services to store and retrieve data using web services protocols, such as REST, SOAP, or XML-RPC. Cloud storage service providers manage the operation and maintenance of the physical data storage devices; therefore, users of cloud storage services can avoid the initial and ongoing costs associated with buying and maintaining storage devices. Users of cloud storage services also avoid the administrative complexity arising from configuring, managing, and maintaining their own data storage systems. Cloud storage services typically charge users for consumption of storage resources, such as storage space and/or transfer bandwidth, on a marginal or subscription basis, with little or no upfront cost. In addition to the cost and administrative advantages, cloud storage services often provide dynamically scalable capacity to meet their users' changing needs.
Many data storage systems are tasked with handling enormous amounts of data. Although cloud storage services often have sufficient storage capacity to store large data sets, the bandwidth limitations of the WAN connecting storage clients with the cloud storage service make transferring large data sets to the cloud storage service time-consuming.
To reduce the time and bandwidth required to transfer large data sets over a WAN, WAN optimization devices may be used in pairs, with one device on each side of the WAN connection. WAN optimization typically improves the data transfer rates over a WAN by compressing data at the source using a first WAN optimization device, communicating the compressed data via the WAN to a destination, and then decompressing the data at the destination using a second WAN optimization device. However, this double-ended WAN optimization technique is difficult and expensive to use with cloud storage services, because the cloud storage service provider must purchase and configure a WAN optimization device to work with the WAN optimization devices at the storage clients' locations.
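The double-ended arrangement described above can be illustrated with a minimal sketch. It assumes general-purpose zlib compression as a stand-in for whatever technique an actual WAN optimization device applies; the function names `source_optimizer` and `destination_optimizer` are hypothetical and chosen only for illustration.

```python
import zlib

def source_optimizer(data: bytes) -> bytes:
    """Hypothetical first device: compress data before it crosses the WAN."""
    return zlib.compress(data, 9)

def destination_optimizer(payload: bytes) -> bytes:
    """Hypothetical second device: restore the data to its original form."""
    return zlib.decompress(payload)

# Illustrative, highly repetitive payload (an assumption, not measured data).
original = b"block of repetitive storage data " * 1000

on_the_wire = source_optimizer(original)       # what actually traverses the WAN
restored = destination_optimizer(on_the_wire)  # what the destination receives

print(len(original), len(on_the_wire), restored == original)
```

The sketch makes the dependency visible: the destination recovers the original data only if it runs a device matching the one at the source, which is why the cloud storage service provider would need its own device.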
Another prior approach to reducing the time and bandwidth required to transfer data to a cloud storage service over a WAN is to store the data in its compressed form in the cloud storage service. In this approach, a WAN optimization device compresses the data at the source and communicates the compressed data via the WAN to the cloud storage service. The cloud storage service then stores the data in its compressed form. Although this approach eliminates the need for a second WAN optimization device at the cloud storage service, the stored data is no longer in its native, uncompressed form. As a result, any storage client that wishes to read the data stored in the cloud storage service must include or be associated with a WAN optimization device to decompress and convert the data back to its original form. This requirement is especially burdensome where the cloud data storage is used to distribute data to a large number of users (such as part of a content distribution network) or where the cloud data storage is used to deploy applications and data to distributed or cloud computing systems.
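The drawback of storing data in compressed form can be shown with a small sketch. It assumes an in-memory dictionary as a stand-in for the cloud storage service and zlib as a stand-in for the WAN optimization device's encoding; `cloud_store`, `optimized_put`, `plain_get`, and `optimized_get` are hypothetical names used only for illustration.

```python
import zlib

cloud_store = {}  # hypothetical stand-in for a cloud storage service

def optimized_put(key: str, data: bytes) -> None:
    """Source-side device: compress, then store the compressed form as-is."""
    cloud_store[key] = zlib.compress(data)

def plain_get(key: str) -> bytes:
    """Storage client with no optimization device: reads the stored bytes."""
    return cloud_store[key]

def optimized_get(key: str) -> bytes:
    """Storage client with an optimization device: reads, then decompresses."""
    return zlib.decompress(cloud_store[key])

# Illustrative payload and key (assumptions made for this sketch).
document = b"application image contents " * 500
optimized_put("images/app.bin", document)

print(plain_get("images/app.bin") == document)      # client without a device
print(optimized_get("images/app.bin") == document)  # client with a device
```

Under these assumptions, the client without a device receives bytes that differ from the native data, so every reader of the stored data would need decompression capability of its own.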