Cloud storage enables data to be stored on the Internet at a remote storage site rather than, or in addition to, storing data on-premises. Cloud storage typically refers to a hosted object storage service. In some cases, cloud storage may offer a massively scalable object store for data objects, a file system service for the cloud, a messaging store for reliable messaging, and the like. Redundancy within cloud storage is used to ensure that data is safe in the event of transient hardware failures. Data may be replicated across datacenters or geographical regions of the cloud storage for additional protection. Data that is written to cloud storage may also be encrypted to ensure security. Cloud storage may provide fine-grained control over who has access to data. In addition, providers may handle maintenance and any critical problems that occur with the cloud storage and its services, thereby relieving clients of such tasks. Cloud storage is also accessible on a global basis, making access to data more convenient.
Cloud storage may include a layered storage architecture that uses, at its lowest layer, large append-only files which can be referred to as “extents.” The extents are often replicated (e.g., three-way replicated, etc.) across multiple storage nodes for data durability. Multiple user blobs of arbitrary size may be collocated in the same extent, another common technique designed to maximize the bandwidth of the underlying storage media. As blobs are deleted and/or overwritten by a user, the blobs no longer in use leave holes of unused space within the extent. Because extents are append-only, the holes remain unusable until the entire extent is reclaimed by a background garbage collection job that gathers the blobs still in use from the extent and rewrites them into a new extent. The garbage collection process then returns the old extent to a pool where it can be reused for storage.
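The extent life cycle described above can be illustrated with a minimal in-memory sketch. It is only an illustration under stated assumptions: the names `Extent`, `collect`, and `free_pool` are hypothetical and do not come from any particular storage system, an extent is modeled as a byte array rather than a file, and replication across storage nodes is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    """Hypothetical append-only extent: bytes are only ever added at the end."""
    data: bytearray = field(default_factory=bytearray)
    # Index of blobs still in use: blob_id -> (offset, length)
    live: dict = field(default_factory=dict)

    def append(self, blob_id, payload):
        # An overwrite is just another append; the index moves to the new
        # copy and the old bytes become a hole.
        self.live[blob_id] = (len(self.data), len(payload))
        self.data.extend(payload)

    def delete(self, blob_id):
        # Deleting only drops the index entry; the bytes stay as a hole.
        self.live.pop(blob_id, None)

    def read(self, blob_id):
        offset, length = self.live[blob_id]
        return bytes(self.data[offset:offset + length])

    def hole_bytes(self):
        # Space occupied by blobs that are no longer in use.
        return len(self.data) - sum(length for _, length in self.live.values())


def collect(old, free_pool):
    """Copy in-use blobs from `old` into a new extent, then recycle `old`."""
    new = Extent()
    for blob_id in list(old.live):
        new.append(blob_id, old.read(blob_id))
    # The old extent is emptied and returned to the pool for reuse.
    old.data.clear()
    old.live.clear()
    free_pool.append(old)
    return new
```

In this sketch, deleting or overwriting a blob never frees space in place; `hole_bytes()` only drops to zero once `collect` has copied the surviving blobs into a fresh extent, which mirrors why the holes stay unusable until the whole extent is reclaimed.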
One of the requirements of cloud storage is to ensure data durability. Accordingly, the new extent is replicated across multiple nodes to guard against failure at any one node. Furthermore, the replicated extent is a temporary state: when the extent fills up, it receives additional processing such as erasure coding, after which the replicated extent is deleted. However, this replication process consumes network resources and requires the cloud to redundantly store the same extent on multiple servers. Accordingly, what is needed is an improved process for durable storage of in-use data collected through garbage collection.