The enterprise computing landscape has changed. The central-service architecture has given way to distributed storage clusters. Such clusters can be built from commodity PCs and can deliver high performance, availability, and scalability at a lower cost than monolithic disk arrays. Data is replicated across multiple geographic locations, which increases availability and reduces the network distance to clients.
In a distributed storage system, objects (e.g., blobs) are dynamically created in different clusters. New visualization, multimedia, and other data-intensive applications use very large objects, with individual objects consuming hundreds of gigabytes of storage space or more, and the trend toward larger objects is expected to continue. When such objects are uploaded into a distributed storage system, other objects with identical content may already exist, or multiple instances of an object that suddenly became popular may be uploaded around the same time (e.g., an attachment to an email delivered to multiple recipients). De-duplication is a technique that stores a single physical replica of the content for a plurality of identical objects in the same storage cluster. When the content of a new object is found to match that of an existing object (e.g., by comparing content hashes), the content does not need to be transferred or stored again, which saves storage resources and network bandwidth. The value of de-duplication increases with the number of objects that reference a single physical content replica (e.g., a new 50-gigabyte video can “go viral” through social networking, creating hundreds of thousands of copies in a very short period of time).
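As an illustration only, the following minimal Python sketch shows one way content-hash de-duplication within a single cluster could work. It is not taken from any particular system: the `DedupStore` class, its method names, the in-memory dictionaries, and the choice of SHA-256 as the content hash are all assumptions made for this example.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory store keeping one physical copy per distinct content."""

    def __init__(self):
        self._content = {}   # content hash -> bytes (the single physical replica)
        self._refcount = {}  # content hash -> number of objects referencing it
        self._objects = {}   # object id -> content hash

    def put(self, object_id, data):
        """Store an object; return True only if new physical content was written."""
        if object_id in self._objects:
            raise ValueError(f"object {object_id!r} already exists")
        # SHA-256 is an assumed choice of content hash for this sketch.
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self._content
        if is_new:
            self._content[digest] = data
        # Every object, new or duplicate, adds one reference to the content.
        self._refcount[digest] = self._refcount.get(digest, 0) + 1
        self._objects[object_id] = digest
        return is_new

    def get(self, object_id):
        """Return the content referenced by the given object."""
        return self._content[self._objects[object_id]]

    def delete(self, object_id):
        """Drop one reference; free the content when no object references it."""
        digest = self._objects.pop(object_id)
        self._refcount[digest] -= 1
        if self._refcount[digest] == 0:
            del self._refcount[digest]
            del self._content[digest]

    def physical_bytes(self):
        """Total bytes physically stored, counting each distinct content once."""
        return sum(len(blob) for blob in self._content.values())
```

In this sketch, uploading the same bytes under many object identifiers increases only the reference count, so `physical_bytes()` stays at the size of one copy. A real system would typically compare the hash before the content is sent over the network; the sketch hashes data it has already received and therefore shows only the storage saving.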