Conventionally, data de-duplication (“dedupe”) may be performed to reduce the amount of data stored on a standalone computer. Significant data reductions have been achieved with conventional dedupe. While there are many standalone computers that benefit from conventional dedupe, there are many computers that do not stand alone. Computers may be arranged in networks, in client/server configurations, and may even participate in cloud-based computing. Dedupe may be performed in these non-standalone configurations.
When computers can communicate, computers may engage in collaborative processing. Computers may even engage in collaborative dedupe. In collaborative dedupe, some processing may be performed at clients that want data to be deduped and some processing may be performed at servers that dedupe the data. Additionally, some data may be stored at clients while some data may be stored at servers. Traditionally, the majority of the processing was performed at the server and the overwhelming majority of the data was stored at or by the server. Processing includes chunking, hashing, sampling, making a determination whether a chunk is unique, making a determination whether a chunk is similar to another chunk, searching indexes, and so on. Stored data includes the raw data, index data, metadata, and so on.
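The processing steps listed above (chunking, hashing, and determining whether a chunk is unique by searching an index) can be illustrated with a minimal sketch. The fixed 4 kB chunk size, the MD5 hash, and the in-memory dictionary index below are assumptions made for the example, not a description of any particular dedupe system.

```python
import hashlib

def dedupe_store(data: bytes, index: dict, chunk_size: int = 4096) -> int:
    """Chunk data into fixed-size sub-blocks, hash each sub-block, and
    store only chunks whose hash is not already in the index.
    Returns the number of bytes actually stored."""
    stored = 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        digest = hashlib.md5(chunk).hexdigest()  # 128-bit identifier
        if digest not in index:      # unique/duplicate determination
            index[digest] = chunk    # unique: store the raw chunk
            stored += len(chunk)
    return stored
```

In a collaborative arrangement, the work could be split so that a client performs the chunking and hashing and transmits only the digests, while a server searches the index and requests only the chunks it has not seen.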
Conventionally, if collaborative dedupe was attempted, the collaborating computers would dedupe according to a set of pre-established rules and protocols. The set of rules and protocols may have been designed with certain assumptions in mind. While this assumption-based approach may have enabled initial, generic dedupe collaborations, the assumptions may have yielded sub-optimal performance in some configurations and no improvement at all in others. For example, sub-optimal performance may have been observed in some computers acting in client/server pairs or in some computers engaged in cloud-based dedupe. Similarly, if collaborative dedupe was attempted, the collaborating computers would perform pre-defined roles according to a pre-defined distribution of work. Once again, the roles and distribution may have been designed with certain assumptions in mind, and those assumptions may have led to sub-optimal performance.
One type of conventional dedupe includes chunking a larger data item (e.g., object, file) into sub-blocks, sampling the sub-blocks or computing an identifier (e.g., hash) for the sub-blocks, and processing the samples or identifiers instead of the sub-blocks. Chunking includes selecting boundary locations for fixed- and/or variable-length sub-blocks. Hashing includes computing a hash of the resulting chunk. Sampling includes selecting a subset of the resulting chunk. A chunk may also be referred to as a sub-block. Comparing relatively smaller hashes (e.g., a 128-bit cryptographic hash) to make a unique/duplicate decision can be more efficient than comparing relatively larger chunks (e.g., 1 kB, 128 kB, 1 MB) of data using a byte-by-byte approach. Regardless of the dedupe particulars (e.g., chunking approach, hashing approach), it may be desirable to engage in collaborative cloud-based dedupe. Collaborative cloud-based dedupe may involve communicating data to be deduped, computing, recalling, or communicating information about data to be deduped, communicating information about dedupe processing, and so on, between clients and servers using the cloud.
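The boundary-selection step for variable-length sub-blocks can be sketched with a simple rolling hash: a boundary is declared wherever the hash of a sliding window matches a bit pattern, so boundaries depend on content rather than position. The window size, mask, and minimum chunk length below are illustrative assumptions, not parameters of any particular dedupe system.

```python
def chunk_boundaries(data: bytes, window: int = 16,
                     mask: int = 0x3F, min_len: int = 32) -> list:
    """Select variable-length chunk boundaries using a rolling sum
    over a sliding window; cut where the hash's low bits match the mask."""
    cuts, last, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # slide the window forward
        # declare a boundary only after a minimum chunk length
        if i + 1 - last >= min_len and (rolling & mask) == mask:
            cuts.append(i + 1)
            last = i + 1
    if not cuts or cuts[-1] != len(data):
        cuts.append(len(data))  # final (possibly short) chunk
    return cuts
```

Because boundaries are content-defined, inserting bytes near the start of a data item shifts only nearby boundaries; later chunks realign on the same content and can still be recognized as duplicates.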
The term “cloud” is often used as a metaphor for the Internet. “Cloud computing” refers to network-based (e.g., Internet-based) computing where shared resources, software, interfaces, and information are provided to computers and other devices on demand. On-demand provision of resources in cloud computing is often compared to providing electricity on demand because, like the electricity grid, a cloud can provide dynamically scalable resources. In cloud computing, software, data, services, devices, and other entities reside at servers that can be accessed through a communication mechanism (e.g., network).
One model for cloud computing is to have multiple servers, each of which performs a particular function well, and all of which work well together. Thus, multiple dedupe servers may be available, and different dedupe servers may be optimized for different functions. However, initial, conventional collaborative dedupe may not have recognized that different dedupe servers could provide different services, with different costs under different conditions (e.g., latency, error rate, security).
In cloud-based computing, customers may rent, lease, or subscribe for usage from a third-party provider. The customers consume resources (e.g., bandwidth, packets delivered, data stored, processor cycles used) as a service and pay for the resources consumed. The customers may be billed using a utilities model (e.g., electricity), on a subscription basis, or otherwise. Cloud-based computing customers may become interested in data de-duplication to reduce the amount of data stored and transmitted using the cloud.