Broadly, this writing discloses cloud-based de-duplication with transport layer transparency.
The term “cloud” is often used as a metaphor for the Internet. Cloud computing may involve many computers. In cloud computing, software, data, services, devices, and other entities reside at servers. Some of the computers may be data de-duplication clients and some may be data de-duplication servers. Many, or even most, of the computers associated with a cloud will be neither de-duplication clients nor de-duplication servers. Data de-duplication may be referred to as “dedupe”.
When a dedupe client sends packets out onto the cloud for dedupe processing, the dedupe client wants one or more dedupe servers to be able to recognize those packets and process them appropriately. Collaborating dedupe clients and servers want to be able to send packets onto the cloud and have the packets accepted by dedupe-enabled collaborators. Collaborating dedupe clients and servers also want those packets to be ignored, with proper results, by computers that are not dedupe-enabled. Collaborating dedupe clients and servers want to cause this selective accepting and ignoring of packets without breaking any existing functionality and without requiring customized, proprietary interfaces or protocols with the cloud.
Universal and/or standard interfaces and protocols already exist for cloud computing. For example, IEEE 802.11 defines networking hardware and rules for communicating with a network. Similarly, the Simple Object Access Protocol (SOAP) defines envelopes, encoding rules, and other standard ways to communicate certain types of messages. A computer may be interacting with many servers in the cloud through these universal and/or standard interfaces and protocols. Dedupe functionality is often added to a client that is already interacting with other services and other servers. Conventionally, adding dedupe functionality may have forced that client to use an interface or protocol other than the universal and/or standard interfaces or protocols. This may be unacceptable in many applications.
One type of conventional dedupe includes chunking a larger data item (e.g., object, file) into sub-blocks, computing hashes or other identifiers for the sub-blocks, and processing the hashes or other identifiers instead of the sub-blocks. Chunking includes selecting boundary locations for fixed and/or variable length sub-blocks, while hashing includes computing a hash of the resulting chunk. A chunk may also be referred to as a sub-block. Comparing relatively smaller hashes (e.g., a 128-bit cryptographic hash) to make a unique/duplicate decision can be more efficient than comparing relatively larger chunks (e.g., 1 KB, 128 KB, 1 MB) of data using a byte-by-byte approach. Regardless of the dedupe particulars (e.g., chunking approach, hashing approach), it may be desirable to engage in collaborative cloud-based dedupe. Collaborative cloud-based dedupe may involve communicating data to be deduped, information about data to be deduped, information about dedupe processing, and so on, between clients and servers using the cloud.
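The chunk-and-hash approach described above can be illustrated with a minimal sketch. It assumes fixed-length sub-blocks, a 128-bit BLAKE2b digest as the identifier, and an in-memory dictionary standing in for a dedupe index. None of these choices, nor the function names `chunk_fixed`, `chunk_id`, `dedupe`, and `rebuild`, come from this writing; they are hypothetical and serve only to show how a unique/duplicate decision may be made by comparing identifiers rather than bytes.

```python
import hashlib


def chunk_fixed(data: bytes, size: int) -> list:
    """Split data into fixed-length sub-blocks; the last one may be shorter."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_id(chunk: bytes) -> bytes:
    """Return a 128-bit identifier for a sub-block.

    Assumption: BLAKE2b with a 16-byte digest stands in for whatever
    hash a real dedupe system would use.
    """
    return hashlib.blake2b(chunk, digest_size=16).digest()


def dedupe(data: bytes, size: int, index: dict) -> tuple:
    """Store each unique sub-block once in `index` (identifier -> bytes).

    Returns (recipe, duplicates): the ordered list of identifiers needed
    to rebuild `data`, and how many sub-blocks were already in the index.
    Equal identifiers are treated as equal data; hash collisions are
    ignored in this sketch.
    """
    recipe = []
    duplicates = 0
    for chunk in chunk_fixed(data, size):
        cid = chunk_id(chunk)
        if cid in index:
            duplicates += 1          # duplicate: only the identifier is kept
        else:
            index[cid] = chunk       # unique: the sub-block is stored once
        recipe.append(cid)
    return recipe, duplicates


def rebuild(recipe: list, index: dict) -> bytes:
    """Reassemble the original data item from its recipe."""
    return b"".join(index[cid] for cid in recipe)


if __name__ == "__main__":
    item = b"A" * 1024 + b"B" * 1024 + b"A" * 1024
    store = {}
    recipe, dups = dedupe(item, 1024, store)
    print(len(recipe), "sub-blocks,", dups, "duplicate,", len(store), "stored")
    assert rebuild(recipe, store) == item
```

In this sketch each lookup compares 16-byte identifiers instead of 1 KB sub-blocks, which is the efficiency argument made above. A variable-length chunker could replace `chunk_fixed` without changing the rest of the flow.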
“Cloud computing” refers to network-based (e.g., Internet-based) computing where shared resources, software, interfaces, and information are provided to computers and other devices on demand. On-demand provision of resources in cloud computing is often compared to providing electricity on demand because, like the electricity grid, a cloud can provide dynamically scalable resources. One model for cloud computing is to have multiple components, each of which does something really well, and all of which work well together. Therefore, adding dedupe functionality on top of pre-existing functionality should not disturb the other functionality.
In cloud-based computing, interactions between entities may be defined by a quality of service (QoS) that is related to a service level agreement. Cloud-based computing customers likely do not own the physical infrastructure they are using to engage in cloud-based computing. Instead, the customers rent, lease, or subscribe to usage from a third-party provider. The customers consume resources (e.g., bandwidth, packets delivered, data stored, processor cycles used) as a service and pay for the resources consumed. The customers may be billed using a utilities model (e.g., electricity), on a subscription basis, or otherwise. Cloud-based computing customers may become interested in data de-duplication to reduce the amount of data stored and transmitted using the cloud. Therefore, cloud-based computing customers may add dedupe functionality to an already functioning configuration. When they add the dedupe functionality, the cloud-based computing customers do not want their existing functionality to break or even to slow down. Conventional proprietary systems have typically required non-transparent interfaces or protocols that have had negative effects on pre-existing configurations.