A cloud system generally refers to a group of electronically networked computer servers that may provide centralized data storage and online access to services and resources. In some instances, an enterprise based system or network may be implemented as one or more cloud based systems. In some instances, the networked computer servers, databases, and other hardware components in the cloud may be distributed geographically. A large enterprise may have a distributed cloud system with multiple cloud based systems situated at diverse locations. For example, where an enterprise spans nationally or internationally, the enterprise cloud system may be composed of small clouds (small local infrastructures) as well as large cloud networks (e.g., global data centers). Such cloud based systems are typically electronically networked together over one or more communications networks.
Distributed cloud based systems often need to share information across a variety of communication networks (e.g., satellite, Internet), for example, to transfer or synchronize data between the different systems. In some instances, only network infrastructures with limited or fixed bandwidth are available to transfer data between different resources within a cloud system or between cloud systems. Connectivity can also be unreliable, or bandwidth can be inadequate to synchronize a large volume of enterprise data. In addition, disconnected computer infrastructures may be placed in remote locations where large volumes of data cannot be transferred reliably due to bandwidth limitations. Finally, the cost of transferring large amounts of data is not trivial, and reducing the amount of data that must be transferred to keep clouds synchronized will be financially beneficial.
Some methods of data deduplication compare bytes, strings, or arbitrary chunks of data to identify duplicate data. However, this approach fails to take into consideration the content of the artifact or the corpus of data where that artifact resides. An artifact may refer to a document, an image, or any other data object (e.g., shape files, maps). One data management technique is source deduplication, which is the removal of redundancies from data before transmission to the backup target. Source deduplication products may reduce bandwidth and storage usage but increase the workload on the servers and processing elements. Source deduplication compares new blocks of data with previously stored data. If the server has the previously stored data, then the software does not send that data and instead notes that there is a copy of that block of data at that client. If a previous version of a file has already been backed up, the software will compare files and back up any parts of the file it has not seen. Source deduplication is well suited for backing up smaller remote backup sets.
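For illustration only, the block comparison performed in source deduplication may be sketched as follows. This is a minimal sketch under stated assumptions and does not represent any particular product: the fixed 4096-byte block size, the use of SHA-256 digests to identify blocks, and the names `BackupTarget`, `backup`, and `restore` are all hypothetical choices made for the example.

```python
import hashlib

# Assumption: fixed-size blocks. Real products may use variable-size chunking.
BLOCK_SIZE = 4096


def split_blocks(data: bytes, size: int = BLOCK_SIZE):
    """Split data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class BackupTarget:
    """Hypothetical backup server that stores each unique block once, keyed by digest."""

    def __init__(self):
        self.blocks = {}

    def has_block(self, digest: str) -> bool:
        return digest in self.blocks

    def store_block(self, digest: str, block: bytes) -> None:
        self.blocks[digest] = block


def backup(data: bytes, target: BackupTarget):
    """Send only the blocks the target lacks.

    Returns (recipe, bytes_sent), where recipe is the ordered list of block
    digests needed to rebuild the data.
    """
    recipe = []
    bytes_sent = 0
    for block in split_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        if not target.has_block(digest):
            # New block: this is the only case in which data is transmitted.
            target.store_block(digest, block)
            bytes_sent += len(block)
        # Known or new, the recipe records a reference to the block.
        recipe.append(digest)
    return recipe, bytes_sent


def restore(recipe, target: BackupTarget) -> bytes:
    """Rebuild the original data from its recipe of block digests."""
    return b"".join(target.blocks[digest] for digest in recipe)
```

In this sketch, backing up a second version of a file in which only one block has changed transmits only that block, while the unchanged blocks are recorded as references. Note that the comparison operates purely on block bytes and has no knowledge of what the artifact contains.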
A second approach is target deduplication, which is the removal of redundancies from a backup transmission as it passes through an appliance sitting between the source and the backup target (e.g., intelligent disk targets (IDTs), virtual tape libraries (VTLs)). Target deduplication reduces the amount of storage required at the target but does not reduce the amount of data that must be sent across a local area network (LAN) or wide area network (WAN).
Thus, there exists a need to more efficiently compress and transmit information across disparate enterprise systems.