The present invention relates to data transfer in a wide area network (WAN) environment, and more specifically, the present invention relates to transferring data files between at least two independent file servers connected via a WAN, and an independent file server connected to a WAN configured for transferring data files.
WAN caching, such as Panache by IBM Corp., a parallel wide area network cache for remote file cluster, is able to mask the fluctuating WAN latencies and outages by supporting asynchronous and disconnected-mode operations. It allows for concurrent updates at one file server and at another remote file server, and resolves those updates by using conflict detection techniques to flag and handle conflicts.
A typical WAN caching environment according to the prior art comprises a first file server including a first file storage such as a hard disk drive (HDD) or solid state drive (SSD), and at least one first file system. Such a first file system may provide access to data files stored on the first file storage. The first file server is connected to a first user which reads and writes data files through the at least one first file system in the first file storage. Similarly, a second file server includes a second file storage such as an HDD or an SSD, and at least one second file system. The second file system is able to provide access to data files stored on the second file storage. The second file server is connected to a second user which reads and writes data files through the at least one second file system in the second file storage.
Both file servers are interconnected via a WAN. While a WAN is able to span large distances of hundreds or thousands of miles, its throughput and bandwidth are limited. Therefore, file transfer between the two file servers is subject to these WAN limitations.
In the prior art WAN caching environment, the at least one first file system of the first file server is able to provide access to data files stored in the second file system provided by the second file server without actually having these data files from the second file server present in the first file storage. Thus, the first user connected to the first file server sees all data files which are linked by the at least one first file system to the at least one second file system of the second file server. Only when the first user accesses such a data file is the file transmitted through the WAN from the second file server to the first file server, and only if the data file is not already present in the first file server.
Likewise, data files stored in the first file storage of the first file server may be seamlessly replicated to the second file server via the WAN. If the first file storage of the first file server becomes full, the data files which are already replicated to the second file server may be evicted from the first file storage. However, the first user connected to the first file server still sees these data files, and when one of them is accessed, it is transmitted from the second file server to the first file server.
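The behavior described in the two preceding paragraphs (files visible locally without being present, transfer on first access, and eviction of already-replicated files) can be illustrated with a minimal in-memory sketch. It is not the Panache implementation; the class names `RemoteServer` and `CachingServer`, the count-based `capacity`, and the eviction order are assumptions made purely for illustration.

```python
class RemoteServer:
    """Hypothetical stand-in for the second file server reached over the WAN."""

    def __init__(self):
        self.files = {}    # name -> bytes
        self.fetches = 0   # whole-file transfers served toward the first server

    def read(self, name):
        self.fetches += 1
        return self.files[name]


class CachingServer:
    """Hypothetical stand-in for the first file server with limited storage."""

    def __init__(self, remote, capacity):
        self.remote = remote
        self.capacity = capacity   # assumed limit: number of files held locally
        self.local = {}            # name -> bytes for files present locally
        self.replicated = set()    # names whose current content is on the remote

    def listing(self):
        # The user sees local files and remote files alike.
        return sorted(set(self.local) | set(self.remote.files))

    def write(self, name, data):
        self._make_room(name)
        self.local[name] = data
        self.replicated.discard(name)  # new content is not yet replicated

    def replicate(self):
        # Copy every not-yet-replicated local file to the remote server.
        for name, data in self.local.items():
            if name not in self.replicated:
                self.remote.files[name] = data
                self.replicated.add(name)

    def read(self, name):
        # Transfer over the WAN only if the file is not already present.
        if name not in self.local:
            data = self.remote.read(name)
            self._make_room(name)
            self.local[name] = data
            self.replicated.add(name)
        return self.local[name]

    def _make_room(self, incoming):
        # Evict only files that are already replicated to the remote server.
        while len(self.local) >= self.capacity and incoming not in self.local:
            victim = next((n for n in self.local if n in self.replicated), None)
            if victim is None:
                raise RuntimeError("no replicated file can be evicted")
            del self.local[victim]
```

In this sketch an evicted file remains in `listing()` because its name is still known from the remote server, and a second read of the same file causes no further transfer.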
Thus, WAN caching has the advantage that files may be stored in a remote location and are still visible locally without actually transmitting the files via the WAN. This reduces the bandwidth and throughput required of the WAN and makes the usage of the WAN for file sharing purposes attractive.
Data deduplication according to the prior art has been introduced in a client-server environment, for example by Tivoli Storage Manager Version 6.2 by IBM Corp., and Avamar by EMC Corp. for client-server deduplication. The deduplication process first chunks a data object, such as a data file, and calculates an identity characteristic for each data chunk. Common methods used to calculate an identity characteristic are cryptographic hash functions, whereby each chunk of data is associated with a hash-digest. Cryptographic hash functions include Secure Hash Algorithm (SHA-1, SHA-256, SHA-512, etc.), Message Digest (MD5), etc.
The output of the hash function is called a “digest.” The cryptographic hash function is an “avalanche function” in that a tiny difference (even 1 bit) in the input to the hash function results in a huge (highly non-linear) difference in the digest (output). The second step is to compare the hash-digests, under the assumption that two identical hash-digests correspond to identical chunks of data. In the third step, only the chunks with non-identical hash-digests are stored; chunks with identical hash-digests are stored once, and each duplicate is replaced by a reference to that single stored instance. Thus data deduplication reduces the required storage capacity for a given set of data by eliminating duplicative data.
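The three steps above (chunking, digest calculation and comparison, and single-instance storage) can be sketched as follows. This is an illustrative sketch only: fixed-size chunking, the 4-byte chunk size, SHA-256 as the hash function, and the names `chunk` and `DedupStore` are assumptions, not details taken from any of the products named above.

```python
import hashlib


def chunk(data, size=4):
    """Step 1: split a data object into fixed-size chunks (assumed scheme)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class DedupStore:
    """Hypothetical single-instance chunk store keyed by hash-digest."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes, one instance per digest

    def put(self, data, size=4):
        plan = []  # ordered digests needed to rebuild the data object
        for piece in chunk(data, size):
            digest = hashlib.sha256(piece).hexdigest()
            # Step 2: compare digests; identical digests are assumed to
            # mean identical chunks.
            if digest not in self.chunks:
                # Step 3: store only chunks whose digest is not yet known.
                self.chunks[digest] = piece
            plan.append(digest)
        return plan

    def get(self, plan):
        # Rebuild the data object by following the references in order.
        return b"".join(self.chunks[d] for d in plan)
```

For example, storing `b"ABCDABCDABCDXYZ1"` with 4-byte chunks yields a plan of four references but only two stored chunks, since `ABCD` occurs three times.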
Addressing hash collisions—where one hash-digest references two non-identical chunks of data—may be done by prior art methods including stronger hash algorithms, multiple (nested) hash algorithms per chunk, and binary comparison of the data chunks.
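Of the approaches listed above, binary comparison can be sketched as follows. The store below keeps every distinct chunk that shares a digest and compares bytes before treating two chunks as identical. The names `weak_digest` and `VerifyingStore` are hypothetical, and the one-byte digest is deliberately weak so that collisions are easy to provoke in a demonstration; a real system would use a full-length digest.

```python
import hashlib


def weak_digest(piece):
    # Deliberately weak for illustration: only one byte of SHA-256.
    return hashlib.sha256(piece).digest()[:1]


class VerifyingStore:
    """Hypothetical chunk store that resolves collisions by byte comparison."""

    def __init__(self, digest_fn=weak_digest):
        self.digest_fn = digest_fn
        self.buckets = {}  # digest -> list of distinct chunks sharing it

    def put(self, piece):
        digest = self.digest_fn(piece)
        bucket = self.buckets.setdefault(digest, [])
        for index, stored in enumerate(bucket):
            if stored == piece:  # binary comparison of the chunk contents
                return (digest, index)  # true duplicate: reuse the instance
        # Same digest but different bytes (or a new digest): keep separately.
        bucket.append(piece)
        return (digest, len(bucket) - 1)

    def get(self, ref):
        digest, index = ref
        return self.buckets[digest][index]
```

The reference returned by `put` combines the digest with a position inside the bucket, so two colliding chunks never resolve to the same stored instance.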
In a client-server deduplication environment, the client chunks the data, creates the hash-digests which are the output of the cryptographic hash functions, and sends the hash-digests of the data chunks to the server. The server determines which hash-digests it does not already hold and requests only the data chunks with those non-identical hash-digests. The client or the server stores the construction plan for the entire data based on the chunks. The construction plan describes the location of data chunks within a file. Thus with client-server deduplication, less data is transferred between the client and the server.
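The exchange described above can be sketched as a short in-process simulation. It is a simplified illustration, not the protocol of any named product: the `Server` class, the `client_send` function, fixed 4-byte chunks, SHA-256, and keeping the construction plan on the server side are all assumptions.

```python
import hashlib


def chunk(data, size=4):
    """Split data into fixed-size chunks (assumed chunking scheme)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Server:
    """Hypothetical server holding the hash-digest repository."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes
        self.plans = {}   # file name -> construction plan (ordered digests)

    def missing(self, digests):
        # Report each digest the server does not hold yet, once, in order.
        return [d for d in dict.fromkeys(digests) if d not in self.chunks]

    def store(self, name, plan, new_chunks):
        self.chunks.update(new_chunks)
        self.plans[name] = plan

    def restore(self, name):
        return b"".join(self.chunks[d] for d in self.plans[name])


def client_send(server, name, data, size=4):
    """Send a file and return the number of chunk bytes actually transferred."""
    pieces = chunk(data, size)
    plan = [hashlib.sha256(p).hexdigest() for p in pieces]
    by_digest = dict(zip(plan, pieces))
    wanted = server.missing(plan)                # only digests cross first
    payload = {d: by_digest[d] for d in wanted}  # then only requested chunks
    server.store(name, plan, payload)
    return sum(len(p) for p in payload.values())
```

Sending `b"ABCDEFGHABCD"` and then `b"ABCDWXYZ"` to the same server transfers 8 and then 4 chunk bytes in this sketch, because the repeated `ABCD` chunk is sent only once.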
As an alternative, the data may be deduplicated on a file basis, rather than on a chunk basis. The disadvantage of client-server deduplication is that it only works in one direction. Typically, the server keeps the main hash-digest repository and checks for identical hash-digests. The client does not store the hash-digest information, so when data is transmitted from the server to the client, the data is not deduplicated.