Data transfer is a common problem, and the rapid growth in data generation in recent years has made it more pressing. Data must be moved from one place to another so that it can be processed, stored for further analysis, or backed up. Traditionally, data files are transferred using techniques that analyze their data blocks. In some industries and scientific applications, data files are hierarchical and contain variable-value pairs; examples of such formats are HDF4, HDF5, NetCDF, and GRIB.
BitTorrent is based on the concept of “segmented file transfer”, in which the original file is transferred from a variety of sources in chunks of fixed size. Implementations of BitTorrent may use a technology called SET (Similarity Enhanced Transfer) to accelerate downloads. SET finds copies of files that are similar to the file requested by the user and looks for subsets of those copies that match subsets of the requested file. If similar copies are found, they can be used as additional download sources. The technique used by SET is called handprinting: remote files are divided into chunks of dynamic size (e.g. using Rabin fingerprinting), each chunk is hashed, and a few selected hashes are inserted into a global lookup table. To find similar files, a receiver obtains the chunk hashes for its desired file and searches for matches in the global lookup table. A match indicates that the remote file(s) can be used as additional download source(s).
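The handprinting idea can be illustrated with a minimal Python sketch. It is not SET's actual implementation: a simple polynomial rolling hash stands in for Rabin fingerprinting, the handprint is assumed to be the k smallest chunk hashes, and all names and constants (`chunk`, `handprint`, `publish`, `find_similar`, `WINDOW`, `MASK`) are hypothetical.

```python
import hashlib
from collections import defaultdict

WINDOW = 16            # bytes covered by the rolling hash (assumed value)
MASK = (1 << 6) - 1    # cut when the low 6 bits are zero, about 1 in 64 positions
BASE = 257
MOD = (1 << 31) - 1

def chunk(data: bytes) -> list[bytes]:
    """Split data at content-defined boundaries (stand-in for Rabin fingerprinting)."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)
    for i, byte in enumerate(data):
        if i >= WINDOW:
            # Drop the byte that just left the window.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        # Enforce a minimum chunk length of WINDOW bytes before cutting.
        if i + 1 - start >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk
    return chunks

def chunk_hashes(data: bytes) -> set[str]:
    return {hashlib.sha1(c).hexdigest() for c in chunk(data)}

def handprint(data: bytes, k: int = 4) -> list[str]:
    """Assumed selection rule: keep the k smallest chunk hashes."""
    return sorted(chunk_hashes(data))[:k]

# Global lookup table: chunk hash -> names of files whose handprint contains it.
lookup: dict[str, set[str]] = defaultdict(set)

def publish(name: str, data: bytes) -> None:
    for h in handprint(data):
        lookup[h].add(name)

def find_similar(data: bytes) -> set[str]:
    """Receiver side: look up every chunk hash of the desired file."""
    sources: set[str] = set()
    for h in chunk_hashes(data):
        sources |= lookup.get(h, set())
    return sources
```

Because only a few hashes per file are published, the lookup table stays small, while a receiver that checks all of its chunk hashes is still likely to hit the handprint of any file that shares a substantial amount of content with the desired one.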
Deduplication uses a technique similar to that of SET. Hashes are computed over chunks of dynamic size for a set of files. Chunks with the same hash are then saved only once in the destination storage device, thus saving storage space. A rich software infrastructure must be built on top of this functionality in order to keep track of the number of references to each chunk. It is worth noting that a file may be sliced into hundreds or thousands of chunks, depending on the parameters used by the hashing algorithm.
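The reference-counting bookkeeping mentioned above can be sketched as follows. This is a simplified illustration, not a description of any particular deduplication product: for brevity it slices files into fixed-size chunks rather than dynamically sized ones, and the class `ChunkStore` and its methods are hypothetical names.

```python
import hashlib

class ChunkStore:
    """Toy deduplicating store: each distinct chunk is kept once and reference-counted."""

    def __init__(self, chunk_size: int = 4):
        self.chunk_size = chunk_size  # fixed-size slicing is a simplifying assumption
        self.chunks = {}              # chunk hash -> chunk bytes (stored once)
        self.refs = {}                # chunk hash -> number of references
        self.files = {}               # file name -> ordered list of chunk hashes

    def put(self, name: str, data: bytes) -> None:
        if name in self.files:
            self.delete(name)         # release the references held by the old version
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            piece = data[i:i + self.chunk_size]
            h = hashlib.sha256(piece).hexdigest()
            if h not in self.chunks:
                self.chunks[h] = piece        # first occurrence: store the bytes
            self.refs[h] = self.refs.get(h, 0) + 1
            recipe.append(h)
        self.files[name] = recipe

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[h] for h in self.files[name])

    def delete(self, name: str) -> None:
        for h in self.files.pop(name):
            self.refs[h] -= 1
            if self.refs[h] == 0:             # no file uses this chunk any more
                del self.refs[h]
                del self.chunks[h]
```

Storing `b"AAAABBBBCCCC"` and `b"AAAABBBBDDDD"` with a chunk size of 4 keeps four distinct chunks instead of six; deleting the first file frees only the chunk that no other file references.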
Rsync is used to transfer files from a sender to a receiver computer. The sender prepares a file list that includes pathnames, ownership, access mode, permissions, size, and modification time stamp. A checksum can also be included in the file list. The file list is sent to the receiver computer, which checks whether the pathnames contained in the list exist in the local file system. The modification time stamp, size, and checksum are used to determine whether files can be skipped. If a file does not exist, or if it exists but is outdated (or incomplete), then it is not eligible for skipping. If the file does not exist at the destination, the sender sends it in its entirety. If partial data exists at the destination, the sender transfers only the differences between that partial file (or an old copy of the file) and the current version of the file at the sender. The process works by defining a block size (which may vary according to the file size), which is used to “slice” the file. The checksum of each slice is calculated, and slices with the same checksum on both the sender and receiver sides are skipped. Slices with different checksums are transferred from the sender to the receiver, and the old data at the receiver side is overwritten with the new data.
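The block comparison described above can be sketched in Python as follows. This is a simplified model of the description in this section rather than rsync's actual code: it compares blocks only at identical offsets, whereas the real rsync algorithm also uses a rolling checksum so that matching blocks can be found at any offset. The function names and the choice of SHA-256 are assumptions made for illustration.

```python
import hashlib

def block_checksums(data: bytes, block_size: int) -> list[str]:
    """Receiver side: checksum of each slice of the old (or partial) file."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def make_delta(new: bytes, old_sums: list[str], block_size: int) -> list[tuple[int, bytes]]:
    """Sender side: collect (block index, bytes) for slices that differ or are missing."""
    delta = []
    for idx, i in enumerate(range(0, len(new), block_size)):
        block = new[i:i + block_size]
        same = idx < len(old_sums) and hashlib.sha256(block).hexdigest() == old_sums[idx]
        if not same:
            delta.append((idx, block))   # only differing slices are transferred
    return delta

def apply_delta(old: bytes, delta: list[tuple[int, bytes]],
                block_size: int, new_len: int) -> bytes:
    """Receiver side: overwrite the differing slices and fix the file length."""
    buf = bytearray(old[:new_len].ljust(new_len, b"\0"))
    for idx, block in delta:
        start = idx * block_size
        buf[start:start + len(block)] = block
    return bytes(buf)
```

With a block size of 4, an old copy `b"aaaabbbbcccc"`, and a new version `b"aaaaXXXXccccdd"`, only the second slice and the short trailing slice are transferred; the two unchanged slices are skipped.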