The present invention relates generally to a method for moving data between two or more data storage systems. More specifically, the present invention relates to such a method that is implemented in computer software code running on computer hardware.
The operation of computers is very well known in the art. File systems exist on a computer or across multiple computers, where each computer typically includes data storage, such as a hard disk or disks, random access memory (RAM), and an operating system for executing software code. Software code is typically executed to carry out the purpose of the computer. As part of the execution of the computer code, storage space on the hard disk or disks and RAM is commonly used. Also, data can be stored, either permanently or temporarily, on the hard disk or disks and in RAM. The structure and operation of computers are so well known in the art that they need not be discussed in further detail herein.
In the field of computers and computing, file systems are also very well known in the art to enable the storage of such data as part of the use of the computer. A computer file system is a method for storing and organizing computer files and the data they contain to make it easy to find and access them. File systems may use data storage devices such as hard disks or CD-ROMs and involve maintaining the physical location of the files. They may also provide access to data held by the computer operating system or on a file server by acting as clients for a network protocol (e.g., NFS, SMB, or 9P clients). Also, they may be virtual and exist only as an access method for virtual data.
More formally, a file system is a special-purpose database for the storage, organization, manipulation, and retrieval of data. This database, or table, centralizes the information about which areas belong to files, which are free or possibly unusable, and where each file is stored on the disk. To limit the size of the table, disk space is allocated to files in contiguous groups of hardware sectors called clusters. As disk drives have evolved, the maximum number of clusters has dramatically increased, and so the number of bits used to identify each cluster has grown. For example, the successive major versions of FAT are named after the number of table element bits: 12, 16, and 32. The FAT standard has also been expanded in other ways while preserving backward compatibility with existing software.
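The relationship between table element width and the number of clusters that can be identified can be illustrated with a short calculation. The following is a minimal sketch only; the function name `max_table_entries` is hypothetical, and the figures are theoretical upper bounds rather than the usable cluster counts of any particular FAT version.

```python
# Illustrative only: upper bound on the number of distinct clusters that a
# table element of a given bit width can identify.
# Real FAT variants reserve some values (and FAT32 uses only 28 of its
# 32 bits for cluster numbers), so usable counts are lower than these bounds.

def max_table_entries(bits: int) -> int:
    """Return the number of distinct values representable in `bits` bits."""
    return 2 ** bits

for bits in (12, 16, 32):
    print(f"{bits}-bit table elements: up to {max_table_entries(bits):,} clusters")
```

Each added bit doubles the number of clusters that can be addressed, which is why the table element width grew as disk capacities increased.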
File systems are specialized databases which manage information on digital storage media such as magnetic hard drives. Data is organized using an abstraction called a file, which consists of related data and information about that data (hereafter referred to as metadata). Metadata commonly consists of information such as date of creation, file type, owner, and the like.
The file system provides a name space (or a system) for the unique naming of files. File systems also frequently provide a directory or folder abstraction so that files can be organized in a hierarchical fashion. The abstract notion of files and folders does not represent the actual physical organization of data on the hard disk, only its logical relationships.
Hard disks consist of a contiguous linear array of units of storage referred to as blocks. Blocks are typically all the same size, and each has a unique address used by the disk controller to access the contents of the block for reading or writing. File systems translate their logical organization into the physical layer by designating certain addresses as special or reserved. The blocks at these addresses, often referred to as super-blocks, contain important information about the file system such as file system version, amount of free space, etc. They also contain or point to other blocks that contain structures which describe directory and file objects.
One of the most important activities performed by the file system is the allocation of these physical blocks to file and directory objects. Typically, each file consists of one or more data blocks. If files that contain identical data blocks are stored on the file system, no provision is made to identify these blocks as duplicates and to avoid the allocation of (wasted) space for them.
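The space wasted by duplicate blocks can be illustrated by splitting file contents into fixed-size blocks and counting repeats. The following is a minimal sketch under stated assumptions: the 4096-byte block size and the names `split_blocks` and `wasted_blocks` are hypothetical, and files are represented as in-memory byte strings rather than on-disk structures.

```python
from collections import Counter

BLOCK_SIZE = 4096  # assumed block size in bytes, chosen for illustration


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split file contents into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def wasted_blocks(files: list[bytes], block_size: int = BLOCK_SIZE) -> int:
    """Count blocks that duplicate the contents of another block.

    Without deduplication, each of these would be allocated its own
    physical block even though its contents are already stored.
    """
    counts = Counter(
        block for data in files for block in split_blocks(data, block_size)
    )
    # For every distinct block, all copies beyond the first are redundant.
    return sum(n - 1 for n in counts.values())


# Two files that share some identical block contents.
files = [b"A" * 8192, b"A" * 4096 + b"B" * 4096]
print(wasted_blocks(files))
```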
When data is moved between two or more data storage systems, it is common that the storage space used to store information on both the sending and receiving data storage systems is optimized utilizing a data deduplication technique. Data deduplication is a method in which only unique data is physically kept in a data storage system. However, known deduplication techniques are inefficient and result in the transfer of unnecessary data blocks. For example, in the prior art, unique data is referenced by a unique “fingerprint” derived from the data, often by means of a cryptographic hash function. Deduplication methods compare the fingerprint of each incoming data block to the fingerprints of all existing data blocks. If the incoming data block is unique, it is stored; if it is not unique, it is not stored but is instead added as a reference to the existing unique data block. When data is copied from one system to another, there is a probability that the data being copied is already stored in a data block on the receiving system. This method relates to the process of sending data in a manner in which only unique data blocks are transferred from the sender to the receiver. In this context, unique blocks are blocks that are not already stored on the receiving system (e.g., via prior transactions with other systems).
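The fingerprint comparison described above can be sketched as follows. This is an illustrative sketch only and is not presented as the claimed method: SHA-256 is an assumed choice of cryptographic hash, and the names `fingerprint`, `BlockStore`, and `transfer` are hypothetical. The sender and receiver are modeled in a single process, so the exchange of fingerprints over a network is not shown.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Derive a fingerprint from block contents (SHA-256 is an assumption)."""
    return hashlib.sha256(block).hexdigest()


class BlockStore:
    """Toy deduplicating store: each unique block is physically kept once."""

    def __init__(self) -> None:
        self.blocks: dict[str, bytes] = {}  # fingerprint -> block contents
        self.refs: dict[str, int] = {}      # fingerprint -> reference count

    def put(self, block: bytes) -> bool:
        """Store a block; return True only if new data was physically stored."""
        fp = fingerprint(block)
        is_new = fp not in self.blocks
        if is_new:
            self.blocks[fp] = block
        # A duplicate block only adds a reference to the existing block.
        self.refs[fp] = self.refs.get(fp, 0) + 1
        return is_new


def transfer(blocks: list[bytes], receiver: BlockStore) -> int:
    """Send only blocks the receiver lacks; return the number of blocks sent."""
    sent = 0
    for block in blocks:
        fp = fingerprint(block)
        if fp in receiver.blocks:
            # Already on the receiver: record a reference, send no data.
            receiver.refs[fp] += 1
        else:
            receiver.put(block)
            sent += 1
    return sent


receiver = BlockStore()
receiver.put(b"x")  # block already present from a prior transaction
print(transfer([b"x", b"y", b"y"], receiver))
```

In this sketch only the first copy of `b"y"` is counted as sent; the block `b"x"` and the second copy of `b"y"` are recorded as references on the receiver.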
Therefore, there is a need for a method that avoids sending duplicate blocks, thereby significantly reducing the time and network resources needed to accomplish the data transfer.