The present invention relates generally to a method for the allocation of data on physical media by a file system that eliminates duplicate data. The present invention relates to such a method that is implemented in computer software code running on computer hardware.
The operation of computers is very well known in the art. Such a file system exists on a computer or across multiple computers, where each computer typically includes data storage, such as one or more hard disks, random access memory (RAM) and an operating system for executing software code. Software code is typically executed to carry out the purpose of the computer. As part of the execution of the computer code, storage space on the hard disk or disks and in RAM is commonly used. Also, data can be stored, either permanently or temporarily, on the hard disk or disks and in RAM. The structure and operation of computers are so well known in the art that they need not be discussed in further detail herein.
In the field of computers and computing, file systems are also very well known in the art to enable the storage of such data as part of the use of the computer. A computer file system is a method for storing and organizing computer files and the data they contain to make it easy to find and access them. File systems may use data storage devices such as hard disks or CD-ROMs and involve maintaining the physical location of the files, or they might provide access to data on a file server by acting as clients for a network protocol (e.g., NFS, SMB, or 9P clients). Also, they may be virtual and exist only as an access method for virtual data.
More formally, a file system is a special-purpose database for the storage, organization, manipulation, and retrieval of data. This database, or table, centralizes the information about which areas of the disk belong to files, which are free or possibly unusable, and where each file is stored on the disk. To limit the size of the table, disk space is allocated to files in contiguous groups of hardware sectors called clusters. As disk drives have evolved, the maximum number of clusters has dramatically increased, and so the number of bits used to identify each cluster has grown. For example, the successive major versions of the FAT file system are named after the number of table element bits: 12, 16, and 32. The FAT standard has also been expanded in other ways while preserving backward compatibility with existing software.
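The relationship between table element width and the number of clusters that can be identified may be illustrated with a short sketch. The function names and the 4096-byte cluster size are assumptions made for illustration only, and the figures are upper bounds: actual FAT implementations reserve some table values, so fewer clusters are usable in practice.

```python
def max_table_entries(bits: int) -> int:
    """Upper bound on distinct cluster identifiers for a given element width."""
    return 2 ** bits


def addressable_bytes(bits: int, cluster_size: int) -> int:
    """Upper bound on the space reachable with the given element width."""
    return max_table_entries(bits) * cluster_size


if __name__ == "__main__":
    CLUSTER_SIZE = 4096  # hypothetical cluster size in bytes
    for bits in (12, 16, 32):
        print(bits, max_table_entries(bits), addressable_bytes(bits, CLUSTER_SIZE))
```

Each additional bit in the table element doubles the number of clusters that can be identified, which is why the element width has grown along with disk capacity.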
File systems are specialized databases which manage information on digital storage media such as magnetic hard drives. Data is organized using an abstraction called a file, which consists of related data and information about that data (hereinafter referred to as metadata). Metadata commonly consists of information such as date of creation, file type, owner, and the like.
The file system provides a name space (or a system) for the unique naming of files. File systems also frequently provide a directory or folder abstraction so that files can be organized in a hierarchical fashion. The abstract notion of files and folders does not represent the actual physical organization of data on the hard disk, only its logical relationships.
Hard disks consist of a contiguous linear array of units of storage referred to as blocks. Blocks are all typically the same size, and each has a unique address used by the disk controller to access the contents of the block for reading or writing. File systems translate their logical organization into the physical layer by designating the blocks at certain addresses as special or reserved. These blocks, often referred to as super-blocks, contain important information about the file system such as file system version, amount of free space, etc. They also contain or point to other blocks that contain structures which describe directory and file objects.
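The linear block addressing described above can be sketched as simple integer arithmetic. The 4096-byte block size and the function names are assumptions for illustration; real devices and file systems choose their own block sizes and reserved addresses.

```python
BLOCK_SIZE = 4096  # hypothetical block size in bytes


def block_address(byte_offset: int) -> int:
    """Address of the block that holds the given byte offset."""
    return byte_offset // BLOCK_SIZE


def block_range(byte_offset: int, length: int) -> range:
    """Addresses of every block touched by `length` bytes starting at `byte_offset`."""
    if length <= 0:
        return range(0)
    first = byte_offset // BLOCK_SIZE
    last = (byte_offset + length - 1) // BLOCK_SIZE
    return range(first, last + 1)
```

For example, under these assumptions a 200-byte write starting at byte offset 4000 spans the boundary between the first two blocks and therefore touches both of them.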
One of the most important activities performed by the file system is the allocation of these physical blocks to file and directory objects. Typically, each file consists of one or more data blocks. If files containing identical data blocks are stored on the file system, no provision is made to identify these blocks as duplicates and thereby avoid the allocation of (wasted) space for them. The present invention relates to a method, using an algorithm implemented in software processing steps in a computer, for determining if a new block of data is a duplicate.
In the prior art, there is a well known method that is used to determine if two data blocks are identical without the exhaustive comparison of each bit in the data block. This is commonly referred to as “compare on hash”. A hash is a mathematical function that produces a fixed length bit sequence that, for practical purposes, uniquely identifies any variable length input data. Hash functions are commonly used in cryptography to generate digital signatures that change if a data buffer differs by even one bit from the original buffer used to generate the hash. The size of the hash code, in bits, is called the digest size. The larger the digest size, the more resistant the hash algorithm is to random collision, which is the creation of matching hashes from data blocks which do not match.
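A minimal sketch of compare on hash follows. SHA-256 from the Python standard library is used purely as an illustrative stand-in, since the text above does not name a particular hash function; the helper names and the 4096-byte block size are likewise assumptions.

```python
import hashlib


def digest(block: bytes) -> bytes:
    """Fixed-length hash code for a block of any length (SHA-256, 256-bit digest)."""
    return hashlib.sha256(block).digest()


def same_by_hash(a: bytes, b: bytes) -> bool:
    """Treat two blocks as identical when their hash codes match."""
    return digest(a) == digest(b)


block_a = b"\x00" * 4096
block_b = b"\x00" * 4096
block_c = b"\x00" * 4095 + b"\x01"  # differs from block_a in one bit of the last byte

print(same_by_hash(block_a, block_b))
print(same_by_hash(block_a, block_c))
```

Only the fixed-length hash codes are compared, so the cost of the comparison itself does not grow with the block size once the hashes have been computed.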
The present invention relates to a method that requires a cryptographic quality hash with a digest size of at least 192 bits. There is a need to compute the hash for each data block written to the file system so that hash values can be compared to determine if data blocks, such as ones that are very large in size, are equivalent.
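The general idea of computing a 192-bit hash for each written block and using it to detect duplicates may be sketched as follows. This is an illustration under stated assumptions, not the method of the invention: BLAKE2b truncated to a 24-byte digest is simply one readily available cryptographic hash with a 192-bit digest size, the class and method names are hypothetical, and an in-memory dictionary stands in for whatever search structure an actual file system would use.

```python
import hashlib

DIGEST_BYTES = 24  # 24 bytes = 192 bits


def block_digest(block: bytes) -> bytes:
    """192-bit hash code of a data block (BLAKE2b, an illustrative choice)."""
    return hashlib.blake2b(block, digest_size=DIGEST_BYTES).digest()


class DedupStore:
    """Toy block store that allocates a physical block only for new data."""

    def __init__(self) -> None:
        self.blocks: list[bytes] = []       # simulated physical blocks
        self.index: dict[bytes, int] = {}   # hash code -> physical block address

    def write(self, block: bytes) -> int:
        """Return the address holding `block`, allocating only if it is not a duplicate."""
        key = block_digest(block)
        address = self.index.get(key)
        if address is None:
            address = len(self.blocks)
            self.blocks.append(block)
            self.index[key] = address
        return address
```

Writing the same block contents twice returns the same address and consumes only one physical block, while a block with different contents is allocated a new address.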
The prior art suffers from the disadvantage that it must compare the full hash code to determine whether a new data block is a duplicate block or not. As can be understood, this is particularly problematic with large digest sizes, such as those that are 192 bits in length.
To address these problems associated with file systems of non-trivial (i.e., very large) size, highly optimized search structures are needed. As can be understood, inefficient search structures will cause significant degradation of file system performance as more blocks, and hence more searches, are managed by the system. The prior art fails to provide such an optimized search structure and method. Therefore, there is a need for a more efficient search algorithm and a better way for the hash data to be stored to enable more efficient searching and, in turn, to realize faster and more efficient determination of whether a new data block is duplicate data.
In view of the foregoing, there is a need to provide a method for the allocation of data on physical media by a file system that eliminates duplicate data.
There is a need for a more efficient and optimized search structure.
There is a need for a more efficient search algorithm to reduce I/O load on a system.
There is also a need for a method of determining whether a new data block is a duplicate that can better handle large hash files.
There is a further need to provide a method that can better store hash values for more efficient searching.
Yet another need is to provide a method that can reduce the number of search operations when determining whether a new data block is a duplicate.
There is also a need to reduce the time for determining whether a new data block is a duplicate.