The present invention relates generally to a method for the management of meta-data needed to perform data deduplication in data storage systems. More specifically, the present invention relates to such a method that is implemented in computer software code running on computer hardware.
The operation of computers is very well known in the art. File systems exist on a computer or across multiple computers, where each computer typically includes data storage, such as a hard disk or disks, random access memory (RAM) and an operating system for executing software code. Software code is typically executed to carry out the purpose of the computer. As part of the execution of the computer code, storage space on the hard disk or disks and RAM are commonly used. Also, data can be stored, either permanently or temporarily, on the hard disk or disks and in RAM. The structure and operation of computers are so well known in the art that they need not be discussed in further detail herein.
In the field of computers and computing, file systems are also very well known in the art to enable the storage of such data as part of the use of the computer. A computer file system is a method for storing and organizing computer files and the data they contain to make it easy to find and access them. File systems may use data storage devices such as hard disks or CD-ROMs and involve maintaining the physical location of the files, or they might provide access to data on a file server by acting as clients for a network protocol (e.g., NFS, SMB, or 9P clients). Also, they may be virtual and exist only as an access method for virtual data.
More formally, a file system is a special-purpose database for the storage, organization, manipulation, and retrieval of data. This database or table centralizes the information about which areas belong to files, which areas are free or possibly unusable, and where each file is stored on the disk. To limit the size of the table, disk space is allocated to files in contiguous groups of hardware sectors called clusters. As disk drives have evolved, the maximum number of clusters has dramatically increased, and so the number of bits used to identify each cluster has grown. For example, the successive major versions of the FAT file system are named after the number of table element bits: 12, 16, and 32. The FAT standard has also been expanded in other ways while preserving backward compatibility with existing software.
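The relationship between table element width and addressable space can be sketched with simple arithmetic. The following is an illustrative upper bound only: the 4096-byte cluster size and the function names are assumptions, and real FAT variants reserve some table values (FAT32, for instance, uses only 28 of its 32 bits for the cluster number), so actual limits are lower.

```python
# Illustrative upper bound only; CLUSTER_SIZE is an assumed value,
# and real FAT variants reserve some table values.
CLUSTER_SIZE = 4096  # bytes per cluster (assumption for this sketch)


def max_entries(bits: int) -> int:
    """Number of distinct values a table element of `bits` bits can hold."""
    return 2 ** bits


def upper_bound_bytes(bits: int, cluster_size: int = CLUSTER_SIZE) -> int:
    """Upper bound on addressable space if every value named one cluster."""
    return max_entries(bits) * cluster_size


# Upper bounds for the three table element widths named in the text.
bounds = {bits: upper_bound_bytes(bits) for bits in (12, 16, 32)}

for bits, size in bounds.items():
    print(f"{bits}-bit elements: at most {max_entries(bits)} clusters, "
          f"{size} bytes")
```

Each additional bit in the table element doubles the number of clusters that can be identified, which is why the element width grew along with disk capacity.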
File systems are specialized databases, which manage information on digital storage media such as magnetic hard drives. Data is organized using an abstraction called a file, which consists of related data and information about that data (hereinafter referred to as metadata). Metadata commonly consists of information such as date of creation, file type, owner, and the like.
The file system provides a name space (or a system) for the unique naming of files. File systems also frequently provide a directory or folder abstraction so that files can be organized in a hierarchical fashion. The abstract notion of files and folders does not represent the actual physical organization of data on the hard disk, only its logical relationships.
Hard disks consist of a contiguous linear array of units of storage referred to as blocks. Blocks are typically all the same size, and each has a unique address used by the disk controller to access the contents of the block for reading or writing. File systems translate their logical organization into the physical layer by designating the blocks at certain addresses as special or reserved. These blocks, often referred to as super-blocks, contain important information about the file system such as file system version, amount of free space, etc. They also contain or point to other blocks that contain structures, which describe directory and file objects.
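The block addressing described above can be sketched as follows. This is a minimal illustration, not any particular file system's layout: the 512-byte block size, the in-memory `io.BytesIO` stand-in for a disk, the helper names, and the "SUPR" marker placed at address 0 are all assumptions made for the example.

```python
import io

BLOCK_SIZE = 512  # bytes per block (assumed for this sketch)


def read_block(device, address: int, block_size: int = BLOCK_SIZE) -> bytes:
    """Read the fixed-size block stored at the given block address."""
    device.seek(address * block_size)  # block address -> byte offset
    data = device.read(block_size)
    if len(data) != block_size:
        raise ValueError(f"block {address} is outside the device")
    return data


def write_block(device, address: int, data: bytes,
                block_size: int = BLOCK_SIZE) -> None:
    """Overwrite the block at the given block address."""
    if len(data) != block_size:
        raise ValueError("data must be exactly one block long")
    device.seek(address * block_size)
    device.write(data)


# A simulated 8-block disk, initially all zeros.
disk = io.BytesIO(bytes(BLOCK_SIZE * 8))

# Address 0 is treated as a reserved block holding a hypothetical marker.
write_block(disk, 0, b"SUPR" + bytes(BLOCK_SIZE - 4))
# Address 3 holds ordinary file data.
write_block(disk, 3, b"\x01" * BLOCK_SIZE)
```

Because every block has the same size, a block address maps to a byte offset by a single multiplication, which is what lets reserved blocks be found at fixed, known addresses.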
One of the most important activities performed by the file system is the allocation of these physical blocks to file and directory objects. Typically, each file consists of one or more data blocks. If files that contain identical data blocks are stored on the file system, no provision is made to identify these blocks as duplicates and to avoid allocating (and thereby wasting) space for them.
Data deduplication is a method in which only unique data is physically kept in a data storage system. The unique data is referenced by a unique “fingerprint” derived from the data, often by means of a cryptographic hash function. Deduplication methods compare the fingerprint of each incoming data block to the fingerprints of all existing data blocks. If the incoming data block is unique, it is stored. If it is not unique, it is not stored again; instead, a reference to the existing unique data block is added.
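The general deduplication scheme described above can be sketched as follows. This is a minimal in-memory illustration of the prior-art idea, not the method of the present invention: the class name `DedupStore`, the choice of SHA-256 as the fingerprint, and the use of a Python dictionary as the index are all assumptions made for the example.

```python
import hashlib


class DedupStore:
    """Toy in-memory block store that keeps one copy of each unique block."""

    def __init__(self):
        self.blocks = {}     # fingerprint -> block bytes (unique data only)
        self.refcounts = {}  # fingerprint -> number of references to it

    def write(self, block: bytes) -> str:
        """Store a block if it is new; otherwise only add a reference."""
        # SHA-256 is an assumed choice of cryptographic hash function.
        fingerprint = hashlib.sha256(block).hexdigest()
        if fingerprint in self.blocks:
            # Duplicate: keep the existing copy and count one more reference.
            self.refcounts[fingerprint] += 1
        else:
            # Unique: store the data once.
            self.blocks[fingerprint] = block
            self.refcounts[fingerprint] = 1
        return fingerprint

    def read(self, fingerprint: str) -> bytes:
        """Return the unique block that a fingerprint refers to."""
        return self.blocks[fingerprint]


store = DedupStore()
# Three incoming blocks, the first and third of which are identical.
refs = [store.write(b) for b in (b"A" * 4096, b"B" * 4096, b"A" * 4096)]
unique_count = len(store.blocks)
```

In this sketch, three blocks are written but only two are physically kept; the first and third writes return the same fingerprint. The dictionary lookup in `write` stands in for the index search on which every incoming block depends.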
However, in the prior art, a core problem exists relating to the index search needed to determine whether a block is unique or a duplicate. As can be understood, such a search becomes more complex as the number of unique blocks in the storage system increases.
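A rough calculation suggests how quickly such an index can grow. The figures below are assumptions chosen only for illustration (4096-byte blocks, a 32-byte fingerprint and an 8-byte block address per index entry, no per-entry overhead); they do not describe any particular system.

```python
# All sizes are assumed values for illustration only.
BLOCK_SIZE = 4096        # bytes per data block
FINGERPRINT_SIZE = 32    # bytes per fingerprint (e.g. a 256-bit hash)
ADDRESS_SIZE = 8         # bytes per stored block address


def index_entries(unique_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of index entries needed for the given amount of unique data."""
    return unique_bytes // block_size


def index_bytes(unique_bytes: int) -> int:
    """Size of a flat index holding one fingerprint and address per block."""
    return index_entries(unique_bytes) * (FINGERPRINT_SIZE + ADDRESS_SIZE)


TIB = 1024 ** 4
GIB = 1024 ** 3

entries_per_tib = index_entries(TIB)
index_gib_per_tib = index_bytes(TIB) / GIB
print(f"{entries_per_tib} entries, {index_gib_per_tib} GiB of index per TiB")
```

Under these assumptions the index grows linearly with the amount of unique data, so holding the whole index in RAM becomes costly as the storage system grows.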
The method of the present invention relates to the organization of the meta-data in a search index of data blocks needed to accomplish this search more efficiently.
In view of the foregoing problems, there is a need to minimize the amount of RAM needed to accomplish the search.
There is also a need to maximize the performance of the search.
There is yet a further need to ensure that the meta-data used in the search is transactionally secure.