The present invention relates to deduplication of data blocks, and more specifically, to a method for deduplicating and managing data blocks within a file system.
File systems typically organize the capacity of an underlying block storage device in data blocks of a fixed size, such as 4 KB. Each file within the file system typically has its own data block(s). Each data block is identified by a 32-bit or 64-bit number starting with zero; this number represents a pointer to the respective data block. Therefore, a conventional file system manages up to 2^32 or 2^64 different data blocks, respectively, which defines the maximum capacity of the file system.
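As an illustration of how the pointer width bounds capacity, the following minimal sketch multiplies the number of addressable blocks by the block size. It assumes the 4 KB example block size given above; the function name max_capacity_bytes is hypothetical and chosen only for this example.

```python
BLOCK_SIZE = 4 * 1024  # 4 KB, the example block size used in the text


def max_capacity_bytes(pointer_bits: int, block_size: int = BLOCK_SIZE) -> int:
    """Largest capacity addressable when block numbers have the given width."""
    # Block numbers start at zero, so 2**pointer_bits distinct blocks exist.
    return (2 ** pointer_bits) * block_size


cap32 = max_capacity_bytes(32)  # 2**32 blocks * 2**12 bytes = 2**44 bytes
cap64 = max_capacity_bytes(64)  # 2**64 blocks * 2**12 bytes = 2**76 bytes

print(cap32 // 2**40, "TiB")  # 16 TiB
print(cap64 // 2**70, "ZiB")  # 64 ZiB
```

With 32-bit block numbers and 4 KB blocks the limit is therefore 16 TiB; other block sizes scale the result proportionally.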
A conventional Unix-like file system uses inodes to store metadata of each file. FIG. 1 illustrates a data structure 1 for inodes of a conventional file system. As shown in FIG. 1, the data structure 1 includes, for example, an i_size field that records the size of the file in bytes. Depending on the file's size, more or fewer data blocks are required to store the file's content. The data structure also includes an i_block[EXT2_N_BLOCKS] array, which is an array of typically fifteen 32-bit numbers that point to the file's associated data blocks. FIG. 2 illustrates a relationship between an inode and data blocks within a conventional file system. As shown in FIG. 2, an inode 10 includes a plurality of elements i_block[0]-i_block[14]. The first twelve elements i_block[0] . . . i_block[11] point to the first twelve data blocks, i.e., direct data blocks 15. The i_block[12] element points to a data block (i.e., an indirect block of pointers) 16 which points to indirect data blocks 18. The data block 16 addressed by element i_block[12] does not contain any file data itself; instead, it includes additional i_block[. . . ] elements which point to additional data blocks 18. The i_block[13] element points to an indirect block of pointers 20, each element of which points to a double indirect block of pointers 22, each element of which in turn points to a double indirect data block 24. In addition, the i_block[14] element points to an indirect block of pointers 26, each element of which points to a double indirect block of pointers 28, each element of which points to a triple indirect block of pointers 29, each element of which points to a triple indirect data block 30. Given a block size of 4 KB and a pointer size of 32 bits (4 bytes), a data block can store 1024 pointers. FIG. 3 illustrates a typical 32-bit pointer 40 which stores a unique 32-bit number addressing a respective data block.
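The addressing scheme described above can be sketched as a function that maps a file's logical block index to the i_block element that reaches it and to the offsets followed inside each block of pointers. This is an illustrative sketch only, assuming a 4 KB block size and 4-byte pointers as in the example above; the function name locate_block and the constant names are hypothetical and do not come from any particular file system implementation.

```python
BLOCK_SIZE = 4096                            # 4 KB block, as in the example
POINTER_SIZE = 4                             # 32-bit pointer
PTRS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE  # 1024 pointers per block
N_DIRECT = 12                                # i_block[0] .. i_block[11]


def locate_block(logical_index: int):
    """Return (i_block slot, offsets followed in each block of pointers).

    An empty offset list means the slot points directly at the data block.
    """
    if logical_index < 0:
        raise ValueError("logical block index must be non-negative")
    if logical_index < N_DIRECT:
        return logical_index, []

    remaining = logical_index - N_DIRECT
    span = PTRS_PER_BLOCK  # data blocks reachable at the current level
    for level in (1, 2, 3):  # single, double, triple indirection
        if remaining < span:
            offsets = []
            for _ in range(level):
                offsets.append(remaining % PTRS_PER_BLOCK)
                remaining //= PTRS_PER_BLOCK
            # Offsets were collected innermost first; reverse to walk order.
            return N_DIRECT + level - 1, offsets[::-1]
        remaining -= span
        span *= PTRS_PER_BLOCK
    raise ValueError("index lies beyond the triple indirect range")


print(locate_block(5))     # (5, [])      direct block
print(locate_block(12))    # (12, [0])    first block via i_block[12]
print(locate_block(1036))  # (13, [0, 0]) first block via i_block[13]
```

Under these assumptions a single inode can address 12 + 1024 + 1024^2 + 1024^3 data blocks before the triple indirect range is exhausted.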
A conventional file system may include up to 2^32 or 2^64 different data blocks, resulting in a maximum file system size of multiple terabytes (TB), and newer file systems provide even larger capacities. A file system may therefore store a large amount of data, and a conventional deduplication method may be performed to reduce the amount of physical space required to store that data.
Data deduplication searches for duplicate data objects, such as blocks, chunks, or files, and discards the duplicates, which may reduce the amount of stored data by a ratio of, for example, 20:1. Once duplicate data is identified, it is replaced by a pointer to a parent copy of the data, thereby reducing the amount of data stored.
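The general idea of replacing duplicates with pointers to a parent copy can be sketched as follows. This is a minimal illustration and not the method of any particular product: it assumes fixed 4 KB blocks and detects duplicates by comparing SHA-256 digests, a technique chosen here only for the example. The names deduplicate and restore are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for this sketch


def deduplicate(data: bytes, block_size: int = BLOCK_SIZE):
    """Split data into blocks, keeping one parent copy per distinct block."""
    store = []     # unique (parent) blocks, in order of first appearance
    index = {}     # digest -> position of the parent copy in store
    pointers = []  # one entry per logical block, pointing into store
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # Assumption: equal digests are treated as equal blocks; a real
        # system might also compare the bytes to rule out collisions.
        digest = hashlib.sha256(block).digest()
        position = index.get(digest)
        if position is None:
            position = len(store)
            store.append(block)
            index[digest] = position
        pointers.append(position)
    return store, pointers


def restore(store, pointers) -> bytes:
    """Rebuild the original data by following each pointer."""
    return b"".join(store[p] for p in pointers)


data = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"A" * BLOCK_SIZE
store, pointers = deduplicate(data)
print(len(store), pointers)  # 2 [0, 1, 0]
```

In this example three logical blocks are stored as two physical blocks; the achievable reduction in practice depends entirely on how much of the data is duplicated.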