A data block may refer to a sequence of bytes that is stored on a storage volume and has a nominal length, referred to as the block size. A file system may continually read data blocks from, and write data blocks to, the storage volume. A file system may require that each data block have a fixed block size, such as 4 kilobytes (KB). The continual reading and writing of the data blocks may cause the storage volume to accumulate duplicate data blocks. The duplicate data blocks may be eliminated from the storage volume through a process referred to as deduplication. A storage volume may only require one instance of a data block to be stored on the physical storage drive; each additional instance of the data block may be replaced with a pointer to the original data block. By eliminating duplicate data blocks and referencing the original data block instead, immediate benefits may be obtained in the form of storage space efficiencies.
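The replacement of duplicate data blocks with pointers may be illustrated with a minimal sketch. The sketch assumes that a SHA-256 fingerprint serves as the pointer to the single stored instance; the names `deduplicate`, `store`, and `pointers` are hypothetical and are not taken from any particular file system.

```python
import hashlib

BLOCK_SIZE = 4096  # 4 KB, matching the example block size in the text


def deduplicate(blocks):
    """Keep one instance of each distinct block; return (store, pointers).

    Assumption: a SHA-256 fingerprint stands in for the pointer to the
    original data block.
    """
    store = {}     # fingerprint -> the single stored instance of the block
    pointers = []  # one fingerprint per logical block, in original order
    for block in blocks:
        fingerprint = hashlib.sha256(block).hexdigest()
        store.setdefault(fingerprint, block)  # store only the first instance
        pointers.append(fingerprint)          # duplicates become references
    return store, pointers


blocks = [b"A" * BLOCK_SIZE, b"B" * BLOCK_SIZE, b"A" * BLOCK_SIZE]
store, pointers = deduplicate(blocks)

# The logical sequence of blocks can be rebuilt by following the pointers.
restored = [store[p] for p in pointers]
```

In this sketch three logical blocks are represented by two stored blocks, and the third block is held only as a reference to the first.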
In one example, a backup operation may be performed on a dataset on a periodic basis, such as on a daily basis. The dataset may store data in fixed-size data blocks, such as 4 KB data blocks. The dataset may not change significantly between backup operations, and therefore each backup may include a large amount of duplicate data. Significant storage space efficiencies may be obtained by eliminating the duplicate data from each backup. However, if the dataset is slightly modified, e.g., if data is inserted in the middle of the dataset, the data contained within each data block following the insertion point may be shifted. Since the data contained within each of these data blocks has changed, the modified data blocks may no longer be duplicates of stored data blocks, and may not be able to be deduplicated. Thus, although the dataset includes a significant amount of duplicate data, a deduplication operation may not be able to effectively deduplicate the dataset.
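The effect of a small insertion on fixed-size data blocks may be illustrated with a short sketch. The sketch assumes a dataset of sixteen 4 KB blocks of pseudo-random bytes and a single byte inserted at the midpoint; the helper names `split_fixed` and `fingerprints` are hypothetical.

```python
import hashlib
import random

BLOCK_SIZE = 4096  # 4 KB fixed block size


def split_fixed(data, size=BLOCK_SIZE):
    """Split data into consecutive fixed-size blocks (the last may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints(data):
    """Return the set of SHA-256 fingerprints of the fixed-size blocks."""
    return {hashlib.sha256(block).digest() for block in split_fixed(data)}


# Assumption: pseudo-random bytes stand in for a dataset with no internal
# duplicates, so any match below comes from the previous backup.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(16 * BLOCK_SIZE))

# Insert a single byte in the middle of the dataset.
mid = len(original) // 2
modified = original[:mid] + b"\x00" + original[mid:]

# Blocks before the insertion point still match; every block after it is
# shifted by one byte and no longer matches any stored block.
shared = len(fingerprints(original) & fingerprints(modified))
total = len(split_fixed(original))
print(f"{shared} of {total} original blocks are still duplicates")
```

Although the two datasets differ by a single byte, only the blocks ahead of the insertion point remain eligible for deduplication.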
Since the dataset contains a large amount of duplicate data, it may also be desirable to identify and eliminate the duplicate data in real-time as the data is received, such that the duplicate data is never written to the storage volume. The process of deduplicating a dataset in real-time may be referred to as inline deduplication, whereas the process of writing a dataset to a storage volume and then deduplicating the dataset may be referred to as post-processing deduplication. In inline deduplication, a backup operation may communicate a dataset through a data stream to a deduplication operation. The deduplication operation may deduplicate the data stream as it is received and write any new data to the storage volume. In order to compare the incoming data stream to stored data blocks, the deduplication operation may split the data stream into segments of the same length as the stored data blocks. For example, if the stored data blocks are a fixed size, such as 4 KB, the data stream may be split into 4 KB segments in order to be deduplicated. However, the deduplication operation may not be able to account for small shifts in the data stream if the data stream is split into fixed-length segments. As a result, the inline deduplication operation may fail to effectively deduplicate the data when there are shifts in the data stream, and in turn may fail to prevent duplicate data from being written to the storage volume. Accordingly, there may be a need for a more effective operation for deduplicating data on a fixed block size file system.
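Inline deduplication with fixed-length segments, and its sensitivity to a shift in the data stream, may be sketched as follows. The sketch assumes that a Python dictionary keyed by SHA-256 fingerprint stands in for the storage volume and that `io.BytesIO` stands in for the data stream; the function name `inline_deduplicate` is hypothetical.

```python
import hashlib
import io

BLOCK_SIZE = 4096  # 4 KB, the fixed size of the stored data blocks


def inline_deduplicate(stream, volume, block_size=BLOCK_SIZE):
    """Split a stream into fixed-length segments as it is received and
    write only previously unseen segments to the volume.

    Assumption: `volume` is a dict mapping fingerprint -> stored segment.
    Returns the number of segments actually written.
    """
    written = 0
    while True:
        segment = stream.read(block_size)
        if not segment:  # end of the data stream
            break
        fingerprint = hashlib.sha256(segment).digest()
        if fingerprint not in volume:
            volume[fingerprint] = segment  # new data: write it
            written += 1
        # otherwise the segment is a duplicate and nothing is written
    return written


volume = {}
backup = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE

first = inline_deduplicate(io.BytesIO(backup), volume)    # all segments new
second = inline_deduplicate(io.BytesIO(backup), volume)   # all duplicates
# One extra byte at the front shifts every fixed-length segment boundary.
shifted = inline_deduplicate(io.BytesIO(b"X" + backup), volume)
```

In this sketch the unchanged second backup writes nothing, whereas the shifted stream, which differs by one byte, causes every one of its segments to be written again.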