Deduplication is a process for removing redundant data during data backup operations. In particular, if two saved objects are duplicates of each other, only one of the objects is stored, thus reducing the total amount of data stored. Deduplication has become ubiquitous in capacity-optimized storage systems. It relies on comparing byte patterns (data chunks) against stored copies and replacing each redundant chunk with a reference pointer to the identical stored chunk.
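By way of illustration only, the chunk comparison and pointer replacement described above can be sketched as follows. The sketch assumes fixed-size chunks and uses a SHA-256 fingerprint as a stand-in for byte-pattern comparison; the names `DedupStore`, `CHUNK_SIZE`, `write`, and `read` are hypothetical and do not describe any particular deduplication system.

```python
import hashlib

# Assumption: fixed-size chunking with a tiny chunk size, for illustration only.
CHUNK_SIZE = 4


class DedupStore:
    """Hypothetical chunk store that keeps one copy of each distinct chunk."""

    def __init__(self):
        # Maps a chunk fingerprint to the single stored copy of that chunk.
        self.chunks = {}

    def write(self, data: bytes) -> list:
        """Store data and return its recipe: a list of reference pointers."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            # Assumption: the fingerprint stands in for a byte-pattern comparison.
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if no identical chunk is already stored.
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        return recipe

    def read(self, recipe: list) -> bytes:
        """Rebuild the original data by following the reference pointers."""
        return b"".join(self.chunks[fingerprint] for fingerprint in recipe)


store = DedupStore()
first = store.write(b"AAAABBBBAAAA")   # three chunks, two of them identical
second = store.write(b"BBBBCCCC")      # one chunk already stored, one new
```

In this sketch the two writes cover five chunks in total, but only three distinct chunks are held in `store.chunks`; the remaining two are represented purely by reference pointers in the recipes.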
Deduplication processes may affect the operation of the file systems that organize data on the storage media. Databases and file systems commonly use a B-tree structure, as it is optimized for systems that read and write large blocks of data. A B+Tree (like other variants of the standard B-tree) keeps data sorted on disk and allows updates, deletions, insertions, and lookups (searches) of records in logarithmic time. Each record in a B+Tree is associated with a key. Unrelated records of a B+Tree are generally stored at different locations on disk, so accessing B+Tree records results in random access. However, if neighboring pages of a B+Tree are adjacent to each other on disk and are accessed sequentially, there is no random read penalty, which results in faster B+Tree scans. In a deduplication system, B+Tree scans may result in random reads. Even in a non-deduplication system, if neighboring leaf pages are not stored contiguously on disk, scans of the B+Tree will be random; read-ahead of contiguous blocks by the lower layers provides no benefit in this scenario, because adjacent nodes of the B+Tree are not contiguous on disk. B+Tree scans in such cases therefore incur random reads, which may make the scans slow.
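The difference between the two layouts can be made concrete with a small sketch. It assumes a simplified model in which each leaf page occupies one disk block and a read counts as random whenever its block does not immediately follow the previously read block; the function name `count_random_reads` and the block addresses are hypothetical and chosen only for illustration.

```python
def count_random_reads(leaf_block_addresses):
    """Count the reads in a leaf scan that do not continue from the prior block.

    leaf_block_addresses lists the disk block of each leaf page in key order,
    which is the order in which a B+Tree scan visits the leaves.
    """
    random_reads = 0
    previous = None
    for address in leaf_block_addresses:
        # The first read, and any read that is not at previous + 1,
        # requires repositioning and is counted as a random read.
        if previous is None or address != previous + 1:
            random_reads += 1
        previous = address
    return random_reads


# Hypothetical layouts for five leaf pages visited in key order.
contiguous_layout = [100, 101, 102, 103, 104]
scattered_layout = [100, 57, 312, 8, 101]

contiguous_cost = count_random_reads(contiguous_layout)  # 1
scattered_cost = count_random_reads(scattered_layout)    # 5
```

Under this model, the contiguous layout needs only the initial positioning read, after which lower-layer read-ahead could serve the rest, whereas the scattered layout pays a random read for every leaf page.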
What is needed, therefore, is a system and method of speeding scans of B+Trees in deduplication systems.