In a deduplicated storage system, files are stored in the form of deduplicated segments. Segments are stored inside containers, segment references are stored inside metadata segments, and an index stores the mapping from fingerprints to the container identifiers (IDs) of the containers that hold the corresponding segments. A segment is called a live segment, and a segment reference is called a live reference, if it can be reached from the file system's directory namespace. Two conditions must hold: 1) the file system maintains a one-to-one mapping between the (segment, container) pairs and the index entries; and 2) by the definition of a live reference, the segment must exist if there is a live reference to it; otherwise there is data loss.
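The two conditions can be illustrated with a minimal sketch. The representation here is an assumption made for illustration only: containers are modeled as a dictionary from container ID to a set of fingerprints, the index as a dictionary from fingerprint to container ID, and `check_conditions` is a hypothetical helper name, not part of any real file system.

```python
def check_conditions(containers, index, live_refs):
    """Check the two consistency conditions on a toy model.

    containers: dict mapping container ID -> set of segment fingerprints (assumed layout)
    index:      dict mapping fingerprint -> container ID (assumed layout)
    live_refs:  iterable of fingerprints reachable from the namespace
    Returns (mapping_errors, missing_segments).
    """
    # Condition 1: the (segment, container) pairs and the index entries
    # must match one to one, so their symmetric difference must be empty.
    stored = {(fp, cid) for cid, fps in containers.items() for fp in fps}
    indexed = {(fp, cid) for fp, cid in index.items()}
    mapping_errors = stored ^ indexed

    # Condition 2: every live reference must point at a segment that exists.
    stored_fps = {fp for fp, _ in stored}
    missing_segments = {fp for fp in live_refs if fp not in stored_fps}
    return mapping_errors, missing_segments


containers = {1: {"a", "b"}, 2: {"c"}}
index = {"a": 1, "b": 1, "c": 2}
print(check_conditions(containers, index, ["a", "c"]))  # (set(), set())
print(check_conditions(containers, index, ["a", "d"]))  # (set(), {'d'})
```

A non-empty second set corresponds to the data loss case described above: a live reference whose segment cannot be found.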
Inconsistency between the index, segments, and their references can occur in the file system due to hardware faults or software bugs. The file system automatically performs logical file verification whenever a file is closed after write operations. It traverses the metadata (e.g., the segment tree) in a depth-first manner to verify the above conditions. In addition to the logical file verification procedure, the file system also periodically computes a checksum over the entire index and compares it against the segment checksum. The file system also computes the checksum of all the live references at each segment tree level and compares it against the checksum of all the segments referenced at the next segment tree level (e.g., the child level). However, the file system does not have enough memory to include the actual data segments in this procedure.
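The level-by-level checksum comparison can be sketched as follows. The text does not specify the checksum function, so this example assumes an order-independent checksum: the sum, modulo 2^64, of a 64-bit digest of each distinct fingerprint. The names `fp_checksum` and `level_is_consistent` are hypothetical.

```python
import hashlib

MASK_64 = (1 << 64) - 1


def fp_checksum(fingerprints):
    """Order-independent checksum over the distinct fingerprints (assumed scheme)."""
    total = 0
    for fp in set(fingerprints):  # duplicates are counted once
        digest = hashlib.sha1(fp.encode("utf-8")).digest()
        total = (total + int.from_bytes(digest[:8], "big")) & MASK_64
    return total


def level_is_consistent(parent_refs, child_segments):
    """Compare the references held at one level with the segments at the child level."""
    return fp_checksum(parent_refs) == fp_checksum(child_segments)


print(level_is_consistent(["a", "b", "a"], ["b", "a"]))  # True
print(level_is_consistent(["a", "b"], ["a"]))            # False
```

Only two running totals are kept in memory, regardless of how many references a level holds. Matching checksums indicate consistency with high probability rather than with certainty, since distinct sets can in principle collide.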
There are also other means of catching data corruption, e.g., replication, locality repair, or direct user access. These mechanisms cannot be relied upon, as they might not be enabled at all. Logical file verification traverses the segment tree of a file and verifies consistency on a file-by-file basis. This depth-first traversal of the segment tree can result in very slow random disk input/output (I/O). Furthermore, duplication can cause file verification to walk the same segments over and over again. Because of these issues, the current file verification can lag behind by weeks or even months.
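The repeated walks can be seen in a small sketch of per-file depth-first verification. The tree layout (a dictionary from a segment to its children) and the name `verify_files_naively` are assumptions for illustration; the real on-disk structures are not described here.

```python
from collections import Counter


def verify_files_naively(tree, file_roots):
    """Walk each file's segment tree depth first and count visits per segment.

    tree:       dict mapping a segment to the list of its child segments (assumed layout)
    file_roots: root segment of each file to verify
    """
    visits = Counter()
    for root in file_roots:
        stack = [root]
        while stack:
            node = stack.pop()
            visits[node] += 1  # each visit stands in for one disk read
            # Push children in reverse so they are popped in their original order.
            stack.extend(reversed(tree.get(node, ())))
    return visits


# Two files share the same metadata segment m1 and its data segments.
tree = {"f1": ["m1"], "f2": ["m1"], "m1": ["d1", "d2"]}
visits = verify_files_naively(tree, ["f1", "f2"])
print(visits["m1"], visits["d1"], visits["d2"])  # 2 2 2
```

Every segment shared between the two files is read once per file, so the verification cost grows with the logical size of the files rather than with the amount of unique data stored.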
To verify the second condition above, it is possible to enumerate all the live segment references and the segments, but they might not all fit into available memory. This document describes an in-memory-only solution that can verify the second condition with high probability.