Embodiments relate generally to data storage environments, and, more particularly, to file system replication in data storage systems.
A file system is a collection of files and directories, together with operations on them. To keep track of files, file systems use directories. A directory entry provides the information needed to find the blocks associated with a given file (typically, the directory entry includes an i-number that refers to an i-node, and the i-node includes the information needed to find the file's blocks). Many file systems today are organized as a general hierarchy (e.g., a tree of directories) because a hierarchy gives users the ability to organize their files by creating subdirectories. Each file may be specified by giving the absolute path name from the root directory to the file. Every file system also maintains file attributes, such as each file's owner and creation time, and these attributes must be stored somewhere, such as in a directory entry.
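The path-resolution scheme described above can be illustrated with a minimal in-memory sketch. The `INode` structure, the `resolve` helper, and the sample tree below are all hypothetical illustrations, not part of any particular file system implementation: a directory's entries map names to i-numbers, and each i-node either lists a file's data block addresses or holds further directory entries.

```python
# Hypothetical in-memory sketch: resolving an absolute path by following
# directory entries to i-numbers, then i-nodes to data blocks.
from dataclasses import dataclass, field

@dataclass
class INode:
    is_dir: bool
    blocks: list = field(default_factory=list)   # data block addresses (files)
    entries: dict = field(default_factory=dict)  # name -> i-number (directories)

def resolve(path, inodes, root_inumber=0):
    """Walk an absolute path from the root directory to the target i-node."""
    inum = root_inumber
    for name in filter(None, path.split("/")):
        node = inodes[inum]
        if not node.is_dir:
            raise NotADirectoryError(name)
        inum = node.entries[name]   # directory entry yields the next i-number
    return inodes[inum]

# Tiny example tree: /home/doc.txt is stored in blocks 7 and 8.
inodes = {
    0: INode(is_dir=True, entries={"home": 1}),
    1: INode(is_dir=True, entries={"doc.txt": 2}),
    2: INode(is_dir=False, blocks=[7, 8]),
}
print(resolve("/home/doc.txt", inodes).blocks)  # [7, 8]
```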
A snapshot of a file system captures its content (e.g., files and directories) at an instant in time. A snapshot typically results in two data images: (1) the snapshot data (e.g., pointers, indices, metadata, etc. that record the contents of the file system at that moment in time); and (2) the active data that an application can read and write as soon as the snapshot is created (i.e., the active file system). Snapshots can be taken periodically (e.g., hourly, daily, weekly), on user demand, or at any other useful time or increment. They are useful for a variety of applications, including recovery of earlier versions of a file following an unintended deletion or modification, backup, data mining, and testing of software.
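The two-image model above can be sketched with a toy copy-on-write store. The `ActiveFS` class and its methods are illustrative assumptions, not a real product's API: a snapshot records only block pointers (metadata), while writes to the active file system allocate fresh blocks, so the snapshot continues to reference the old content.

```python
# Hypothetical sketch of snapshot semantics: the snapshot is pointers/metadata
# captured at an instant; the active file system stays writable via copy-on-write.
class ActiveFS:
    def __init__(self):
        self.blocks = {}       # block address -> content
        self.files = {}        # file name -> list of block addresses
        self.next_addr = 0

    def write(self, name, data):
        addr = self.next_addr  # copy-on-write: new data goes to a fresh block
        self.next_addr += 1
        self.blocks[addr] = data
        self.files[name] = [addr]

    def snapshot(self):
        # Snapshot data: a copy of the pointers only, not of the data blocks.
        return {name: list(addrs) for name, addrs in self.files.items()}

fs = ActiveFS()
fs.write("a.txt", "v1")
snap = fs.snapshot()        # records a.txt -> block 0
fs.write("a.txt", "v2")     # active file system now points at block 1
print(fs.blocks[snap["a.txt"][0]])  # "v1": the snapshot still sees the old content
```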
A replica of a file system captures not only the contents of files and directories but also any other information associated with the file system. For example, if a file system has five snapshots, the replica will capture the contents of the active file system's data blocks and data relating to the five snapshots. Once a file system has been replicated, it may be desirable to verify that the replicated data is accurate. Traditional techniques for verifying a replicated file system typically traverse the file tree (e.g., the directory structure) to create fingerprints (e.g., hash checksums) of each file of both the source and replica file systems. The fingerprints can then be compared to detect any differences between the source and replicated files.
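The traditional file-level verification described above can be sketched as follows. The `fingerprints` and `verify` helpers are illustrative names of my own choosing, and SHA-256 stands in for whatever checksum a given implementation uses: each tree is walked, every file is fingerprinted, and mismatched or missing paths are reported.

```python
# Sketch of traditional verification: traverse each file tree, fingerprint every
# file with a hash checksum, and compare source against replica.
import hashlib
import os

def fingerprints(root):
    """Walk the tree under root, returning {relative path: hash checksum}."""
    out = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            with open(full, "rb") as f:
                out[rel] = hashlib.sha256(f.read()).hexdigest()
    return out

def verify(source_root, replica_root):
    """Return the set of relative paths whose fingerprints differ (or are missing)."""
    src, rep = fingerprints(source_root), fingerprints(replica_root)
    return {path for path in src.keys() | rep.keys() if src.get(path) != rep.get(path)}
```

Note that this approach visits every file individually, which foreshadows the traversal cost discussed next.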
These traditional verification techniques can be limited in various ways. One such limitation is that traversing the file tree typically takes an appreciable amount of time and system resources. File-based traversal tends to involve non-sequential disk access and other costly operations. This can be resource-intensive, particularly in file systems having complex trees, large numbers of small files, sparse files, etc. Another such limitation is that file-level verification typically cannot detect inaccurate space allocations unless each snapshot of the file system is independently verified. For example, the file path may not include an indication of which blocks are allocated to which snapshots. Iterating separately over each snapshot can involve considerable redundancy and other inefficiencies.