Most computer-based processing systems today include some form of file system to manage and organize stored data. A file system is a structure of stored data and associated metadata, which is often hierarchical in nature. A file system can be based on the notion of “files” as the basic units of data upon which it operates; alternatively, a file system can be based on other basic units of data instead of, or in addition to, files, such as sub-file-level data blocks. Thus, the term “file system” as used herein is not limited to a system that is capable of managing data in the form of files per se.
A network storage controller is an example of a type of processing system that includes a file system. This form of processing system is commonly used to store and retrieve data on behalf of one or more hosts on a network. A storage server is a type of storage controller that operates on behalf of one or more clients on a network, to store and manage data in a set of mass storage devices, such as magnetic or optical storage-based disks or tapes. Some storage servers are designed to service file-level requests from hosts (clients), as is commonly the case with file servers used in a network attached storage (NAS) environment. Other storage servers are designed to service block-level requests from hosts, as with storage servers used in a storage area network (SAN) environment. A “block” in this context is the smallest unit of contiguous data that can be addressed in a file system. Still other storage servers are capable of servicing both file-level requests and block-level requests, as is the case with certain storage servers made by NetApp, Inc. of Sunnyvale, Calif.
Almost any file system can experience occasional data errors or corruption, particularly large-scale file systems such as those used in modern storage servers. The larger and more complex the system is, the more likely it is to experience such errors. Consequently, network storage systems and other types of processing systems are usually equipped with some form of tool to detect and, when possible, fix such errors in a file system. Typically the tool is implemented in software.
Examples of such a tool include the UNIX-based fsck program and the chkdsk command on Microsoft Windows®-based systems. These tools typically execute while the file system being verified is off-line, i.e., while the file system is not available for servicing of client requests.
Some file system check tools are implemented in network storage systems. In such cases, the tool is normally invoked by an administrative user, such as a storage network administrator. Examples of the functions performed by such a tool include block allocation testing to make sure that each block is properly allocated and that pointers to the block are proper. Such a tool also typically looks for other file system corruptions, such as inconsistencies in inodes and space accounting (e.g., incorrect block counts).
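The kinds of checks described above can be illustrated with a minimal sketch. It is not the implementation of any particular tool; the `Inode` structure, the `check_file_system` function, and the representation of allocation state as a set of block numbers are all hypothetical and chosen only for illustration. The sketch tests that each block pointer refers to a valid, allocated block that is not also claimed by another inode, that each inode's recorded block count matches its pointers, and that no allocated block is left unreferenced.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Inode:
    """Hypothetical, simplified inode used only for this illustration."""
    number: int
    block_pointers: List[int]
    block_count: int  # block count recorded in the inode's metadata


def check_file_system(inodes: List[Inode],
                      allocated_blocks: Set[int],
                      total_blocks: int) -> List[str]:
    """Return a list of human-readable descriptions of detected problems."""
    problems = []
    owner = {}  # block number -> inode number that first referenced it

    for inode in inodes:
        for block in inode.block_pointers:
            # Pointer check: the block number must exist in the file system.
            if not 0 <= block < total_blocks:
                problems.append(
                    f"inode {inode.number}: pointer to out-of-range block {block}")
                continue
            # Allocation check: a referenced block must be marked allocated.
            if block not in allocated_blocks:
                problems.append(
                    f"inode {inode.number}: pointer to unallocated block {block}")
            # Ownership check: a block should not be claimed twice.
            if block in owner:
                problems.append(
                    f"block {block}: referenced by inodes "
                    f"{owner[block]} and {inode.number}")
            else:
                owner[block] = inode.number

        # Space accounting check: recorded count must match the pointers.
        if inode.block_count != len(inode.block_pointers):
            problems.append(
                f"inode {inode.number}: block count {inode.block_count} "
                f"does not match {len(inode.block_pointers)} pointers")

    # Allocated blocks that no inode references are reported as leaked.
    for block in sorted(allocated_blocks - set(owner)):
        problems.append(f"block {block}: allocated but not referenced")

    return problems
```

A file system with no inconsistencies yields an empty list; each inconsistency contributes one entry that an administrative user could review.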
At least one prior art file system check tool implemented in a network storage server can recommend to the user remedial changes to the file system to fix any detected errors, and enables the user to approve or disapprove the changes before committing them to the file system. However, this tool requires the entire storage server to be off-line while the checking and fixing processes are being executed on any data volume in the storage server. In a storage system which stores large amounts of data, such as an enterprise network storage system, these processes can take hours. It is very undesirable for data to remain inaccessible to users for such long periods of time. In addition, this tool uses a very large amount of random access memory (RAM) while it runs.
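The approve-or-disapprove behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the prior art tool's actual design: the `ProposedFix` record, the `commit_approved_fixes` function, and the modeling of the file system as a dictionary keyed by block number are hypothetical. The point shown is only that each recommended change is held back until the user approves it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ProposedFix:
    """Hypothetical record of one recommended remedial change."""
    block: int
    new_value: str
    reason: str


def commit_approved_fixes(fs_state: Dict[int, str],
                          proposals: List[ProposedFix],
                          approve: Callable[[ProposedFix], bool]
                          ) -> List[ProposedFix]:
    """Apply only the fixes the user approves; return those committed.

    `fs_state` stands in for the file system, and `approve` stands in
    for the user's decision on each recommended change.
    """
    committed = []
    for fix in proposals:
        if approve(fix):
            fs_state[fix.block] = fix.new_value  # commit the change
            committed.append(fix)
        # A disapproved fix leaves the file system untouched.
    return committed
```

In an interactive setting, `approve` might prompt the administrative user for each change; here it is an ordinary callable so that the decision logic can be supplied by the caller.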
Another prior art file system consistency check tool implemented in a network storage server is able to save changes (error fixes) and status information to persistent storage (e.g., disks) while it runs, thereby consuming less RAM. It is also designed so that the volumes of the storage server remain online (i.e., accessible for servicing client requests) for at least part of the time that the tool runs. An aggregate is a group of physical storage. The entire aggregate that contains the volume being checked is taken offline temporarily, so all volumes in that aggregate are unavailable during that period; however, the aggregate can go back online after the tool completes an initial remount phase, and all volumes become available at that time.
This tool also can determine changes that are needed to fix detected errors. However, it does not enable the user to approve or disapprove the remedial changes before they are committed to the file system; rather, the remedial changes are automatically committed. That is because this tool uses a built-in “consistency point” process in the storage server to commit the changes. A consistency point is an event (typically a recurring event) at which new or modified data buffered in RAM is committed to allocated locations in persistent storage to preserve a consistent image of the file system.
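The consistency point concept can be illustrated with a minimal sketch, assuming a simplified model in which persistent storage and the RAM buffer are both dictionaries keyed by block number. The `BufferedStore` class and its method names are hypothetical and do not represent any storage server's actual implementation. The sketch shows that writes accumulate in RAM and reach persistent storage only at a consistency point, at which time every buffered change is committed without any opportunity for the user to approve or disapprove it.

```python
from typing import Dict, Optional


class BufferedStore:
    """Hypothetical model of RAM-buffered writes with consistency points."""

    def __init__(self) -> None:
        self.persistent: Dict[int, str] = {}  # stands in for on-disk blocks
        self.dirty: Dict[int, str] = {}       # new or modified data in RAM

    def write(self, block: int, data: str) -> None:
        """Buffer a write in RAM; persistent storage is not touched yet."""
        self.dirty[block] = data

    def read(self, block: int) -> Optional[str]:
        """Return the newest data for a block, preferring buffered data."""
        if block in self.dirty:
            return self.dirty[block]
        return self.persistent.get(block)

    def consistency_point(self) -> int:
        """Commit all buffered data to persistent storage.

        Returns the number of blocks committed. Everything buffered is
        written; nothing is held back for review.
        """
        committed = len(self.dirty)
        self.persistent.update(self.dirty)
        self.dirty.clear()
        return committed
```

Because a tool that relies on this mechanism places its fixes in the same buffer as ordinary writes, the fixes are committed along with everything else at the next consistency point.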