Various forms of network-based storage systems exist today, including network attached storage (NAS), storage area networks (SANs), and others. Network storage systems are commonly used for a variety of purposes, such as providing multiple users with access to shared data and backing up critical data (e.g., by data mirroring).
A network-based storage system typically includes at least one storage server, which is a processing system configured to store and retrieve data on behalf of one or more client processing systems (“clients”). In the context of NAS, a storage server is commonly a file server, which is sometimes called a “filer”. A filer operates on behalf of one or more clients to store and manage shared files. The files may be stored in a storage subsystem that includes one or more arrays of mass storage devices, such as magnetic or optical disks or tapes, managed using RAID (Redundant Array of Inexpensive Disks). In that case, the mass storage devices in each array may be organized into one or more separate RAID groups.
In a SAN context, a storage server provides clients with access to stored data at a sub-file level of granularity, such as block-level access, rather than file-level access. Some storage servers are capable of providing clients with both file-level access and block-level access, such as certain filers made by Network Appliance, Inc. (NetApp®) of Sunnyvale, Calif.
In essentially any computing system or data storage system, data can become corrupted or inconsistent with its associated metadata. This is true even for sophisticated, enterprise-level storage servers, which typically employ fairly robust error detection and correction techniques, such as forms of RAID. Although certain levels of RAID provide error detection and correction, data can occasionally become corrupted or inconsistent in a way that may be too severe for RAID software to correct. An example of this is certain types of double disk failure, such as where a second disk fails during an attempt to recover a first failed disk.
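The double disk failure case can be illustrated with a minimal sketch, assuming a single-parity scheme in the style of RAID 4 or RAID 5, in which the parity block is the byte-wise XOR of the data blocks in a stripe. The function names `xor_blocks` and `rebuild` and the sample data are hypothetical and chosen only for illustration; they do not represent any particular RAID implementation.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(stripe, parity):
    """Rebuild the one missing block (marked None) in a stripe.

    Raises ValueError when more than one block is missing, because a
    single parity block carries only enough redundancy for one loss.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) > 1:
        raise ValueError(f"cannot rebuild: {len(missing)} blocks missing")
    if not missing:
        return list(stripe)
    survivors = [block for block in stripe if block is not None]
    repaired = list(stripe)
    # XOR of the surviving data blocks and the parity yields the lost block.
    repaired[missing[0]] = xor_blocks(survivors + [parity])
    return repaired


# Hypothetical stripe of three data blocks plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Single disk failure: the lost block can be reconstructed.
recovered = rebuild([data[0], None, data[2]], parity)

# Double disk failure: a second block is lost before the first is rebuilt.
try:
    rebuild([None, None, data[2]], parity)
    double_failure_recoverable = True
except ValueError:
    double_failure_recoverable = False
```

Under these assumptions, the single failure is repaired from the survivors and the parity, while the second failure leaves too little information to reconstruct either lost block.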
A problem in designing storage servers is how to handle this type of data error or inconsistency. Many if not all storage servers will simply “panic” when they try to read corrupted or inconsistent data from disk. A panic occurs when the storage server is unable to continue operating normally and has to shut down or reboot. A panic frequently also involves a “core dump” prior to shutdown or reboot. The term core dump refers to the creation of a file that represents the complete, unstructured state of the working memory of the storage server at the time of a panic. The file, which is typically called a “core file”, can be transmitted to a remote computer associated with a customer support group at the manufacturer of the storage server, either just prior to shutdown during a panic or immediately upon reboot afterwards.
In many applications, a panic can be much less desirable than occasionally encountering corrupted or inconsistent data. For example, a panic may require client/user sessions to be reset, which may result in users losing important data. The downtime associated with a panic can also be extremely costly and undesirable, especially in large-scale (e.g., enterprise-level) storage systems. Furthermore, panicking is usually not a desirable way to handle data corruption or inconsistency, since the system will likely just panic again the next time it attempts to access the faulty data.