Various techniques have been used for failure containment and recovery in a file server. These techniques include error handling routines, background diagnostic routines, background error scrubbing routines, redundant disk storage, redundant processors, redundant memory, redundant data paths, and redundant power supplies.
During periods of reduced activity, background diagnostic routines load memory and disk with test data and confirm that each is operating correctly. The background diagnostic routines also build a baseline of information that is used for tracking trends, and predicting and preventing failures.
During periods of reduced activity, background error scrubbing routines periodically read and re-write all areas of memory and storage in order to detect potential errors. Any small or “soft” error is detected and corrected. If a large or “hard” error is detected, the defective area of memory or storage is “fenced off” by re-mapping of the erroneous memory segment or sector. If the problem is significant for a particular memory board or disk drive, then the memory board or disk drive is taken out of service, and all usable data is copied from the memory board or disk drive to a “hot spare.”
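The decision logic of a scrubbing pass can be illustrated with a minimal sketch. Everything here is a hypothetical simplification rather than a description of any particular file server: the function name `scrub_pass`, the status strings "ok", "soft", and "hard", and the `retire_threshold` parameter are assumptions chosen only to show how soft errors, fenced-off sectors, and retirement to a hot spare relate to one another.

```python
def scrub_pass(read_results, spare_sectors, retire_threshold=3):
    """Decide what to do after one read-and-rewrite pass over a device.

    read_results: dict of sector number -> "ok", "soft", or "hard"
                  (hypothetical outcome of reading and re-writing each sector)
    spare_sectors: unused sector numbers available for re-mapping
    retire_threshold: assumed number of hard errors that counts as "significant"
    Returns (corrected, remap, retire).
    """
    corrected = []          # sectors whose soft errors were fixed by the re-write
    remap = {}              # defective sector -> spare sector ("fenced off")
    spares = list(spare_sectors)
    hard_count = 0

    for sector in sorted(read_results):
        status = read_results[sector]
        if status == "soft":
            corrected.append(sector)
        elif status == "hard":
            hard_count += 1
            if spares:
                remap[sector] = spares.pop(0)

    # Retire the device if hard errors are numerous or spares ran out;
    # usable data would then be copied to a hot spare.
    retire = hard_count >= retire_threshold or len(remap) < hard_count
    return corrected, remap, retire


results = {0: "ok", 1: "soft", 2: "hard", 3: "ok"}
corrected, remap, retire = scrub_pass(results, spare_sectors=[100, 101])
```

In this example the single soft error is corrected, sector 2 is re-mapped to spare sector 100, and the device stays in service because only one hard error was found.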
A preferred arrangement of redundant disk storage is known as RAID-5. RAID-5 is described in Patterson et al., “A case for redundant arrays of inexpensive disks (RAID),” ACM SIGMOD International Conference on Management of Data, Chicago, Ill., 1-3 Jun. 1988, pages 109-116. In practice, a RAID-5 set of disk drives typically includes either four or eight disk drives. The storage of each disk drive is called a “physical disk.” Each disk drive contains data tracks and parity tracks. Each parity track in a disk drive contains parity computed by the exclusive-OR (XOR) of data tracks striped across the other disk drives in the RAID set, including a respective data track in each of the other disk drives. If a single disk drive in the RAID-5 set fails, then the RAID-5 set can operate with the surviving members. Each track on the failed disk drive is reconstructed from data or parity in the other disk drives of the RAID-5 set by exclusive-OR'ing the other tracks in the same stripe.
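The parity computation and single-drive reconstruction described above can be sketched as follows. This is an illustration only: the helper names `xor_blocks`, `compute_parity`, and `reconstruct_missing` are hypothetical, short byte strings stand in for tracks, and the sketch ignores how parity is rotated among the drives of a real RAID-5 set.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    blocks = list(blocks)
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of bytes per byte position in the stripe.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def compute_parity(data_blocks):
    """Parity for a stripe is the XOR of its data blocks."""
    return xor_blocks(data_blocks)


def reconstruct_missing(surviving_blocks):
    """Rebuild the one missing block by XOR'ing every surviving block
    (data and parity) in the same stripe."""
    return xor_blocks(surviving_blocks)


# Hypothetical stripe across a four-drive set: three data blocks plus parity.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = compute_parity(data)

# Suppose the drive holding data[1] fails; rebuild its block from the rest.
rebuilt = reconstruct_missing([data[0], data[2], parity])
```

Because XOR is its own inverse, the same operation serves both to generate parity and to recover any single missing block of the stripe.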
It is unlikely but possible for a RAID-5 set to return a bad sector media error in response to a request to read from or write to a specified disk sector. Such an error indicates a loss of at least one sector of data, because errors have occurred not only in the specified disk sector but also in at least one other sector in the same stripe across the RAID-5 set. Therefore, the requested data can neither be read from the specified sector nor reconstructed from sectors in the other disk drives of the RAID-5 set.
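The condition that produces such an error can be sketched in a few lines. The names `MediaError` and `read_sector`, and the representation of a stripe as a list of byte strings with a set of unreadable positions, are assumptions made for illustration; they do not correspond to any real driver interface.

```python
class MediaError(Exception):
    """Hypothetical stand-in for a bad sector media error."""


def read_sector(stripe, bad, index):
    """Read the block at position `index` of a stripe.

    stripe: list of equally sized byte strings (data blocks plus one parity block)
    bad:    set of positions in the stripe whose sectors are unreadable
    """
    if index not in bad:
        return stripe[index]            # normal read succeeds

    others = [i for i in range(len(stripe)) if i != index]
    if any(i in bad for i in others):
        # A second bad sector in the same stripe makes reconstruction impossible.
        raise MediaError(f"sector {index} is unreadable and cannot be reconstructed")

    # Exactly one bad sector: rebuild it by XOR'ing the rest of the stripe.
    out = bytearray(len(stripe[others[0]]))
    for i in others:
        for k, byte in enumerate(stripe[i]):
            out[k] ^= byte
    return bytes(out)


# Two data blocks and their parity (0x0f ^ 0x33 = 0x3c, 0xf0 ^ 0x55 = 0xa5).
stripe = [b"\x0f\xf0", b"\x33\x55", b"\x3c\xa5"]

recovered = read_sector(stripe, bad={0}, index=0)    # reconstructed from the others
try:
    read_sector(stripe, bad={0, 2}, index=0)          # two bad sectors in one stripe
    double_failure_reported = False
except MediaError:
    double_failure_reported = True
```

With one bad sector the read is served transparently from the surviving blocks; only when a second sector in the same stripe is also bad does the error reach the caller.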
Typically, a bad sector media error is returned to the requesting application and is also reported to a system administrator. Since the error indicates a disk drive failure, the application should not continue to access the disk drive, and it may be prevented from doing so by continuing to receive the bad sector media error in response to any repeated request for access. The system administrator will attempt to find a backup copy of the LUN, file system, or dataset being accessed by the application and, if a backup copy is found, will restore it into replacement storage. The application is then manually restarted on the restored backup copy in the replacement storage. If the application has been maintaining a transaction log of changes since creation of the backup copy, then the transaction log can be replayed in order to bring the restored copy to the state existing at the time of the media error. In this fashion, it may be possible for the application to fully recover from the media error.
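The restore-and-replay sequence can be illustrated with a minimal sketch. It assumes, purely for illustration, that the dataset is a key-value mapping and that the transaction log is an ordered list of ("put" or "delete", key, value) records; the function name `recover` and this log format are hypothetical and are not taken from any particular application.

```python
import copy


def recover(backup, log):
    """Restore a backup copy and replay the changes logged after it was taken."""
    # Restoring into "replacement storage": work on a copy, leave the backup intact.
    restored = copy.deepcopy(backup)
    for op, key, value in log:          # replay in the original order
        if op == "put":
            restored[key] = value
        elif op == "delete":
            restored.pop(key, None)
        else:
            raise ValueError(f"unknown log operation: {op!r}")
    return restored


backup = {"a": 1, "b": 2}               # state when the backup was created
log = [("put", "b", 3), ("put", "c", 4), ("delete", "a", None)]
state = recover(backup, log)            # state as of the media error
```

Full recovery in this sketch depends on the log covering every change made since the backup; a gap in the log would leave the restored copy at an earlier state.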