The use of secondary storage systems to provide online storage for computer processing systems, separate from the primary or main memory of the computer processing system, is well known. Examples of current secondary storage systems include magnetic disk drives, optical disk drives, magnetic tape drives, solid state disk drives and bubble memories. Typically, secondary storage systems have much larger memory capacities than the primary memory of a computer processing system; however, the access to data stored on most secondary storage systems is sequential, not random, and the access rates for secondary storage systems can be significantly slower than the access rate for primary memory. As a result, individual bytes of data or characters of information are usually stored in a secondary storage system as part of a larger collective group of data known as a file.
Generally, files are stored in accordance with one or more predefined file structures that dictate exactly how the information in the file will be stored and accessed in the secondary storage system. In most computer processing systems, the operating system program will have a file control program that includes a group of standard routines to perform certain common functions with respect to reading, writing, updating and maintaining the files as they are stored on the secondary storage system in accordance with the predefined file structure that organizes the storage of both control information and data information. As used within the present invention, the term file system will refer collectively to the file structure and file control program.
One of the problems with secondary storage systems is how to ensure the integrity of files stored on the secondary storage system in the event of an unscheduled hard stop of the computer processing system. An unscheduled hard stop occurs, for example, when a power failure causes a system crash or when the computer processing system must be powered down due to an unexpected reset. The problem of file integrity arises because of the access latency between the time that a user program issues a request, for example an update request for a file, and the time that the data for the file and the control structure for the file are actually written to the secondary storage device. If an unscheduled hard stop of the computer processing system occurs at any time during this access latency window, the validity of the data stored in that file is called into question. Depending upon exactly when the unscheduled hard stop occurs during a file access, the file as stored on the secondary storage system may reflect the file as it existed prior to the update request, as it exists after completion of the update request, or in some state of partial completion of the update request. In the event that the hard stop occurs during the updating of the control information for the file structure, it is also possible that the control information for that file, or even the control information for the file tree pointing to any number of files stored on the secondary storage system, may have been corrupted as a result of the unscheduled hard stop.
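The three possible outcomes of a hard stop during the access latency window can be illustrated with a minimal sketch. The two-step update, the `stop_after` parameter, and the dictionary standing in for the secondary storage device are all hypothetical names chosen for illustration and do not describe any particular file system.

```python
def update_file(disk, name, new_data, stop_after=None):
    """Apply an update as two separate writes to a simulated disk.

    stop_after simulates an unscheduled hard stop:
      0    -> stop before any write reaches the disk
      1    -> stop after the data write but before the control write
      None -> no hard stop; both writes complete
    Returns a label for the state left on the simulated disk.
    """
    if stop_after == 0:
        return "old"
    disk["data"][name] = new_data            # data information
    if stop_after == 1:
        return "partial"
    disk["control"][name] = len(new_data)    # control information (file size)
    return "new"


def is_consistent(disk, name):
    """Control information must agree with the data it describes."""
    return disk["control"][name] == len(disk["data"][name])


def fresh_disk():
    """A simulated disk holding one 3-byte file named 'f'."""
    return {"data": {"f": "abc"}, "control": {"f": 3}}
```

Only the hard stop between the two writes leaves the simulated disk in a state where the control information no longer describes the data; the other two cases leave either the old file or the new file intact.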
The traditional mechanism for ensuring data integrity in the event of an unscheduled hard stop is to maintain a transaction log, for example for all database files, as described in U.S. Pat. Nos. 5,095,421, 4,945,474 and 4,530,054. In the data recovery system described in U.S. Pat. No. 4,530,054, for example, a time stamp is generated with each write command as a mechanism to log data transactions between cache memory and the bulk memory of the secondary storage devices. The primary problems with maintaining a transaction log are that recovery of the file system can be a complicated and lengthy process for file systems having a large number of files or records, and that the transaction log may not provide protection against corruption of the control information for the file system.
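The general idea of a time-stamped transaction log can be sketched as follows. This is a generic illustration only and is not the method of any of the patents cited above; the class `LoggedStore`, its fields, and the `crash_before_apply` flag are hypothetical, and a simple counter stands in for a real time stamp.

```python
import json
from dataclasses import dataclass, field


@dataclass
class LoggedStore:
    """Toy store that logs every write before applying it."""
    log: list = field(default_factory=list)    # stands in for a durable log
    disk: dict = field(default_factory=dict)   # stands in for bulk storage
    next_ts: int = 0                           # logical time stamp (assumption)

    def write(self, name, data, crash_before_apply=False):
        # Step 1: append a time-stamped record to the log.
        record = {"ts": self.next_ts, "name": name, "data": data}
        self.log.append(json.dumps(record))
        self.next_ts += 1
        # Step 2: apply the write to bulk storage, unless a simulated
        # hard stop lands between the two steps.
        if crash_before_apply:
            return
        self.disk[name] = data

    def recover(self):
        """Replay every logged record in time-stamp order.

        Returns the number of records replayed, which grows with the
        length of the log rather than with the amount of damage.
        """
        records = sorted((json.loads(line) for line in self.log),
                         key=lambda rec: rec["ts"])
        for rec in records:
            self.disk[rec["name"]] = rec["data"]
        return len(records)
```

In this sketch recovery replays the entire log, which mirrors the observation that recovery can be lengthy when many files or records are involved. The sketch also logs only file data, so it offers nothing for control information that is damaged outside the log.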
The other common technique for ensuring data integrity is to provide a redundant, fault-tolerant system using hardware, as shown for example in U.S. Pat. Nos. 4,819,159 and 5,155,845, or software, as shown for example in U.S. Pat. No. 5,165,031. Such redundancy techniques are necessarily more expensive and more complicated and, hence, are only desirable in those situations where data integrity is of the utmost importance for a particular computer processing system.
In a UNIX® System V file system, the problem of control information integrity in the event of an unscheduled hard stop is especially acute. Unlike DOS-based systems, which write control information only to the secondary storage devices, the file control information in a System V-based file system is cached in memory, thereby increasing the access latency window between the updating of the control information and the writing of the updated control information to the more permanent secondary storage device. In addition, System V-based file systems lack standard sync points for writing control information, thereby creating multiple indeterminate windows of opportunity for corruption of the control information of the file structure.
In order to recover from the possible corruption of control information, the System V-based file systems use an fsck command that verifies the control information and the directories in the file system after an unscheduled hard stop. The fsck command bypasses the standard file access methods and compares the directory and control information in an effort to identify any disk blocks or control structures, known as inodes, that are not accounted for within the file system. For example, if there are inodes that are set to indicate associated files but no file name entries appear to exist, the fsck command places these files in a lost and found directory for the system administrator to identify and repair. It will be apparent that the repairing of files in the lost and found directory can be a time and labor intensive process. In another example, if there are file name entries in a directory that are not associated with an inode, the fsck command "repairs" the inconsistency simply by eliminating the file. Another disadvantage of using the fsck command is that it can be a time and processor intensive task to recover from an unscheduled hard stop if there are a large number of files and inodes to compare. In addition, the fsck command provides no redundancy checking and lacks any mechanism to pinpoint the occurrence of the unscheduled hard stop.
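The two inconsistencies described above can be illustrated with a simplified sketch of a directory-to-inode comparison. This is not the actual fsck implementation; the function name, the set of allocated inode numbers, the name-to-inode mapping, and the `#<inode>` naming of recovered files are assumptions made for illustration.

```python
def check_consistency(allocated_inodes, directory):
    """Compare a name -> inode-number directory against allocated inodes.

    Returns a tuple (repaired_directory, lost_and_found, removed_names):
      - inodes marked in use but named by no directory entry are
        collected in lost_and_found under a generated name;
      - directory entries naming an unallocated inode are dropped.
    """
    referenced = set(directory.values())

    # Inodes marked in use that no directory entry points at.
    orphans = sorted(allocated_inodes - referenced)
    lost_and_found = {f"#{ino}": ino for ino in orphans}

    # Directory entries that point at no allocated inode are eliminated.
    removed_names = sorted(
        name for name, ino in directory.items()
        if ino not in allocated_inodes
    )
    repaired = {
        name: ino for name, ino in directory.items()
        if ino in allocated_inodes
    }
    return repaired, lost_and_found, removed_names
```

Even in this reduced form the check has to visit every inode and every directory entry, which is why the cost of such a scan grows with the number of files rather than with the amount of work that was in flight at the time of the hard stop.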
The problem of file integrity in the event of an unscheduled hard stop is compounded for data servers storing remote files for a distributed network environment where the remote file systems may be accessed by any number of user nodes on the network. In this situation, not only is the access latency increased because the remote files must be transferred across the network to the data server, but the possibility of multiple users accessing the same remote file must also be taken into account. In such a distributed network environment, it is also more difficult to implement traditional logging or redundancy techniques for ensuring file integrity because of the lack of a central controller to implement file recovery procedures. The demands of file availability on the network may also rule out the time that would otherwise be required to ensure file integrity in the event of an unscheduled hard stop of a network data server if traditional transaction log recovery or fsck procedures are used to recover and verify the remote file systems stored on the data server.
Correspondingly, the problem of control information integrity is also compounded for remote files stored in a distributed network environment and accessed by any number of users on the network. Control information may be cached in multiple locations, thereby increasing the frequency and duration of the access latency windows and the opportunities for control information corruption. Multiple user nodes may be accessing file inodes or directories at any given time, greatly increasing the possibility of inconsistencies existing after an unscheduled hard stop of the file system.
Although conventional techniques for file recovery are adequate for recovering local files stored on secondary storage systems directly connected to a computer processing system, such techniques are not well suited to handle file recovery for data servers in a distributed computer network environment. Consequently, it would be advantageous to provide a method and apparatus for file recovery for secondary storage systems that is capable of ensuring the integrity of control information and providing for fast and reliable data recovery of files stored on secondary storage systems, including remote files stored on networked data servers, upon restart after an unscheduled hard stop.