There have been a number of efforts undertaken in a network environment to provide high availability to network-accessible filesystems. Such efforts have included providing high availability storage media (e.g., RAID disk arrays) and cluster servers that can access such storage media. The high availability storage medium is further configured to implement any of a number of storage schemes, such as mirroring data on a duplicate disk (RAID level 1) or providing a mechanism by which data on a disk that has become lost or inaccessible can be reconstructed from the other disks comprising the storage medium (RAID level 5).
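The reconstruction idea behind RAID level 5 can be illustrated with byte-wise XOR parity. The following is a minimal sketch only: the function names (xor_blocks, make_parity, reconstruct) and the three-block stripe are hypothetical, and a real array additionally rotates parity across the disks, which is omitted here.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def make_parity(data_blocks):
    """Compute the parity block for one stripe of data blocks."""
    return xor_blocks(data_blocks)

def reconstruct(surviving_blocks, parity):
    """Rebuild the single missing block from the survivors and the parity."""
    return xor_blocks(surviving_blocks + [parity])

# One stripe spread over three hypothetical data disks.
stripe = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(stripe)

# Suppose the disk holding stripe[1] becomes inaccessible.
rebuilt = reconstruct([stripe[0], stripe[2]], parity)
```

Because XOR is its own inverse, combining the parity block with every surviving block yields the contents of the one missing block; the scheme cannot recover from two simultaneous disk losses in the same stripe.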
As shown in FIG. 1, clustered servers are generally configured and arranged so that there are redundant nodes for accessing the storage medium, for example two (2) nodes, Nodes A and B. Conceptually, this feature improves data and application availability by allowing the two servers or nodes to trade ownership of the same hard disks or RAID disk array within a cluster. When one of the servers in the cluster is unavailable (e.g., the operating system on the server fails), the operating system cluster software of the other, still functioning node or server, such as Microsoft Cluster Server (MSCS) available with Microsoft Windows NT or Windows 2000 server, automatically recovers the resources and transfers the work from the failed system or server to the operational server within the cluster. The redundant servers and operating systems thus provide a mechanism by which the client applications can quickly reconnect to the external hard disks or RAID disk array so that they can access their data and/or application programs stored therein. As a result, the failure of one server or operating system in the cluster does not affect the other server(s) or system(s), and in many cases the client applications as well as the remote user are completely unaware of the failure. This redundancy and automatic recovery capability in turn translates into higher server availability for users.
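The trading of ownership between nodes can be sketched as a toy model. This is an illustration under stated assumptions, not the MSCS interface: the Cluster class, its fail method, and the resource name "disk_array" are all hypothetical, and real cluster software also handles failure detection, quorum, and restarting of services.

```python
class Cluster:
    """Toy model of a cluster whose nodes trade ownership of shared resources."""

    def __init__(self, nodes, resources):
        self.alive = {node: True for node in nodes}
        # Assumption: every resource is initially owned by the first node.
        self.owner = {res: nodes[0] for res in resources}

    def fail(self, node):
        """Mark a node as failed and move its resources to a surviving node."""
        self.alive[node] = False
        survivors = [n for n, up in self.alive.items() if up]
        if not survivors:
            raise RuntimeError("no surviving node can take ownership")
        for res, current_owner in self.owner.items():
            if current_owner == node:
                self.owner[res] = survivors[0]

cluster = Cluster(["A", "B"], ["disk_array"])
cluster.fail("A")  # Node B now owns the shared disk array
```

Note that the model transfers ownership of the storage only; anything the failed node held in volatile memory is not part of the transfer.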
In addition, a past major concern with input/output (I/O) operations was the speed at which operations were being processed, because most users were concerned about getting their work done as fast as possible. Thus, efforts were undertaken to improve the relative speed of I/O operations, for example by the use of a cache operably located between the random access memory (RAM) or client applications and the storage medium (e.g., hard disk or disk array). Because data can be written to the cache faster than if there were a direct write-through to the storage medium, and because writes from the cache to the storage medium are typically done in batch or flushing style, the apparent speed of the I/O operation(s) is improved.
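The batch or flushing behavior described above can be sketched as a small write-back cache. This is a simplified illustration: the WriteBackCache class, its flush_threshold parameter, and the use of a dictionary as a stand-in for the storage medium are all assumptions made for the example.

```python
class WriteBackCache:
    """Toy write-back cache: writes land in memory and are flushed in batches."""

    def __init__(self, backing_store, flush_threshold=4):
        self.store = backing_store          # dict standing in for the disk
        self.dirty = {}                     # blocks written but not yet on disk
        self.flush_threshold = flush_threshold

    def write(self, block, data):
        """Record the write in the cache; flush once enough blocks are dirty."""
        self.dirty[block] = data
        if len(self.dirty) >= self.flush_threshold:
            self.flush()

    def read(self, block):
        """Serve the newest copy, preferring unflushed data over the disk."""
        if block in self.dirty:
            return self.dirty[block]
        return self.store.get(block)

    def flush(self):
        """Write all dirty blocks to the backing store in one batch."""
        self.store.update(self.dirty)
        self.dirty.clear()

disk = {}
cache = WriteBackCache(disk, flush_threshold=3)
cache.write(0, b"a")
cache.write(1, b"b")   # both blocks are still only in the cache
cache.write(2, b"c")   # threshold reached, all three blocks are flushed
```

Between the second and third writes the disk holds none of the data, so a crash in that window would lose both cached blocks; this is the trade-off between apparent speed and reliability that such a cache introduces.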
In addition to I/O processing speed, the reliability of the data being stored has become an increasingly important issue relative to the speed by which a user can access data on the disk drive. Stated another way, if there is a system failure resulting in data loss or data corruption, the speed by which the preceding I/O operation was performed becomes irrelevant. Two write systems, the careful write file system and the lazy write file system, do not guarantee protection of user file data. If the operating system crashes while an application is writing a file using either of these two systems, the file can be lost or corrupted. In the case of a lazy write file system, the crash also can corrupt the file system itself, destroying existing files or even rendering an entire volume inaccessible.
Some operating systems have been developed to include a write functionality or technique whereby no file system operations or transactions will be left incomplete and the structure of the disk volume will remain intact in the case of system failure, without the need to run a disk repair utility. Such a recovery functionality, however, does not result in the recovery and updating of user data. Rather, the recovery procedure returns the data file to a consistent state existing before the write; however, in-process changes written to the cache can be lost.
It thus would be desirable to provide a new operating system, computer system and methods related thereto, operating in a cluster server environment, that can recover from the failure of one cluster node or cluster server, which recovery also includes the capability of updating the data file in a storage medium to include data that was not completely written (i.e., unwritten data) before the onset of the failure. It also would be particularly desirable to provide such an operating system, computer system, executable program and methods related thereto where such recovery is accomplished automatically or without user action. It also would be desirable to provide such an operating system, computer system, executable program and methods related thereto where such recovery can be effected seamlessly to the remote user and with minimal or essentially no interruption to a running or client application. Such systems preferably would be simple in construction, and such methods would not require highly skilled users to utilize the systems as compared to prior art systems or methodologies.