1. Field of the Invention
The present invention is generally related to fault tolerant computer systems and, in particular, to computer systems that provide fault tolerant network filesystem (NFS) data serving.
2. Description of the Related Art
With the increasing reliance on computers throughout society, increasing emphasis is placed upon providing computer systems with assured availability. In general, such systems are referred to as fault tolerant computer systems. There are a number of quite disparate system implementations of fault tolerance, largely due to how critical operations are identified. What constitutes a critical operation is, in turn, dependent on the particular type of tasks required of such computer systems. Consequently, the particular construction of fault tolerance features implemented in any given computer system may and generally does take significantly different architectural forms.
Several common fault tolerance considerations exist with regard to most if not all of the disparately architected computer systems. These considerations include minimizing performance loss due to the addition of fault tolerance capabilities, reducing costs and maximizing the utilization of hardware and software in a fault tolerant system, minimizing administration of the fault tolerant features, limiting the impact of the implementation of fault tolerant features on the end users of the services provided by the computer system and, finally, reducing the practical significance of architectural constraints on the expandability or modular enhancement capabilities of the computer system.
One form of fault tolerance in computer systems is known as intra-server fault tolerance. This form of fault tolerance applies to a unified computer system utilizing multiple central processor units (CPUs) operating internally in a mutually redundant fashion. A dedicated error detection control subsystem is provided typically at the CPU base level to identify any operational discrepancies in the programs executed synchronously by the redundant CPUs. A malfunctioning CPU can thus be identified and functionally eliminated from further participation in providing computer system services. The remaining CPU or CPUs may then continue to provide all requested services, pending identification and correction of the faulty CPU.
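The comparison step described above can be illustrated with a minimal sketch. It assumes three redundant units and a simple majority vote, which is only one way a discrepancy could be attributed to a particular unit; the function name `vote` and the list-of-outputs interface are hypothetical and are not taken from any particular system.

```python
from collections import Counter


def vote(outputs):
    """Compare the results of redundant units for one synchronous step.

    outputs: list of hashable results, one per redundant unit (assumed interface).
    Returns (agreed_value, faulty_indices).
    Raises RuntimeError when no strict majority exists, since the fault
    can then be detected but not attributed to a specific unit.
    """
    counts = Counter(outputs)
    value, n = counts.most_common(1)[0]
    if n <= len(outputs) // 2:
        raise RuntimeError("no majority; fault cannot be isolated")
    # Units whose result differs from the majority are treated as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


if __name__ == "__main__":
    value, faulty = vote([7, 7, 9])
    print(value, faulty)
```

In this sketch a unit listed in `faulty` would be excluded from later steps while the remaining units continue to provide service. With only two units, a disagreement leaves no majority, so the sketch raises an error rather than guessing which unit failed.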
Conventional intra-server fault tolerant computer systems, though providing a high degree of reliability, inherently embody two significant disadvantages. These disadvantages include a requirement of a highly constrained or specialized architectural design and a high cost due to substantial duplicated hardware. Offsetting these disadvantages is the very high degree of reliability obtained by ensuring correct operation down to an essentially machine cycle level of operation.
Another known form of fault tolerance is obtained through the close coupling of computer systems at the data subsystem level. A typical architecture employing such an inter-server approach utilizes a redundant pair of computer systems cross-coupled through dual-ported disk subsystems. Each system is thus permitted to perform as a master initiator of disk data transfer requests. Should either of the computer systems fail, to the exclusion of the corresponding disk subsystem, the remaining computer system can have complete access to the data of the failed system.
The disadvantages of this form of fault tolerance are that there is little protection against the corruption of data written by a failing system and that there is exposure to a failure occurring in either of the disk subsystems. There is also the significant additional cost of providing and maintaining dual-ported disk subsystems on all of the fault tolerance protected computer systems. An architectural limitation is also present in that dual-porting only allows paired systems to be mutually protected. Also, the paired systems must be mutually local, thereby requiring single site survival as a prerequisite to fault tolerant operation. Finally, a substantial performance penalty is involved at the point of any failover. Since the two computer systems only have port-based access to the disk subsystem of the other computer system, the disk subsystem of the failed host must be first cleaned and then remounted by the remaining host. Where large disk subsystems are in use, this process may take several hours.
Finally, another approach for providing fault tolerance occurs in networked computer environments. This approach operates by use of network filesystem shadowing and may be generally implemented as either a client initiated or a server-only capability. Client initiated shadowing requires that the client computer system send all network operation requests to two or more pre-established remote server systems. This approach to fault tolerance allows each client to independently ensure the identity of the client data in two or more physically separate data storage locations. The client is, however, burdened with the responsibility of monitoring and analyzing the completion of all such duplicated requests. The client must be able to resolve both conventional network transmission errors and the loss of access to any of the remote network server systems.
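The client-side burden described above can be illustrated with a minimal sketch. It assumes each remote server is represented by a callable that performs one write and raises `OSError` on a transmission error or loss of access; the names `shadowed_write`, `make_memory_server`, `unreachable_server`, and `ShadowWriteError` are hypothetical and do not correspond to any actual NFS client interface.

```python
class ShadowWriteError(Exception):
    """Raised when no server accepted the duplicated write."""


def shadowed_write(servers, path, data):
    """Send the same write to every server and track completion.

    servers: dict mapping a server name to a callable send(path, data)
             (assumed interface; a real client would issue network requests).
    Returns the list of server names on which the write failed.
    """
    failed = []
    for name, send in servers.items():
        try:
            send(path, data)
        except OSError:
            # Covers transmission errors and loss of access to a server.
            failed.append(name)
    if servers and len(failed) == len(servers):
        raise ShadowWriteError(f"write to {path!r} failed on all servers")
    return failed


def make_memory_server(store):
    """Stand-in for a remote server: writes land in a local dict."""
    def send(path, data):
        store[path] = data
    return send


def unreachable_server(path, data):
    """Stand-in for a server that can no longer be reached."""
    raise ConnectionError("server unreachable")


if __name__ == "__main__":
    a, b = {}, {}
    servers = {"a": make_memory_server(a), "b": unreachable_server}
    print(shadowed_write(servers, "/export/file", b"payload"))
```

Even in this reduced form, the client must decide what to do with the returned list of failed servers, for example retrying or marking a copy as stale. That decision logic is the monitoring and analysis responsibility that falls on each client.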
A clear disadvantage of client initiated shadowing is, of course, the requirement for modified operating system software to be executed on each fault tolerant protected client. This approach also admits of inconsistent administration of the client shadowed data storage areas on remote servers by requiring each client to define and control the areas of fault tolerance. Client initiated shadowing also shifts the responsibility for responding to shadowing errors, and their correction, to the user of the client.
The known use of NFS shadowing at the server system level relies on delayed writes of shadowed data from a primary to a secondary server system. NFS server level shadowing thus requires only that all data modifications stored on one server be logged in real time for replication to at least a second server. The inter-server transfer of such logged data is performed as a low priority background task so as to have minimal impact on the normal function and performance of the primary server system. Even so, the delayed background transfer of logged data from the primary to the backup server system may consume a substantial portion of the network resources of a primary server. Another problem with NFS server shadowing is that, at the point of any failover, the delayed write of logged data to the surviving system creates an exposure window for the loss of data.
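The exposure window described above can be illustrated with a minimal sketch of logged, delayed replication. It assumes an in-memory log and a secondary represented by a plain dictionary; the class `PrimaryServer` and its methods `write` and `drain` are hypothetical names chosen for illustration only.

```python
from collections import deque


class PrimaryServer:
    """Primary that applies writes locally and logs them for later transfer."""

    def __init__(self):
        self.store = {}      # data held by the primary
        self.log = deque()   # modifications not yet sent to the secondary

    def write(self, path, data):
        # The write completes on the primary immediately; only a log
        # record is queued for the secondary.
        self.store[path] = data
        self.log.append((path, data))

    def drain(self, secondary, max_records):
        """Background task: forward at most max_records logged writes.

        secondary: dict standing in for the secondary server's storage.
        Returns the number of records transferred.
        """
        sent = 0
        while self.log and sent < max_records:
            path, data = self.log.popleft()
            secondary[path] = data
            sent += 1
        return sent


if __name__ == "__main__":
    primary, secondary = PrimaryServer(), {}
    for name in ("/a", "/b", "/c"):
        primary.write(name, b"data")
    primary.drain(secondary, max_records=1)
    # Records still in the log would be lost if the primary failed now.
    print(len(secondary), len(primary.log))
```

Any records remaining in `log` when the primary fails exist nowhere on the secondary. A lower transfer priority keeps the impact on the primary small but lets the log grow, which widens the window.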
Consequently, in view of the necessity of client software modification and user administration in client based NFS shadowing and the performance penalties and real potential for data loss in server based NFS shadowing, the use of NFS based fault tolerant systems has not been generally well received in the industry where data fault tolerance is a requirement.