In modern computer networks, a storage server can be used for many different purposes, such as to provide multiple users with access to shared data or to back up mission-critical data. A file server is an example of a storage server which operates on behalf of one or more clients to store and manage shared files in a set of mass storage devices, such as magnetic or optical storage-based disks or tapes. The mass storage devices are typically organized into one or more volumes of Redundant Array of Independent (or Inexpensive) Disks (RAID).
A file server may include several core components, including one or more processors, a system memory, and a communications bus. The communications bus may be a Peripheral Component Interconnect (PCI) bus that connects the core components to one or more peripheral components on the PCI bus. The peripheral components connected to the PCI bus may include, for example, a non-volatile random access memory (NVRAM), one or more internal mass storage devices, a storage adapter, a network adapter, a cluster interconnect adapter, or other components.
In conventional systems, to protect against a failure of the file server, an approach called clustered failover (CFO) has been used, in which a source file server and a destination file server operate as “cluster partners.” The source file server and the destination file server are connected by a high-speed cluster interconnect. In the event of a failure in either the source file server or the destination file server, storage operations are taken over by the other file server in the cluster pair, which services further requests from a client. However, complete redundancy of the storage server, including core components such as processors and memory, can be a prohibitively expensive solution in certain cases.
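The takeover behavior of a cluster pair described above can be illustrated with a minimal sketch. It is not drawn from any particular product: the names `FileServer`, `ClusterPair`, `healthy`, and `route` are hypothetical, and the health flag is an assumption that stands in for whatever failure detection the cluster interconnect would provide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileServer:
    """Hypothetical model of one file server in a cluster pair."""
    name: str
    healthy: bool = True  # assumption: set to False when a failure is detected
    served: List[str] = field(default_factory=list)

    def handle(self, request: str) -> str:
        # Record the request and report which server serviced it.
        self.served.append(request)
        return f"{self.name} handled {request}"


class ClusterPair:
    """Two file servers acting as cluster partners (illustrative only)."""

    def __init__(self, first: FileServer, second: FileServer) -> None:
        self.servers: Dict[str, FileServer] = {first.name: first, second.name: second}
        # Each server's partner is simply the other member of the pair.
        self.partner: Dict[str, FileServer] = {first.name: second, second.name: first}

    def route(self, owner_name: str, request: str) -> str:
        """Send a request to its owning server, or to the partner on failure."""
        owner = self.servers[owner_name]
        if owner.healthy:
            return owner.handle(request)
        partner = self.partner[owner_name]
        if partner.healthy:
            # Takeover: the surviving partner services further requests.
            return partner.handle(request)
        raise RuntimeError("both cluster partners have failed")
```

In this sketch, a request addressed to a failed server is serviced by its partner, and an error is raised only when both servers are down. Note that the whole server is treated as a single unit of failure, which is why the approach requires a fully redundant second server.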
Conventional systems also have the weakness of not being able to survive input/output (I/O) errors. Any error, such as a PCI error, can bring the file server down, requiring the transfer of operations to the other server in the cluster pair. Such system failures are sometimes unnecessary, however, considering that not all PCI devices are essential for system survival. Furthermore, not all PCI errors are the result of hardware failures. Sometimes an error is caused by an intermittent driver problem, and reinitialization is a possible solution. In other cases, a failure results from a software interaction that triggers a hardware problem. Finally, there may be hardware failures that are permanent. It may not be easy to distinguish between these different cases and to determine when complete transfer of storage operations is necessary and when device reinitialization is possible.