The invention relates to a file server system of the kind tolerant to software and hardware failures.
Reliability of a file server system is a measure of the continuity of failure-free service for a particular system and in a particular time interval. A related measure is the mean time between failures, which indicates how long the system is expected to perform correctly.
Availability measures the system's readiness to serve. One definition of availability is the percentage of time in which the system performs correctly in a given time interval. Unlike reliability, availability depends on system recovery time after a failure. If a system is required to provide high availability for a given failure model, i.e. for a defined set of possible failures, it has to provide fast recovery from the defined set of failures.
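The dependence of availability on recovery time can be illustrated with a short calculation. The following is a minimal sketch that uses the percentage-of-time definition given above; the function name `availability` and the example figures (a 30-day interval, two failures, recovery times of one minute and one hour) are hypothetical and chosen only for illustration.

```python
def availability(uptime_seconds: float, interval_seconds: float) -> float:
    """Percentage of the interval during which the system performed correctly."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    if not 0 <= uptime_seconds <= interval_seconds:
        raise ValueError("uptime must lie within the interval")
    return 100.0 * uptime_seconds / interval_seconds


# Hypothetical figures: a 30-day interval containing two failures.
interval = 30 * 24 * 3600

# Same number of failures, different recovery times per failure.
fast_recovery = availability(interval - 2 * 60, interval)    # 1 minute each
slow_recovery = availability(interval - 2 * 3600, interval)  # 1 hour each

print(f"fast recovery: {fast_recovery:.4f}%")
print(f"slow recovery: {slow_recovery:.4f}%")
```

With the number of failures held constant, only the recovery time changes the result, which is why a highly-available system has to recover quickly from the failures in its failure model.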
A number of existing file servers provide an enhanced level of availability for some specific failure models. Such file servers are sometimes referred to in the art as highly-available file servers. The mechanisms which are often used for this are based on some or all of the following:
(1) primary/back-up style of replicated file service;
(2) the use of logging for faster recovery, the log being kept on disk or in non-volatile memory;
(3) checksumming to protect data integrity while the data is stored on disk or while it is being transferred between the server's nodes; and
(4) reliable group communication protocols for intra-server communication.
It is an aim of the present invention to provide a file server system which is tolerant to software and hardware failures.
Particular and preferred aspects of the invention are set out in the accompanying independent and dependent claims. Features of the dependent claims may be combined with those of the independent claims as appropriate and in combinations other than those explicitly set out in the claims.
According to a first aspect of the invention there is provided a file server system for storing data objects with respective object identifiers and for servicing requests from remote client systems specifying the object identifier of the requested object. The system comprises a file store for holding stored objects with associated object identifiers. The system further comprises a signature generator for computing an object-specific signature from an object, a signature checker comprising a signature store for holding a previously computed signature for each of the stored objects, and a comparator operable to compare, on the basis of a specified object identifier, a signature retrieved from the signature store with a corresponding signature computed by the signature generator from an object retrieved from the file store.
The location of the signature generator may be associated with the file store and the location of the comparator may be associated with the checker. If the file store is replicated, a signature generator may be provided at each file store replica location. Similarly, if the checker is replicated, a comparator may be provided at each checker replica location.
Signatures computed at the time of object storage are thus archived in the checker for later reference to provide an independent record of the integrity of the data stored in the file store. When an object is retrieved from the file store, a signature for it can be computed by the signature generator and compared with the archived signature for that object. Any difference between the two signatures is thus an indicator of data corruption, which can then be acted upon, for example according to defined single point failure procedures.
In the first aspect of the invention, the system preferably has an operational mode in which a decision is made as to whether to perform a comparison check in respect of an object on the basis of profile information for that object. Profile information may be supplied with the request being serviced, and may also be stored in the file store for each object or for groups of objects. Where both are present, the profile information supplied with the request takes precedence.
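The precedence rule for profile information can be sketched as follows. This is an illustrative sketch only: the representation of a profile as a dictionary, the field name `check_signature`, and the fallback to checking when no profile is available are all assumptions that do not come from the description.

```python
from typing import Optional


def should_check(
    request_profile: Optional[dict],
    stored_profile: Optional[dict],
    default: bool = True,
) -> bool:
    """Decide whether to perform the signature comparison for an object.

    The profile supplied with the request is consulted first, then the
    profile stored for the object (or its group). The field name
    "check_signature" and the default value are hypothetical.
    """
    for profile in (request_profile, stored_profile):
        if profile is not None and "check_signature" in profile:
            return bool(profile["check_signature"])
    return default


# The request profile overrides the stored profile.
print(should_check({"check_signature": False}, {"check_signature": True}))
# With no request profile, the stored profile decides.
print(should_check(None, {"check_signature": True}))
```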
According to a second aspect of the invention there is provided a file server system for storing data objects with respective object identifiers and for servicing requests from remote client systems specifying the object identifier of the requested object. The system is constituted by a plurality of replicable components which may or may not be replicated in a given implementation or at a particular point in time. The replication is preferably manageable dynamically so that the degree of replication of each of the replicable components may vary during operation. Alternatively, the replication levels may be pre-set by the system administrator.
Replication is handled by a replication manager. The replication manager is configured to allow for nodes leaving and joining the system by respectively reducing and increasing the number of replicas of each of the replicable components affected by the node transit. A failure detector is also provided. The failure detector is not replicable, but is preferably distributed over the system nodes by having an instance running on each node. The failure detector has an object register for storing a list of ones of the system objects and is configured to monitor for failure of any of the system objects listed in the object register and, on failure, to report such failure to the replication manager. For each system object on the failure detector list, there may be stored a secondary list of other ones of the system objects that have an interest in the health of that system object. The failure detector is then configured to report failure of that object not only to the replication manager but also to each of the objects on the secondary list. The replication manager preferably records for each of the replicated components a primary of the component concerned and is configured to select a new primary when a node hosting a primary leaves the system.
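The reporting behaviour of the failure detector can be sketched as below. This is a minimal single-process sketch under stated assumptions: the class and method names, the use of callbacks to stand in for the replication manager and for interested objects, the polling approach, and the removal of a failed object from the register after it has been reported are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, Iterable, List

Notify = Callable[[str], None]


class FailureDetector:
    """One instance of the kind that might run on each node (hypothetical API)."""

    def __init__(self, report_to_replication_manager: Notify) -> None:
        self._report = report_to_replication_manager
        # Object register: monitored object -> secondary list of interested parties.
        self._register: Dict[str, List[Notify]] = {}

    def register(self, object_id: str, interested: Iterable[Notify] = ()) -> None:
        self._register.setdefault(object_id, []).extend(interested)

    def poll(self, is_alive: Callable[[str], bool]) -> List[str]:
        """Check every registered object once and report any failures found."""
        failed = [oid for oid in self._register if not is_alive(oid)]
        for oid in failed:
            self._report(oid)                  # always tell the replication manager
            for notify in self._register[oid]:
                notify(oid)                    # then everyone on the secondary list
            del self._register[oid]            # assumption: report each failure once
        return failed


manager_reports: List[str] = []
checker_notices: List[str] = []

detector = FailureDetector(manager_reports.append)
detector.register("file_store@node1", interested=[checker_notices.append])
detector.register("logger@node2")

# Simulate a failure of the file store replica on node1.
detector.poll(lambda oid: oid != "file_store@node1")
print(manager_reports, checker_notices)
```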
For enhanced reliability and availability, the file store is preferably replicated with a replication level of at least two, i.e. with a primary copy and at least one back-up copy. Another system component which may be replicable is a checker. The checker has a signature store for holding object-specific signatures computed for each of the objects stored in the file store.
A logger may also be provided to allow faster recovery in respect of nodes rejoining the system, for example after failure. The logger may also be replicated. The logger serves to maintain a log of recent system activity in non-volatile storage which can be accessed when a node is rejoining the system.
In the preferred embodiment, the file server system is located over a plurality of nodes, typically computers or other hardware elements. For operation, the file server system is connected to a network to which is also connected a plurality of client apparatuses that may wish to access the data stored in the file server system. The nodes of the file server system act as hosts for software components of the file server system. Several of the software components can be replicated. The replicable software components include: the system file store, a checker and a logger. The functions of these components are described further below. A replicated component has one primary copy and one or more back-up copies. Among the replicas of a given component, the primary may change through a process referred to as primary re-election, but there is only ever one primary at any one time for a given component. Generally it is desirable for reliability that replica copies of a given replicated component are each located at different nodes, or at least that the primary and one of the back-ups are located on different nodes. Thus, a given node may be host to the primaries of several different software components and to several back-ups. Location and handling of replica copies of a given replicable component is under the control of a replication manager which is a (non-replicable) software component of the file server system. The replication manager is distributed, meaning it can have one of its instances running on each node of the file server system. These instances inter-communicate to maintain coherence. Several or all of the nodes may be provided with direct network connections to the clients to provide redundancy. The network connections may be private or public.
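Primary re-election when a node leaves the system can be sketched as follows. The sketch is illustrative only: the class and method names are hypothetical, it models a single replication manager instance rather than the inter-communicating distributed instances described above, and the election policy (promote the longest-standing surviving back-up) is an assumption, since the description does not prescribe one.

```python
from typing import Dict, List


class ReplicationManager:
    """Tracks which nodes host replicas of each replicable component."""

    def __init__(self) -> None:
        # component -> hosting nodes; index 0 is the primary (assumed convention).
        self._replicas: Dict[str, List[str]] = {}

    def add_replica(self, component: str, node: str) -> None:
        nodes = self._replicas.setdefault(component, [])
        if node not in nodes:
            nodes.append(node)  # the first replica added becomes the primary

    def primary(self, component: str) -> str:
        # Raises IndexError if every replica of the component has been lost.
        return self._replicas[component][0]

    def node_left(self, node: str) -> List[str]:
        """Drop all replicas hosted on `node`; return components with a new primary."""
        re_elected = []
        for component, nodes in self._replicas.items():
            if node not in nodes:
                continue
            was_primary = nodes[0] == node
            nodes.remove(node)
            if was_primary and nodes:
                # The next surviving back-up is now at index 0, i.e. the primary.
                re_elected.append(component)
        return re_elected


manager = ReplicationManager()
manager.add_replica("file_store", "node1")
manager.add_replica("file_store", "node2")
manager.add_replica("checker", "node2")
manager.add_replica("checker", "node3")

print(manager.node_left("node1"))     # components whose primary changed
print(manager.primary("file_store"))  # the promoted back-up
```

Because the primary is always the single entry at index 0, there is never more than one primary for a component at any one time.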
The nodes are preferably loosely coupled. Each node is preferably provided with local storage, such as disk storage, and redundant network connections, for example a dual connection with the other nodes and a dual external connection to the network for client communication. The file server system can be implemented without any shared storage which has the advantage of making it possible to provide higher tolerance to failures.
In one embodiment of the invention, a file server system is provided which is tolerant to single point hardware and software failures, except partitioning failures. Protection can be provided against hardware component failure of a whole node, a cable, a disk, a network interface or a disk controller, and software component failure of the operating system or the file server enabling software. Software failure types for which protection can be provided include crash, timing and omission failures, and data corruption failures internal to the file server system. All these hardware and software failures are assumed to be transient. By basing the design of the file server system on a single point failure model, as in this embodiment, the file server system performance can be improved, with the proviso that such a system cannot handle simultaneous failure of more than one component.
In operation, a file server system of an embodiment of the invention services a write request received from a remote client and containing an object and an associated object identifier as follows: An object-specific signature is computed from the object. The object is stored in a file store together with the object identifier and the computed object-specific signature is stored in a further file store together with the object identifier. It will be appreciated that there is flexibility in the order in which these steps may be carried out. For example, the object may be stored before or after signature computation, or concurrently therewith. The file stores for the object and signatures are preferably located on different system nodes to enhance reliability. The stored signatures can be used in later checking processes whenever an object associated with a stored signature is accessed. The checking process involves performing a comparison between the stored signature retrieved from archive and a newly computed signature generated from the object retrieved from archive. For example, the file server system of this embodiment of the invention services a read request received from a remote client and specifying an object identifier for the requested object as follows: In response to receipt of the read request, the requested object is retrieved from the file store on the basis of the object identifier and an object-specific signature is computed from the retrieved object. Concurrently, beforehand or subsequently, the archived signature for the object is retrieved from the signature file store, also on the basis of the object identifier.
The newly computed signature is then compared with the old signature retrieved from archive and the subsequent request servicing then proceeds on the basis of the comparison result according to a pre-specified algorithm which may follow from a single point failure model or a multi point failure model, for example. As will be appreciated and as is described further below, the read and write request servicing algorithms can be extended according to the degree of replication of the file stores.
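The write and read servicing described above can be sketched as follows for the unreplicated case. This is a minimal sketch under stated assumptions: SHA-256 stands in for the object-specific signature (the description does not name a signature algorithm), in-memory dictionaries stand in for the two file stores, which would preferably reside on different nodes, and raising an exception on a mismatch is a placeholder for the pre-specified algorithm that governs subsequent servicing. All class and method names are hypothetical.

```python
import hashlib
from typing import Dict


class IntegrityError(Exception):
    """Raised when the computed and archived signatures differ."""


class FileServer:
    def __init__(self) -> None:
        self.file_store: Dict[str, bytes] = {}     # object identifier -> object
        self.signature_store: Dict[str, str] = {}  # object identifier -> signature

    @staticmethod
    def signature(obj: bytes) -> str:
        # Assumption: SHA-256 used as the object-specific signature.
        return hashlib.sha256(obj).hexdigest()

    def write(self, object_id: str, obj: bytes) -> None:
        # The order of these two steps is not significant.
        self.file_store[object_id] = obj
        self.signature_store[object_id] = self.signature(obj)

    def read(self, object_id: str) -> bytes:
        obj = self.file_store[object_id]
        computed = self.signature(obj)              # newly computed signature
        archived = self.signature_store[object_id]  # signature stored at write time
        if computed != archived:
            raise IntegrityError(f"signature mismatch for {object_id!r}")
        return obj


server = FileServer()
server.write("report.txt", b"quarterly figures")
print(server.read("report.txt"))

# Simulate corruption inside the file store; the next read detects it.
server.file_store["report.txt"] = b"quarterly figurez"
try:
    server.read("report.txt")
except IntegrityError as exc:
    print("detected:", exc)
```

A mismatch alone does not indicate which of the two stores holds the corrupted data; with replicated stores, the servicing algorithm can consult further copies to resolve this.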