1. Introduction and Background
The main motivations for distributed computer systems are high availability and increased performance. Distributed computer systems support a host of highly available or parallel applications, such as Oracle Parallel Server (OPS), Informix XPS, or HA-NFS. While the key to the success of distributed computer systems in the high-end market has been high availability and scalable performance, another key has been the implicit guarantee that the data entrusted to such a distributed computer system will retain its integrity.
2. System Model and Classes of Failures
The informal system model for some distributed computer systems is that of a "trusted" asynchronous distributed system. The system is composed of 2 to 4 nodes that communicate via message exchange on a private communication network. Each node of the system has two paths into the communication medium so that the failure of one path does not isolate the node from other cluster members. The system has a notion of membership that guarantees that the nodes come to a consistent agreement on the set of member nodes at any given time. The system is capable of tolerating failures, and failed nodes are guaranteed to be removed from the cluster within bounded time by using fail-fast drivers and timeouts. The nodes of the distributed system are also connected to an external, public network that connects the client machines to the cluster. The storage subsystem may contain shared data that may be accessed from different nodes of the cluster simultaneously. Simultaneous access to the data is governed by the upper-layer applications. Specifically, if the application is OPS, then simultaneous access is granted, as OPS assumes that the underlying architecture of the system is a shared-disk architecture. All other applications that are supported on the cluster assume a shared-nothing architecture and therefore do not require simultaneous access to the data.
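The timeout-based removal of failed nodes described above can be illustrated with a minimal sketch. It models only a single observer's view of which nodes are still alive. The agreement protocol that makes the membership set consistent across nodes is not shown. The class name `MembershipMonitor`, the node names, and the 5-second timeout are all hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class MembershipMonitor:
    """Hypothetical helper: tracks the last heartbeat seen from each node."""
    timeout: float                                  # assumed silence limit, in seconds
    last_seen: dict = field(default_factory=dict)   # node name -> last heartbeat time

    def heartbeat(self, node: str, now: float) -> None:
        # Record that `node` was heard from at time `now`.
        self.last_seen[node] = now

    def members(self, now: float) -> set:
        # A node remains a member only while its most recent heartbeat
        # is no older than the timeout, which bounds the removal time.
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout}


monitor = MembershipMonitor(timeout=5.0)
monitor.heartbeat("node-a", now=0.0)
monitor.heartbeat("node-b", now=0.0)
monitor.heartbeat("node-a", now=4.0)    # node-b stays silent from here on
current = monitor.members(now=6.0)      # node-b has exceeded the timeout
```

In this sketch a silent node drops out of the set as soon as the timeout elapses, which is the bounded-time property the model relies on.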
While the members of a cluster are "trusted", nodes that are not part of the current membership set are considered "un-trusted". The un-trusted nodes should not be able to access shared resources of the cluster. These shared resources are the storage sub-system and the communication medium. While access to the communication medium may not pose a great threat to the operation of the cluster, other than a possible flooding of the medium by the offending node(s), access to the shared storage sub-system constitutes a serious danger to the integrity of the system, as an un-trusted node may corrupt the shared data and compromise the underlying database. To fence non-member nodes from access to the storage sub-system, the nodes that are members of the cluster exclusively reserve the parts of the storage sub-system that they "own". This results in the exclusion of all other nodes, regardless of their membership status, from accessing the fenced parts of the storage sub-system. The fencing has been done, up to now, via low-level SCSI-2 reservation techniques, but it is possible to fence the shared data with the optional SCSI-3 persistent group reservations, as those are, in fact, a super-set of SCSI-2 reservations. It is important to note that we assume that the nodes that could possibly form the cluster, regardless of whether they are current cluster members or not, do not behave maliciously. The cluster does not employ any mechanisms, other than requiring root privileges for the user, to prevent malicious adversaries from gaining membership in the cluster and corrupting the shared database.
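The effect of an exclusive reservation can be sketched with a toy in-memory model. This is an illustration under stated assumptions, not an implementation of the SCSI-2 command set: no real SCSI commands are issued, and the names `SharedDevice` and `ReservationConflict` are hypothetical.

```python
class ReservationConflict(Exception):
    """Raised when a node touches a device reserved by another node."""


class SharedDevice:
    """Toy model of a shared device with one exclusive reservation."""

    def __init__(self):
        self.holder = None    # node currently holding the reservation, if any
        self.blocks = {}      # block number -> data

    def reserve(self, node):
        # Only one node may hold the reservation at a time.
        if self.holder not in (None, node):
            raise ReservationConflict(f"{node}: device reserved by {self.holder}")
        self.holder = node

    def release(self, node):
        # Only the holder can release its own reservation.
        if self.holder == node:
            self.holder = None

    def write(self, node, block, data):
        # While a reservation is held, every other node is fenced off,
        # regardless of its membership status.
        if self.holder is not None and self.holder != node:
            raise ReservationConflict(f"{node}: device reserved by {self.holder}")
        self.blocks[block] = data


disk = SharedDevice()
disk.reserve("node-a")
disk.write("node-a", 0, b"ok")
try:
    disk.write("node-b", 0, b"corrupt")
    fenced = False
except ReservationConflict:
    fenced = True
```

Here the write from `node-b` is rejected and the block written by `node-a` is left untouched, which is the exclusion property the reservation is meant to provide.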
While it is easy to fence a shared storage device if that device is dual-ported, via the SCSI-2 reservations, no such technique is available for multi-ported devices, as the necessary but optional SCSI-3 reservations are not implemented by the disk drive vendors. In this paper we assume that the storage sub-system is entirely composed of either dual-ported or multi-ported devices. Mixing the dual and multi-ported storage devices does not add any complexity to our algorithms, but will make the discussion more difficult to follow. It should be pointed out, however, that a multi-ported storage sub-system has better availability characteristics than a dual-ported one as more node failures can be tolerated in the system without loss of access to the data. Before we investigate the issues of availability, data integrity, and performance we should classify the nature of faults that our systems are expected to tolerate.
There are many possible ways of classifying the types of failures that may occur in a system. The following classification is based on the intent of the faulty party. The intent defines the nature of the faults and can lead the system designer to guard against the consequences of such failures. Three classes of failures can be defined on the basis of the intent:
1. No-Fault Failures: This class of failures includes various hardware and software failures of the system, such as node failures, communication medium failures, or storage sub-system failures. All these failures share the characteristic that they are not the result of any misbehavior on the part of the user or the operator. A highly available system is expected to tolerate some such failures. The degree to which a system is capable of tolerating such failures and the effect on the users of the system determine the class (e.g., fault-tolerant or highly available) and level (how many and what type of failures can be tolerated simultaneously or consecutively) of availability in a traditional paradigm.
2. Inadvertent Failures: The typical failure in this class is that of an operator mistake or pilot error. The user who causes a failure in this class does not intend to damage the system. However, he or she is relied upon to make the right decisions, and deviations from those decisions can cause significant damage to the system and its data. The amount of protection the system incorporates against such failures defines the level of trust that exists between the system and its users and operators. A typical example of this trust in a UNIX environment is that of a user with root privileges who is relied upon to behave responsibly and not delete or modify files owned and used by other users. Some distributed systems assume the same level of trust as the operating system and restrict all the activities that can affect other nodes or users and their data to a user with root privileges.
3. Malicious Failures: This is the most difficult class of failures to guard against and is generally addressed by the use of authenticated signatures or similar security techniques. Most systems that are vulnerable to attacks by malicious users must take extra security measures to deny access to such users. Clusters in general, and Sun Cluster 2.0, available from Sun Microsystems, Inc. of Palo Alto, Calif., in particular, are typically used as back-end systems and are assumed immune from malicious attacks, as they are not directly visible to users outside the local area network in which they are connected to a selected number of clients. As an example of this lack of security, consider a user who breaks into a node and then joins that node to the cluster of a distributed system as a member. The malicious user can now corrupt the database by writing to the shared data without following the appropriate protocols, such as acquiring the necessary data locks. This lack of security software is due to the fact that some distributed systems are generally assumed to operate in a "trusted" environment. Furthermore, such systems are used as high-performance data servers that cannot tolerate the extra cost of running the security software required to defend the integrity of the system from attack by malicious users.
Note that the above classification is not comprehensive, nor are the classes distinct. However, such a classification serves as a model for discussing the possible failures in a distributed system. As mentioned earlier, we must guard against no-fault failures and make it difficult for inadvertent failures to occur. We do not plan to incorporate any techniques in system software to reduce the probability of malicious users gaining access to the system. Instead, we can offer third-party solutions that deny access to potentially malicious parties. One such solution is the FireWall-1 product by Check Point Software Technologies Limited, which controls access and connections and provides for authentication. Note that the addition of security to a cluster greatly increases the cost and complexity of communication among nodes and significantly reduces the performance of that system. Due to the performance requirements of high-end systems, such systems typically incorporate security checks in the software layer that interacts directly with the public network and assume that the member nodes are trusted, so that the distributed protocols, such as membership, do not need to embed security in their designs.