Over the past few years computer systems have become larger and larger, with more processors, more interconnections, and more physical memory being incorporated therein. The increasing size of the computer systems has led to a reliability problem that greatly affects system operation.
This problem has two significant factors: the first is hardware unreliability, and the second is software unreliability. Regarding the hardware aspects of the problem, the probability that any one piece of hardware in the system will fail increases as the number of pieces of hardware in the system increases. This is due to the large size of the system, as well as the increasing complexity of the hardware.
Regarding the software aspects of the problem, the software has become unreliable and tends to fail unexpectedly. This is because the systems are becoming faster, and the programs are becoming larger and more complex, which leads to an increase in the occurrence of software failures. The higher clock rates of the systems allow the software to run faster. Since more lines of code are executed in a given period of time, the system is more likely to execute buggy code, and thus to fail, simply because it is running faster. Moreover, since the systems are running faster and have more memory, programs have become larger and more complex, thus increasing the likelihood that buggy code exists in the programs. For example, assume a certain defect rate of one bug in every 10,000 lines of code. A 100,000-line program would then have 10 bugs, while a 30,000-line program would have only 3 bugs. Thus, the problems of hardware unreliability and software unreliability are two obstacles in building large systems.
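The defect-rate arithmetic above can be expressed as a short calculation (the one-bug-per-10,000-lines figure is the illustrative rate from the text, not a measured value):

```python
# Illustrative defect-rate arithmetic: assume one bug per 10,000 lines
# of code, as in the example above.
BUGS_PER_LINE = 1 / 10_000

def expected_bugs(lines_of_code: int) -> float:
    """Expected number of bugs for a program of the given size."""
    return lines_of_code * BUGS_PER_LINE

print(expected_bugs(100_000))  # 10.0
print(expected_bugs(30_000))   # 3.0
```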
A prior approach to solving this problem is to divide a large system into domains or nodes. The concept, known as clustering, contains failures in the system, such that if there is a failure, either hardware or software, in one node or domain of the system, the damage that is done is limited to the domain in which the fault occurs. For example, if a large system is divided into ten domains, and an error occurs in one of the domains of such severity that it brings down that domain, then the remaining nine domains would continue to operate. A domain could have a single processor or many processors.
A domain is a subset of the hardware that is isolated from the remainder of the system by fail-safe mechanisms that prevent failures within a failed domain from spreading into the other domains and crashing the entire system. A domain-oriented system uses software partitioning mechanisms such that the software is aware of the domains and uses certain protocols to contain software faults within a domain. These software mechanisms are complex and expensive to implement, and invasive in the sense that they require changes to many parts of the operating system, and require that operating system engineers be cognizant of the unreliability of the underlying hardware.
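The fault-containment idea described above can be sketched as follows (this is a hypothetical, illustrative model, not any actual system's mechanism): a fault is caught at the domain boundary and brings down only that domain, while the others keep running.

```python
# Minimal sketch of fault containment across domains (illustrative only).
class Domain:
    """One isolated partition of the system."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True

    def run(self, task):
        try:
            return task()
        except Exception:
            # Fail-safe boundary: the fault brings down only this domain.
            self.alive = False
            return None

# A system of ten domains; a severe fault occurs in one of them.
domains = [Domain(f"d{i}") for i in range(10)]
domains[3].run(lambda: 1 / 0)           # fault confined to domain d3
survivors = sum(d.alive for d in domains)
print(survivors)  # 9 -- the remaining nine domains continue to operate
```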
Traditional clusters use message-based communication between the domains or nodes, and thus require software to ensure reliable communications over potentially unreliable links (e.g., the Transmission Control Protocol/Internet Protocol, or TCP/IP), and to maintain some degree of coherence between the software resources on different nodes (e.g., the Network File System, or NFS).
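The software overhead of message-based communication can be illustrated with a small sketch (the message format and operation names here are hypothetical): every cross-node operation must be marshalled into a message, carried over a link, and unmarshalled on the other side, work that a direct shared-memory access would not incur.

```python
# Hypothetical sketch of message-based internode communication.
import json
import queue

link = queue.Queue()  # stands in for a (potentially unreliable) network link

def send(msg: dict) -> None:
    """Marshal a request into a message and put it on the link."""
    link.put(json.dumps(msg))

def receive() -> dict:
    """Take a message off the link and unmarshal it."""
    return json.loads(link.get())

# A node asks another node for a block of a file (illustrative operation).
send({"op": "read_block", "file": "/data/x", "offset": 0})
request = receive()
print(request["op"])  # read_block
```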
The clustering approach provides good isolation between the domains or nodes, and therefore good overall system availability. However, this approach has the disadvantages of increased cost from complexity, lower performance (since applications using resources spanning several nodes must bear the hardware and software overheads of internode communications), and the lack of a single system image.
Another approach is to modify the clusters for memory sharing among the nodes. This moves the cluster software mechanisms for resource coherence and communication down into the cache-coherence hardware, and thus exploits the existing hardware capabilities more fully, while simplifying the software and improving performance. However, this reintroduces the problems of poor reliability from hardware and software failures, discussed earlier.
There are several approaches to resolving the above problems. One way is to have a very reliable system and endure the performance sacrifices. This approach divides the system into independent subsets of processors and memories, and treats each of the subsets as though it were a separate system. This approach, taken by Sun in its initial offering of the Ultra Enterprise 10000 multiprocessor (64 CPUs), allows existing cluster software solutions to be used, but fails to capitalize on the potential of the shared memory. Moreover, this approach does not present a single system image, as by definition the large system has been separated into smaller, independent systems. This approach also makes system maintenance, such as granting user accounts and privileges to the different users, maintaining file systems, performing system accounting, and maintaining and updating applications, extremely difficult. Each subset would have to be individually maintained.
Another approach is exemplified by the Solaris MC system, as detailed in the paper entitled "Solaris MC: A Multi-Computer OS" by Khalidi et al. of Sun Microsystems Laboratories, Technical Report SMLI TR-95-48, November 1995. In this approach, the system is partitioned into subsets and relies on traditional message-based communication between the subsets, using shared memory as the transport mechanism rather than a network. This approach has the advantage of being able to use existing cluster-based operating system code with improved performance, since the shared memory provides better performance than a network transport. However, this approach fails to exploit the potential for sharing memory objects between subsets. Solaris MC mitigates this weakness somewhat with the "bulkio no-copy" protocol, which improves performance somewhat, but it still requires explicit use of such mechanisms as the PXFS coherency protocol, which introduces more complexity.
Solaris MC operates in a manner similar to the traditional clustered approach, in which the software domains are very loosely coupled and use messaging exclusively for communication between the nodes or domains. Since the operating system itself does not really share memory between the domains, Solaris MC overcomes the reliability problem by simply not allowing sharing of memory between domains. However, this approach does not overcome the other problems of limited system performance and the lack of a single system image. Essentially, Solaris MC uses messaging to gain high system reliability, and modestly increases system performance by using shared memory, rather than a network, as the transport mechanism that carries the messages between the nodes. This approach also suffers from some of the same maintenance problems as the Sun Ultra Enterprise system, as discussed above.
Another approach is exemplified by the Hive system, as detailed in the paper entitled "Hive: Fault Containment for Shared-Memory Multiprocessors", by Chapin et al., Stanford University, http://www-flash.stanford.edu. The Hive operating system exposes more of the coherent shared memory to the operating system and applications, but requires a set of protocols to overcome the unreliability of the system. These software protocols segment the domains and contain software and hardware faults, and thus compensate for the unreliability of the underlying hardware and software. These protocols include nontransparent proxies, which are proxies that the operating system programmer must be aware are being used.
Essentially, the Hive system uses shared memory between the different domains or nodes, and imposes the set of protocols to prevent the system from crashing if a failure occurs in one part of the shared memory, in a processor in another domain, or in software in another domain. Thus, the Hive system has good reliability because of the protocols, and good performance characteristics because it uses shared memory.
However, Hive has a problem in that the protocols introduce a high level of complexity into the operating system. These mechanisms are distributed throughout the kernel, or core, of the operating system, consequently requiring kernel programmers to be aware of the hardware fault mechanisms, as well as which data structures are sharable and which protocol is necessary to share them. For example, remote procedure calls (RPCs) are difficult when the call involves locking resources in the remote node at interrupt level. Distributed data structures may be used for anonymous page management; however, they require the programmer to be aware that the "remote" parts of those structures can be read but not written. Also, the "careful" protocol for reading remote nodes requires agreement as to the meanings of tags and consistency between two different nodes.
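The read-but-not-write constraint on remote parts of a distributed data structure can be sketched as follows (this is an illustrative model of the programmer burden described above, not Hive's actual code; the class and field names are hypothetical):

```python
# Illustrative sketch: a distributed structure whose "remote" parts may
# be read but never written, so every programmer must track locality.
class DistributedPageList:
    """A per-node page list with a designated home node."""

    def __init__(self, home_node: int, current_node: int) -> None:
        self.home_node = home_node
        self.current_node = current_node
        self.pages = []

    def is_local(self) -> bool:
        return self.home_node == self.current_node

    def read(self) -> list:
        # Remote reads are permitted (with care).
        return list(self.pages)

    def append(self, page: str) -> None:
        # Writes are legal only on the home node; the programmer must
        # remember to check before every update.
        if not self.is_local():
            raise RuntimeError("write to remote part of structure forbidden")
        self.pages.append(page)

local = DistributedPageList(home_node=0, current_node=0)
local.append("page0")                     # legal local write
remote = DistributedPageList(home_node=1, current_node=0)
print(remote.read())                      # remote read is permitted: []
```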
Consequently, Hive requires that all parts of the operating system be aware of the structure of the hardware, particularly that it is divided into domains, of where data resides, and of when these different protocols must be invoked. Thus, the whole operating system has to be rewritten and extensively modified to accommodate these protocols. Moreover, the problem of using the Hive protocols is not limited to the difficulty of initial modification, but also involves the larger issue of maintenance. Since these protocol mechanisms are distributed throughout the operating system, the operating system is more complex and more difficult to maintain. Each programmer charged with maintenance of the operating system must know and understand how the protocols interact with the rest of the operating system.
Therefore, there is a need in the art for a large-scale computer system that uses shared memory between the different domains, and thus has good performance characteristics, that achieves good reliability through failure-containment mechanisms, and yet is relatively easy to maintain, while presenting a single system image to the software layers above it.