The present invention relates generally to distributed or cluster computing systems and processes. More particularly, the present invention relates to fault tolerant, scaleable, cluster computing systems and processes operating within given time constraints.
Cluster computing represents a compromise between the manageability, power, and ease of use of centralized uniprocessor systems and the reliability, fault-tolerance, and scalability of distributed systems. A cluster comprises a set of workstations or personal computers connected by a high-speed local area network. Thus, a cluster replaces a uniprocessor with a set of commodity computers meshed by a software backplane (that is, the logical software equivalent of a wired backplane wherein all the electronic printed circuit cards or modules are interconnected). A cluster is almost as easy to manage as a uniprocessor because all the members of the cluster are homogeneous and centrally administered. Moreover, it is easy to use because a cluster appears to users as a single powerful computer with a single file system and a single set of applications. Nevertheless, being a distributed system, a cluster offers the power, scalability, reliability, upgradability and fault-tolerance characteristic of such distributed systems. These characteristics give clusters a great advantage over uniprocessor systems.
While cluster computing systems and methods are known, they generally lack the ability to meet response-time bounds.
Providing both fault tolerance and guaranteed completion time properties simultaneously is not trivial. The traditional approach to fault tolerance is to use tightly coupled, hardware-based fault-tolerant computer systems, such as those manufactured by Stratus™ and Tandem™. The hardware approach suffers from at least three substantial problems. First, although this approach allows transparent masking of hardware faults, the system cannot tolerate software failures, which remain a source of downtime in many critical settings. Second, administrators cannot perform 'hot' upgrades (upgrades made while the system is running) to software or hardware on such systems. Third, fault-tolerant hardware often lags the price/performance curve of commodity computers by several years.
The above tightly coupled hardware fault-tolerant computers have an advantage, however, of preserving the response characteristics of applications executed upon them. If an application is designed to respond to some class of requests within a time bound, for example 100 ms, a tightly coupled hardware-based fault-tolerant platform will preserve that response time. In contrast, prior art distributed cluster fault-tolerance solutions are slow to respond in general, and are often slow to detect and react to failures, so that they rarely meet time bounds, especially in the presence of failures. For example, a classical solution would be to require applications to execute a Byzantine agreement protocol (highly complex and redundant) to mask software faults, imposing a significant computational and communication burden on the cluster.
Another limitation of prior art systems is their inability to scale to accommodate larger numbers of networked computers (QE's). These networked computers are used in conjunction with an External Adaptor (EA), or front end computer, which connects to an external communications network on one side and to the networked computers in parallel (either by a bus or by separate communication lines) on the other side. Scaling is important since the telecommunications industry hopes to build memory-mapped databases containing hundreds of millions of subscriber records. This scalability limitation is illustrated, for example, in cluster systems implementing a telephone switching (SS7) protocol where external requests are to be uniformly distributed among the QE's. The EA handles incoming requests that are batched (if large numbers of requests are received) for processing at the QE's. Thus, the workload on the EA rises in direct proportion to the number of QE's. When a protocol like SS7 is implemented, timing restrictions apply. In such an instance, these timing requirements act to limit the number of QE's that can be handled by an EA to 8 to 12 QE's, where handling fifty or more QE's may be required.
Herein, and in the art, several terms are used interchangeably. The "cluster" refers to the entire distributed computing system connected to an external communications network, e.g. the Internet or an Ethernet. Cluster will be the term of choice. The parts of the cluster that connect directly to the external communications network are called the "front end", or the external adaptors, or EA computers, or EA's. Hereinafter, EA will be the term of choice. The part of the cluster that performs the computation and other tasks is referred to as the "back end", or the networked computers, or the query elements, or QE computers, or QE's. Hereinafter, QE will be the term of choice. Another term, "time-delay constrained," is interchangeable with "time delay bound," "time to respond", and other combinations, but the meaning of the terms will be clear from the context to those skilled in the art.
Herein, alternate terms listed above may be used, or other terms, such as "group", may be used, wherein such use will be clear from the context.
It is therefore an object of the present invention to provide a cluster computing system and method that is fault tolerant.
It is another object of the present invention to retain the advantages of cluster computing, while gaining the fault-tolerance and timely responsiveness of a uniprocessor and/or hardware system solution.
It is yet another object of the present invention to provide a fault tolerant cluster computing system and method that completes a response or computation even if one or more of the components of a cluster fails.
It is still another object of the present invention to provide a fault tolerant system and method that is scaleable.
The present invention meets the foregoing objects in apparatuses and methods for designing practical, reliable, time delay-constrained (time bound) cluster computing systems, in which a limited number of EA's (typically two) interact with the outside world, directing client requests to the QE's, and then relaying the replies back from the QE's to the clients.
Cluster computing is naturally suited to systems that perform large numbers of independent (or nearly independent) small computations. The cluster is partitioned into one or more EA's and multiple QE's. The EA's isolate and hide the rest of the cluster from the external network and provide load balancing. Fault-tolerance requires replication of EA functionality.
In a preferred embodiment of the invention, a client computer contacts one of the EA's with a request via a communications network. The EA forwards the request to one of the QE's where a reply is generated for sending back to the client via the EA. The QE selected is determined by: the capabilities of computers that comprise the QE's, the current load distribution on the QE's, and the expected time bound for handling the request. With a reliable time delay-bound cluster, the reply is generated within a time delay bound despite failures in the computers comprising the EA's and the QE's.
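By way of illustration only, the selection of a QE from its capability, its current load, and the time bound can be sketched as follows. The `QE` record, its `rate` and `queued` fields, the delay estimate `(queued + 1) / rate`, and the fallback rule are all hypothetical; the text above names the three inputs but does not prescribe a formula.

```python
from dataclasses import dataclass


@dataclass
class QE:
    """Hypothetical view an EA might keep of one query element."""
    name: str
    rate: float   # requests per second the machine can serve (capability)
    queued: int   # requests currently outstanding on it (load)

    def estimated_delay(self) -> float:
        # Assumed estimate: time to drain the queue plus the new request.
        return (self.queued + 1) / self.rate


def select_qe(qes, time_bound):
    """Pick the QE with the lowest estimated delay among those expected
    to meet time_bound (seconds). If none is expected to meet it, fall
    back to the lowest estimate overall (an assumption of this sketch)."""
    if not qes:
        raise ValueError("no QE available")
    within_bound = [q for q in qes if q.estimated_delay() <= time_bound]
    return min(within_bound or qes, key=QE.estimated_delay)
```

A caller would pass the EA's current view of the QE's together with the bound for the request class, for example `select_qe(qes, 0.1)` for a 100 ms bound.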
In another preferred embodiment of the present invention, the EA's communicate with the outside world, with each other and with the QE's. Means are also provided for the QE's to communicate with each other. In this embodiment, the QE's are logically divided (that is, the QE's are not physically separated) into at least two sets of lists, one set of lists for each EA. A list is a grouping of a number of QE's. The lists within a set are non-overlapping, such that each QE appears in only one list within a set of lists. The sets of lists are arranged such that, when comparing a list from one set with a list from the other set, the overlap contains at most one QE. This arrangement provides at least two logically distinct routing paths between the EA's and every QE.
In another preferred embodiment the lists are selected by calculating the closest integers above and below the square root of the number of QE's. For example, with two EA's and twenty QE's, the closest integer below the square root of twenty is four (the square being sixteen), and the closest integer above it is five (the square being twenty-five), thus bracketing twenty. The result is that the twenty QE's are divided into two sets of lists. One set has four lists with five QE's in each list and the other has five lists with four QE's in each list. However, in other preferred embodiments, some overlapping among the lists may be used.
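One way to realize such an arrangement, offered only as a sketch, is to lay the QE's out in a grid and take the rows as one set of lists and the columns as the other. The function names and the grid construction are assumptions for illustration; for QE counts that do not fill the grid exactly, the last row is simply shorter.

```python
import math


def build_list_sets(num_qes):
    """Split QE indices 0..num_qes-1 into two sets of lists (grid rows
    and grid columns). Within a set the lists do not overlap."""
    if num_qes < 1:
        raise ValueError("need at least one QE")
    low = math.isqrt(num_qes)                        # integer at or below sqrt
    width = low if low * low == num_qes else low + 1  # integer at or above sqrt
    qes = list(range(num_qes))
    row_lists = [qes[i:i + width] for i in range(0, num_qes, width)]
    col_lists = [qes[c::width] for c in range(width)]
    return row_lists, col_lists


def max_cross_overlap(set_a, set_b):
    """Largest number of QE's shared by a list of set_a and a list of set_b."""
    return max(len(set(a) & set(b)) for a in set_a for b in set_b)


rows, cols = build_list_sets(20)
# Twenty QE's: four lists of five for one EA, five lists of four for the other.
print(len(rows), [len(r) for r in rows])
print(len(cols), [len(c) for c in cols])
print(max_cross_overlap(rows, cols))
```

Because a row and a column of a grid meet in at most one cell, any list from one set shares at most one QE with any list from the other set, which is the property required above.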
Four specific aspects of designing such clusters are addressed by the present invention and preferred embodiments thereof: a) providing bounded response time in the presence of failures, b) achieving high throughput, c) providing scalability, and d) managing the cluster. Cluster management is accomplished by communicating with the cluster (the EA's and QE's) as a whole (group communication). This group communication keeps track of membership changes (new and/or discontinued EA's and QE's), detects failures in the cluster, and automatically reconfigures the system when new QE's or EA's are added. However, in order to maintain good performance, group communication is used only for control messages, and is kept out of the time delay bound path of the request and reply messages. Time delay bound response is achieved by using a primary-backup approach, in which the backup is activated after half of the allowed time has expired without a reply rather than waiting for the system to detect the failure. In order to implement this primary/backup approach, there are at least two EA's, with one backing up the other. In a similar manner, if a QE does not respond in half the allowed time, the EA will send the request to another QE.
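The half-time rule for a QE can be sketched as follows. This is a minimal illustration, not the claimed implementation: `request_with_backup`, the use of a thread pool, and the modelling of a QE as a plain callable are assumptions, and a failed QE is modelled as one that raises or does not answer in time.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def request_with_backup(primary, backup, request, time_bound, pool):
    """Send request to primary; if no usable reply arrives within half of
    time_bound (seconds), also send it to backup. Return the first usable
    reply, or raise TimeoutError once time_bound has elapsed."""
    start = time.monotonic()
    pending = {pool.submit(primary, request)}

    # Phase 1: give the primary half of the allowed time.
    done, pending = wait(pending, timeout=time_bound / 2)
    for fut in done:
        if fut.exception() is None:
            return fut.result()

    # Phase 2: activate the backup without waiting for failure detection.
    # (If the primary raised early, the backup starts before half-time.)
    pending.add(pool.submit(backup, request))
    while pending:
        remaining = time_bound - (time.monotonic() - start)
        if remaining <= 0:
            break
        done, pending = wait(pending, timeout=remaining,
                             return_when=FIRST_COMPLETED)
        if not done:
            break
        for fut in done:
            if fut.exception() is None:
                return fut.result()
    raise TimeoutError("no reply within the time bound")
```

The late reply of a slow primary is still accepted if it arrives before the backup's reply, so the sketch never discards a usable answer. The same pattern would apply between the two EA's, with one EA taking over a request when the other has not relayed a reply by half-time.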
An advantage of the present invention is based on the observation that, in prior art clusters, the EA's can be a bottleneck to adding (scaling up) QE's to clusters due to the overhead associated with the use of many QE's. The present invention achieves high throughput and scalability by combining message batching techniques with data dissemination algorithms that push some of the communication overhead from the EA's to the QE's.
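One plausible reading of this combination, given purely as an assumption-laden sketch, is that an EA groups requests into batches and hands each batch to a single QE per list, which then relays it to the other QE's of that list. The names `make_batches` and `relay_plan`, and the choice of the first QE in each list as the relay, are hypothetical and are not taken from the text above.

```python
from itertools import islice


def make_batches(requests, batch_size):
    """Group incoming requests into batches of at most batch_size."""
    it = iter(requests)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def relay_plan(ea, qe_lists):
    """Return the (sender, receiver) pairs needed to deliver one batch to
    every QE when the EA sends only to the head of each list and the head
    forwards the batch to the remaining QE's of its list."""
    sends = []
    for qe_list in qe_lists:
        head, rest = qe_list[0], qe_list[1:]
        sends.append((ea, head))                 # the only send made by the EA
        sends.extend((head, qe) for qe in rest)  # relayed by a QE
    return sends


# Twenty hypothetical QE's arranged as four lists of five.
lists = [[f"QE{r * 5 + c}" for c in range(5)] for r in range(4)]
plan = relay_plan("EA0", lists)
ea_sends = sum(1 for sender, _ in plan if sender == "EA0")
print(ea_sends, len(plan))
```

Under these assumptions the EA's send count per batch grows with the number of lists rather than with the number of QE's, while the total number of messages is unchanged; the remaining sends are carried by the QE's.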
A further advantage of the present invention is the ability to meet a time-to-respond requirement in the presence of one or more cluster component failures. The present invention substantially guarantees the completion time of replies to requests, even in the presence of failures.
Other objects, features and advantages will be apparent from the following detailed description of preferred embodiments thereof taken in conjunction with the accompanying drawings in which: