The present invention relates to resource management in a system of interconnected computers, and more particularly to the monitoring and allocation of cluster nodes, cluster memory, and other cluster computing resources.
Those portions of U.S. Patent Application No. 60/038,251 filed Feb. 21, 1997 which describe previously known computer system components and methods are incorporated herein by this reference. These incorporated portions relate, without limitation, to specific hardware such as processors, communication interfaces, and storage devices; specific software such as directory service providers and the NetWare operating system (NETWARE is a mark of Novell, Inc.); specific methods such as TCP/IP protocols; specific tools such as the C and C++ programming languages; and specific architectures such as NORMA, NUMA, and ccNUMA. In the event of a conflict, the text herein which is not incorporated by reference shall govern. Portions of the '251 application which are claimed in this or any other Novell patent application are not incorporated into this technical background.
Clusters
A cluster is a group of interconnected computers which can present a unified system image. The computers in a cluster, which are known as the “cluster nodes”, typically share a disk, a disk array, or another nonvolatile memory. Computers which are merely networked, such as computers on the Internet or on a local area network, are not a cluster because they necessarily appear to users as a collection of connected computers rather than a single computing system. “Users” may include both human users and application programs. Unless expressly indicated otherwise, “programs” includes programs, tasks, threads, processes, routines, and other interpreted or compiled software.
Although every node in a cluster might be the same type of computer, a major advantage of clusters is their support for heterogeneous nodes. As an unusual but nonetheless possible example, one could form a cluster by interconnecting a graphics workstation, a diskless computer, a laptop, a symmetric multiprocessor, a new server, and an older version of the server. Advantages of heterogeneity are discussed below.
To qualify as a cluster, the interconnected computers must present a unified interface. That is, it must be possible to run an application program on the cluster without requiring the application program to distribute itself between the nodes. This is accomplished in part by providing cluster system software which manages use of the nodes by application programs.
In addition, the cluster typically provides rapid communication between nodes. Communication over a local area network is sometimes used, but faster interconnections are much preferred. Compared to a local area network, a cluster system area network has much lower latency and much higher bandwidth. In that respect, system area networks resemble a bus. But unlike a bus, a cluster interconnection can be plugged into computers without adding signal lines to a backplane or motherboard.
Clustering Goals
Clusters may improve performance in several ways. For instance, clusters may improve computing system availability. “Availability” refers to the availability of the overall cluster for use by application programs, as opposed to the status of individual cluster nodes. Of course, one way to improve cluster availability is to improve the reliability of the individual nodes.
However, at some point it becomes cost-effective to use less reliable nodes and swap nodes out when they fail. A node failure should not interfere significantly with an application program unless every node fails; if it must degrade, then cluster performance should degrade gracefully. Clusters should also be flexible with respect to node addition, so that applications benefit when a node is restored or a new node is added. Ideally, the application should run faster when nodes are added, and it should not halt when a node crashes or is removed for maintenance or upgrades.
Adaptation to changes in node presence provides benefits in the form of increased heterogeneity, improved scalability, and better access to upgrades. Heterogeneity allows special purpose computers such as digital signal processors, massively parallel processors, or graphics engines to be added to a cluster when their special abilities will most benefit a particular application, with the option of removing the special purpose node for later standalone use or use in another cluster. Heterogeneity also allows clusters to be formed using presently owned or leased computers, thereby increasing cluster availability by reducing cost and delay. Scalability allows cluster performance to be incrementally improved by adding new nodes as one's budget permits. The ability to add heterogeneous nodes also makes it possible to add improved hardware and software incrementally.
Clusters may also be flexible concerning the use of whatever nodes are present. For instance, some applications will benefit from special purpose nodes such as digital signal processors or graphics engines. Ideally, clusters support three types of application software: applications that take advantage of special purpose nodes, applications that view all nodes as more or less interchangeable but are nonetheless aware of individual nodes, and applications that view the cluster as a single unified system. “Cluster-aware” applications include distributed database programs that expect to run on a cluster rather than a single computer. Cluster-aware programs often influence the assignment of tasks to individual nodes, and typically control the integration of computational results from different nodes.
The following situations illustrate the importance of availability and other cluster performance goals. The events described are either so frequent or so threatening (or both) that they should not be ignored when designing or implementing a cluster architecture.
Software Node Crash
Software errors, omissions, or incompatibilities may bring to a halt any useful processing on a node. The goal of maintaining cluster availability dictates rapid detection of the crash and rapid compensation by either restoring the node or proceeding without it. Detection and compensation may be performed by cluster system software or by a cluster-aware application. Debuggers may also be used by programmers to identify the source of certain problems. Sometimes a software problem is “fixed” by simply rebooting the node. At other times, it is necessary to install different software or change the node's software configuration before returning the node to the cluster. It will often be necessary to restart the interrupted task on the restored node or on another node, and to avoid sending further work to the node until the problem has been fixed.
Hardware Node Crash
Hardware errors or incompatibilities may also prevent useful processing on a node. Once again, availability dictates rapid detection of the crash and rapid compensation, but in this case compensation often means proceeding without the node.
In many clusters, working nodes send out a periodic “heartbeat” signal. Problems with a node are detected by noticing that regular heartbeats are no longer coming from the node. Although heartbeats are relatively easy to implement, they continually consume processing cycles and bandwidth. Moreover, the mere lack of a heartbeat signal does not indicate why the silent node failed; the problem could be caused by node hardware, node software, or even by an interconnect failure.
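By way of illustration only, heartbeat-based detection of the kind just described may be sketched as follows. The class name `HeartbeatMonitor`, the `timeout` parameter, and the node identifiers are hypothetical and are not taken from any particular cluster; timestamps are passed in explicitly so that the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the most recent heartbeat per node and reports silent nodes.

    Illustrative sketch only; the timeout policy is an assumption.
    """

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence tolerated before suspicion
        self.last_seen = {}      # node id -> time of most recent heartbeat

    def record(self, node, now):
        """Note that a heartbeat arrived from `node` at time `now`."""
        self.last_seen[node] = now

    def silent_nodes(self, now):
        """Return, in sorted order, nodes whose last heartbeat is too old."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=2.0)           # node-a keeps beating; node-b goes quiet
suspects = monitor.silent_nodes(now=4.0)    # only node-b has exceeded the timeout
```

The sketch reports only that a node has fallen silent. It cannot distinguish a hardware crash, a software crash, or an interconnect failure, which is the limitation noted above.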
Interconnect Failure
If the interconnection between a node and the rest of the cluster is unplugged or fails for some other reason, the node itself may continue running. If the node might still access a shared disk or other sharable resource, the cluster must block that access to prevent “split brain” problems (also known as “cluster partitioning” or “sundered network” problems). Unless access to the shared resource is coordinated, the disconnected node may destroy data placed on the resource by the rest of the cluster.
Accordingly, many clusters connect nodes both through a high-bandwidth low-latency system area network and through a cheaper and less powerful backup link such as a local area network or a set of RS-232 serial lines. The system area network is used for regular node communications; the backup link is used when the system area network interconnection fails. Unfortunately, adding a local area network that is rarely used reduces the cluster's cost-effectiveness. Moreover, serial line protocols used by different nodes are sometimes inconsistent with one another, making the backup link difficult to implement.
Sharable Resource Reallocation
Sharable resources may take different forms. For instance, shared memory may be divided into buffers which are allocated to different nodes as needed, with the unallocated buffers kept in a reserve “pool”. In some clusters, credits that can be redeemed for bandwidth, processing cycles, priority upgrades, or other resources are also allocated from a common pool.
Nodes typically have varying needs for sharable resources over time. In particular, when a node crashes or is intentionally cut off from the cluster to prevent split-brain problems, the shared buffers, credits, and other resources that were allocated to the node are no longer needed; they should be put back in the pool or reallocated to working nodes. Many clusters do this by locking the pool, reallocating the resources, and then unlocking the pool. Locking the pool prevents all nodes except the allocation manager from accessing the allocation lists while they are being modified, thereby preserving the consistency of the lists. Locking is implemented using a mutex or semaphore. Unfortunately, locking reduces cluster performance because it may block processing by all nodes.
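The lock, reallocate, and unlock sequence described in the preceding paragraph may be sketched, under stated assumptions, as follows. The names `BufferPool`, `allocate`, and `reclaim` are hypothetical, and a single-process `threading.Lock` merely stands in for whatever cluster-wide mutex or semaphore a given cluster would use.

```python
import threading


class BufferPool:
    """A shared pool of buffers guarded by one lock (illustrative sketch only)."""

    def __init__(self, buffer_ids):
        self._lock = threading.Lock()   # stand-in for a cluster-wide mutex
        self._free = list(buffer_ids)   # unallocated buffers held in reserve
        self._owned = {}                # node id -> buffers allocated to it

    def allocate(self, node):
        """Give one free buffer to `node`, or return None if the pool is empty."""
        with self._lock:
            if not self._free:
                return None
            buf = self._free.pop()
            self._owned.setdefault(node, []).append(buf)
            return buf

    def reclaim(self, failed_node):
        """Return every buffer held by `failed_node` to the pool."""
        with self._lock:                # all other allocators wait here
            released = self._owned.pop(failed_node, [])
            self._free.extend(released)
            return len(released)

    def free_count(self):
        with self._lock:
            return len(self._free)


pool = BufferPool(range(4))
pool.allocate("node-a")
pool.allocate("node-b")
pool.allocate("node-b")
reclaimed = pool.reclaim("node-b")      # node-b crashed or was cut off
```

While `reclaim` holds the lock, every call to `allocate` from any node is blocked, which is the performance cost noted above.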
Summary
In short, improvements to cluster resource management are needed. For instance, it would be an advance in the art to distinguish further between different causes of cluster node failure. It would also be an advance to provide a way to coordinate shared resource access when an interconnect fails without relying on a local area network or a serial link. In addition, it would be an advance to reallocate sharable resources without interrupting work on all nodes. Such improved systems and methods are disclosed and claimed herein.
The present invention provides methods, systems, and devices for resource management in clustered computing systems. The invention aids rapid, detailed diagnosis of communication problems, thereby promoting rapid and correct compensation by the cluster when a communication failure occurs.
When a node or part of a system area network becomes inoperative, remote probing retrieves either a value identifying the problem or an indication that the remote memory is inaccessible; verifying inaccessibility also aids in problem diagnosis. In various embodiments the retrieved value may include a counter, a validation value, a status summary, an epoch which is incremented (or decremented) by each restart or each reboot, a root pointer that bootstraps higher level communication with other cluster nodes, and a message area that provides additional diagnostic information.
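As a hedged illustration, a probed value of the kind listed above might be modeled and interpreted as follows. The field layout, the constant `VALID_MARK`, and the diagnostic rules in `diagnose` are assumptions made for this sketch and are not the claimed structure; in particular, treating an unchanged counter as a stalled node is an assumed policy, and the root pointer is omitted for brevity. An inaccessible remote memory is represented by `None`.

```python
from dataclasses import dataclass
from typing import Optional

VALID_MARK = 0xC1A5  # hypothetical validation value marking an initialized probe area


@dataclass
class ProbeRecord:
    counter: int       # assumed to advance while the remote node makes progress
    validation: int    # compared against VALID_MARK
    status: str        # short status summary
    epoch: int         # changes on each restart or reboot
    message: str = ""  # optional additional diagnostic text


def diagnose(previous: Optional[ProbeRecord], current: Optional[ProbeRecord]) -> str:
    """Compare two successive probes of one node and name the likely condition."""
    if current is None:
        return "remote memory inaccessible"
    if current.validation != VALID_MARK:
        return "probe area invalid"
    if previous is not None and current.epoch != previous.epoch:
        return "node restarted"
    if previous is not None and current.counter == previous.counter:
        return "node stalled"   # assumed policy: no progress since the last probe
    return "node alive"


before = ProbeRecord(counter=10, validation=VALID_MARK, status="ok", epoch=1)
after_reboot = ProbeRecord(counter=0, validation=VALID_MARK, status="ok", epoch=2)
verdict = diagnose(before, after_reboot)
```

Because both a retrieved value and a verified inability to retrieve one yield a distinct outcome, a caller can choose a different compensating step for each.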
Remote memory probing allows the system to more effectively select between different compensating steps when an error condition occurs. One of the most potentially damaging problems is a “split brain”. This occurs when two or more nodes cannot communicate to coordinate access to shared storage. Thus, a significant risk arises that the nodes will corrupt data in their shared storage area. In some embodiments, the invention uses an emergency message location on a shared disk to remove the failed node from the cluster while allowing the failed node to be made aware of its status and thus prevent data corruption. The remaining active nodes may also coordinate their behavior through the emergency message location. When a node is disconnected from a cluster, the invention provides methods that make reduced use of locks by coordinating locking with interrupt handling to release the global resources that were previously allocated to the node. These methods also provide an improved system to reallocate resources throughout the cluster. Other features and advantages of the present invention will become more fully apparent through the following description.
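Purely as an illustrative sketch, the use of an emergency message location mentioned in the summary above may be pictured as follows. The class `EmergencyMessageArea` and the function `may_write_shared_storage` are hypothetical, the shared disk location is simulated in memory, and the check-before-write policy is an assumption rather than the disclosed method.

```python
class EmergencyMessageArea:
    """In-memory stand-in for a fixed location on a shared disk (assumption)."""

    def __init__(self):
        self._removed = set()   # nodes that the active nodes have removed

    def post_removal(self, node):
        """Called by an active node to record that `node` has left the cluster."""
        self._removed.add(node)

    def is_removed(self, node):
        """Called by any node to learn its own status or another node's status."""
        return node in self._removed


def may_write_shared_storage(node, area):
    """Assumed policy: a node consults the area before touching shared data."""
    return not area.is_removed(node)


area = EmergencyMessageArea()
area.post_removal("node-c")                             # interconnect to node-c failed
allowed_c = may_write_shared_storage("node-c", area)    # node-c learns it was removed
allowed_a = may_write_shared_storage("node-a", area)    # node-a remains a member
```

In this sketch the disconnected node discovers its removal through the shared location itself, so no backup local area network or serial link is needed to deliver the notice.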