A cluster of computers is a group of interconnected computers, which can present a unified system image. The computers in a cluster, which are known as the “cluster nodes”, typically share a disk, a disk array, or another nonvolatile memory. Computers which are merely networked, such as computers on the Internet or on a local area network, are not a cluster because they necessarily appear to users as a collection of independent connected computers rather than a single computing system. “Users” may include both human users and application programs. Unless expressly indicated otherwise, “programs” includes computer programs, tasks, threads, processes, routines, and other interpreted or compiled computer software.
Although every node in a cluster might be the same type of computer, a major advantage of clusters is their support for heterogeneous nodes. One possible example is an interconnection of a graphics workstation, a diskless computer, a laptop, a symmetric multiprocessor, a new server, and an older version of the server. Advantages of heterogeneity are discussed below. To qualify as a cluster, the interconnected computers must present a unified interface. That is, it must be possible to run an application program on the cluster without requiring the application program to distribute itself between the nodes. This is accomplished in part by providing cluster system software which manages use of the nodes by application programs.
In addition, the cluster typically provides rapid peer to peer communication between nodes. Communication over a local area network is sometimes used, but faster interconnections are much preferred. Compared to a local area network, a cluster area network usually has a much lower latency and much higher bandwidth. In that respect, cluster area networks resemble a bus. But unlike a bus, a cluster interconnection can be plugged into computers without adding signal lines to a backplane or motherboard.
Clusters may improve performance in several ways. For instance, clusters may improve computing system availability. “Availability” refers to the availability of the overall cluster for use by application programs, as opposed to the status of individual cluster nodes. Of course, one way to improve cluster availability is to improve the reliability of the individual nodes.
However, at some point it becomes cost-effective to use less reliable nodes and replace nodes when they fail. A node failure should not interfere significantly with an application program unless every node fails; if it must degrade, then cluster performance should degrade gracefully. Clusters should also be flexible with respect to node addition, so that applications benefit when a node is restored or a new node is added. Ideally, the application should run faster when nodes are added, and it should not halt when a node crashes or is removed for maintenance or upgrades. Adaptation to changes in node presence provides benefits in the form of increased heterogeneity, improved scalability, and better access to upgrades. Heterogeneity allows special purpose computers such as digital signal processors, massively parallel processors, or graphics engines to be added to a cluster when their special abilities will most benefit a particular application, with the option of removing the special purpose node for later standalone use or use in another cluster. Heterogeneity allows clusters to be formed using presently owned or leased computers, thereby increasing cluster availability and reducing cost. Scalability allows cluster performance to be incrementally improved by adding new nodes as one's budget permits. The ability to add heterogeneous nodes also makes it possible to add improved hardware and software incrementally.