1. Field of the Invention
This invention relates to clusters of computers and, more particularly, to control network apparatus and methods for clusters of computers enabling easy administration and economic operation.
2. History of the Prior Art
Computers have developed along a number of different but similar lines. In general, each such line has begun with a relatively simple processor capable of manipulating bits of information stored in some particular format. Storage for control software and data being manipulated is provided. Circuitry for providing input and output to the processor and for viewing and controlling the operation is also provided.
As the hardware for each type of digital computer is being developed to a useful state, various forms of software are usually being developed to make use of its capabilities. When one generation of software proves economically useful, more software is developed to make use of more of the capabilities of the computer hardware. When the software has stretched the capability of the hardware to its limits, the hardware must be improved and memory increased so that more, larger, and more capable programs may be run. With each new development, additional uses are visualized and newer generations of the computer are developed. This increase in computer capabilities seems to take place whatever the particular computer type may be until that type of computer reaches some practical limit.
Recently, even the most advanced computer architectures seem to have been developed to a point at which increases in their capabilities do not provide an increased return in overall proficiency. For example, in order for a typical processor to handle more information faster, the number of transistors utilized by the processor and its memory is typically increased. This requires putting more transistors on the processor chip and placing the various components closer together. A fourfold increase in the number of processing transistors, along with a commensurate increase in local memory, is generally thought to increase speed of performance by only ten to fifteen percent. Theoretically, a larger number of smaller transistors with shorter interconnections may be operated more rapidly with the expenditure of less power along the shorter current paths. However, the larger numbers of paths and transistor devices operating more rapidly expend more power; and a point seems to be rapidly approaching (or to have been reached already with some architectures) at which the proximity of the transistor devices and associated connecting circuitry increases interference and current leakage to a point at which overall operation deteriorates.
Various architectural changes have been attempted to obviate this limiting difficulty. Newer designs have tended to utilize a large number of processors which share the internal memory and other components of a single computer. Utilizing a number of processors tends to reduce the need to place so many transistors on a single chip thereby reducing individual processor complexity. This method of approaching the problem seems to work but only up to a limit; then a new set of problems arises. More particularly, the ability to control the access by a large number of processors to common memory reaches a limit fairly rapidly. Consequently, this method of development also appears to present an architectural dead end.
Another approach which has been taken to overcome the limitations posed by the known computer architectures is called clustering. In clustering, a large number of what may be relatively unsophisticated computers are joined by switches and cabling in a form of network by which those computers may share data. Then an operating system is provided by which all of the individual computers may cooperate in handling large problems. Clustering offers a number of advantages. It allows controlling software to assign individual portions of a particular operation being undertaken to individual computers of the cluster, those portions to be handled by those individual computers, and the results of the individual portions to be furnished to the other computers of the cluster when they become available. This essentially allows a large operation to be broken into smaller operations which can be conducted in parallel.
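By way of illustration only, the division of a large operation into portions handled in parallel may be sketched as follows. The sketch is not taken from any particular cluster system: the function names (split, handle_portion, run_operation) are hypothetical, the sum-of-squares workload is an arbitrary stand-in, and local threads stand in for the individual computers of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Divide the overall operation into at most `parts` contiguous portions."""
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]


def handle_portion(portion):
    """Work performed by one computer on its assigned portion.

    A sum of squares is used purely as a placeholder workload.
    """
    return sum(x * x for x in portion)


def run_operation(data, workers=4):
    """Assign portions to workers, then combine the partial results.

    Assumption: each thread stands in for one computer of the cluster;
    a real cluster would dispatch the portions over its network.
    """
    portions = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(handle_portion, portions))
    return sum(partials)


if __name__ == "__main__":
    print(run_operation(list(range(1000))))
```

In such a sketch the controlling software corresponds to run_operation, which neither performs the work itself nor needs to know how each portion is computed; it only assigns portions and gathers results as they become available.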
Clustering is especially advantageous in allowing the use of a large number of inexpensive individual computers to handle a problem typically requiring a much more sophisticated and expensive computer. This allows the basic hardware of a cluster to be relatively inexpensive when contrasted to the hardware cost of advanced computers in which a number of processors share memory. Clustering does not seem to reach the computational limits of shared-memory multiprocessor machines since each individual computer of the cluster controls its own internal memory and computing operations. Moreover, for various reasons, clustering has been adopted by researchers who believe that software design is advanced when the software is freely available to those who might contribute to its improvement; consequently, a great deal of useful software is available inexpensively. For example, system software for clustering is available through the “Beowulf” project.
Because of these advantages, clustering has been increasingly used as a method for handling large problems.
In general, however, clustering has a number of inherent difficulties which have limited its use to that of a research tool. First, the operation of clusters has been restricted to highly capable computer scientists. This restriction results from the large amount of knowledge required for the operation of a cluster. For example, to set up a cluster requires that the individual computers all be joined together in some form of network by which cooperation can be coordinated; this requires a sophisticated knowledge of networks and their connections. Once the physical network is established, the various switches of the network must be configured before the cluster can be brought into operation. Once the switches have been configured, each individual computer must be booted and its correct operation in the network tested; this typically requires a local operator and a coordinating administrator at a selected controlling one of the computers. Bringing a cluster into operation typically requires a large staff of engineers and may take days. Because of the difficulty of start-up, once a cluster is running, it is typically kept running at all costs.
Keeping a cluster running is also quite difficult and time-consuming. Once a cluster is in operation and handling a particular problem, any failure of an individual computing unit requires that the failure be known to, and its handling be coordinated with, all of the other units. The system software controlling the cluster must be able to indicate to all of the units that a particular unit has malfunctioned and take steps to obviate the problem. This requires advising each individual unit that a particular unit has malfunctioned, taking steps to see that any incorrect data is isolated, and handing the function of that computing unit to some other unit. This often requires a significant amount of operating time. A full-time staff is needed to coordinate the operation of a cluster, to keep the cluster functioning, and to handle problems as they arise.
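The three failure-handling steps described above (advising the other units, isolating the failed unit's work, and handing its function to another unit) may be sketched as follows. This is a minimal illustration under stated assumptions, not the behavior of any actual cluster software: the Unit record, the handle_failure function, and the least-loaded reassignment policy are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """Hypothetical record kept for one computing unit of the cluster."""
    name: str
    healthy: bool = True
    tasks: list = field(default_factory=list)
    known_failed: set = field(default_factory=set)


def handle_failure(units, failed_name):
    """Coordinate the failure of one unit with all remaining units.

    `units` maps unit names to Unit records. Returns the tasks that
    were taken from the failed unit and reassigned.
    """
    failed = units[failed_name]
    failed.healthy = False
    survivors = [u for u in units.values() if u.healthy]
    if not survivors:
        raise RuntimeError("no healthy units remain")

    # Step 1: advise each remaining unit of the malfunction.
    for unit in survivors:
        unit.known_failed.add(failed_name)

    # Step 2: isolate the failed unit by removing its pending work,
    # so that nothing further is accepted from it.
    orphaned, failed.tasks = failed.tasks, []

    # Step 3: hand each orphaned task to the least-loaded survivor
    # (an assumed policy chosen only for illustration).
    for task in orphaned:
        target = min(survivors, key=lambda u: len(u.tasks))
        target.tasks.append(task)
    return orphaned
```

Even in this simplified form, every surviving unit must be touched for a single failure, which suggests why such coordination consumes significant operating time as the number of units grows.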
Clusters have other problems. Like other computers, the individual units of a cluster require power to operate and consequently generate heat. The power required to operate the individual computers of a cluster, the switches connecting the units of the cluster, and the associated air conditioning is similar to that required to operate supercomputers of similar processing power.
The power requirements for operation and the staffing needed have rendered the actual costs of using clusters similar to those of other computer systems of similar capabilities. All of these problems have limited the use of clusters to high-end laboratory use.
It is desirable to provide new methods and apparatus for providing clusters of computers capable of easy administration and economic operation.