1. Field of the Invention
This invention relates generally to parallel computing systems and, more specifically, to an Expanded Method and System for Parallel Operation and Control of Legacy Computer Clusters.
2. Description of Related Art
Parallel computation, the use of multiple processors (both within one computing device and across networked computing devices) to solve large computational tasks, has been an objective of the industry for quite some time. In seeking to serve these large computational tasks, scientists have often written their own software code; historically, this code was written specifically for parallel computers (i.e. computers having multiple processors). While these “parallel applications” functioned adequately (when constructed well, of course), their utility was limited to the particular task (and often the particular hardware) for which they were specifically written; changes in hardware and/or software requirements typically required substantial, costly software revisions. Furthermore, these applications typically could not be used on other computing devices.
Large parallel computers are typically located at major supercomputing centers. The hardware consists of large parallel machines (e.g. Cray T3E, IBM SP, Fujitsu, etc.), and the software is commonly proprietary vendor software (often a Unix variant). These “supercomputers” are managed by large staffs of professional administrators, and the majority of their operations are not accessible directly by individual users (except through the administrators).
As personal computers led the hardware and software evolution to where substantial computing power became attainable by the individual, systems known as “Clusters” became prevalent. Computer clusters are a type of parallel computation system in which the processors of a network of computing devices are tightly coupled to share computation tasks in parallel fashion. An early and fairly prevalent version of a cluster computer is the “Beowulf” system, first assembled at the NASA Goddard Space Flight Center to solve Earth Sciences problems. The Beowulf cluster is characterized by a set of personal computers connected by specialized network hardware and running a specialized version of the open-source Linux operating system.
A Beowulf class cluster computer is distinguished from a Network of Workstations by several subtle but significant characteristics. First, the nodes in the cluster are dedicated to the cluster in order to “ease load balancing problems,” by preventing external factors from affecting the performance of individual nodes. A second characteristic of these systems is that the interconnection network is isolated from any external network, such that the network load is determined only by the application being run on the cluster. Along with this architecture, all the nodes in the cluster are within the administrative jurisdiction of the cluster. Since there is no external network access or participation, there is no need for (or provision of) network security.1
While proponents of Beowulf systems (running on Linux operating systems) assert that these systems make running parallel applications extremely user-friendly, it seems apparent that the clusters themselves are anything but simple to design and construct. In fact, it has been observed that at least two Beowulf clusters required approximately six months each to construct. Furthermore, the need for the computers to be captured (i.e. totally dedicated to use as a member of the cluster) in order to be a part of the cluster eliminates the possibility of making use of existing legacy networked computers.
If we turn to FIG. 1, we can review the general structure of a conventional (or legacy) computing device, such as a personal computer, so that we might next analyze how such a computing device might be modified in order to become part of a Beowulf cluster. FIG. 1 is a block diagram of pertinent functional components of a conventional computing device 10.
As shown, the computing device 10 comprises one or more processors 12 for performing the computations that are the essence of the function for which the computing device 10 is used. The device 10 will also include a conventional operating system 14 for controlling the device's operation. In communication with (or at least controlled by) the operating system 14 are one or more input-output sub-systems, such as a video monitor, a keyboard, a network portal, etc. Another important module controlled by the operating system 14 is the memory 18. It is in the memory 18 (in this case random access memory, or RAM) that software applications reside while they are being executed; of course the memory 18 is closely coupled with the processor(s) 12.
To say that a software application such as any of 20A-20C is being executed on the computing device 10 is to actually say that the calculations that make up the applications are being operated upon by the processor(s) 12. In the case of the typical application 20, the operating system 14 is the “translator” between the application 20 written in a so-called high level language and the processor(s) 12, although in the case of applications written in “machine language,” such as 20C, the application 20C interfaces directly with the processor(s). The operating system 14, although heretofore described generally, actually includes a critical component known as the “kernel.”
The kernel is the central module of an operating system 14. It is the part of the operating system 14 that loads first, and it remains in main memory 18. Because it stays in memory 18, it is important for the kernel to be as small as possible while still providing all the essential services required by other parts of the operating system and applications. Typically, the kernel is responsible for memory management 26, process and task management 22, I/O management 24 and disk management.2 In order to better understand the nuances of the present invention, the kernel is represented here as being discrete components responsible for the various functional areas of the operating system 14; in fact, only a single kernel is run on a particular machine at a particular time. If we now turn to FIG. 2, we can examine one of the drawbacks of the prior methods and systems for creating cluster computers.
FIG. 2 is a block diagram of pertinent functional components of a conventional computing device 11 as it would be modified for operation as a node in a cluster computer under the prior art. In order for the cluster node control and interface software application 20D to be able to operate on the conventional computing device 11 to give control of its processor(s) 12 to another computing device (which is necessary to cluster compute), it has always been necessary, at the very minimum, to replace the original operating system with a modified operating system 15. The new operating system 15 is specifically designed to provide a kernel having the new functionality necessary to permit the cluster node control and interface application to offer the device's processor(s) up for control by an external computer (i.e. the control computer in the cluster).
This new kernel will typically require a revised CPU kernel 22A to provide the CPU-sharing capability, a revised I/O kernel 24A to exchange job messaging with the external computer(s), and a revised memory kernel 26A to provide control, monitoring and access to the device's memory 18 to external computers. There are at least two problems with replacing the original operating system with a special-purpose operating system: (1) there is a much higher likelihood of instability in the operating system due to conflicts and/or errors, and (2) the revised operating system is unlikely to maintain its original functionality, which means that the device 11 would not be suitable for general use any time the modified operating system is “booted.” The result of this loss of original functionality is that the cluster will not be able to capitalize on existing (even idle) computer resources; the systems must be dedicated to the cluster and nothing else.
What is needed is a cluster node control software application that can operate with an existing, conventional or legacy operating system to provide shared processor resources to external computers in order to create a cluster computer without the need for computing resources dedicated to this task.
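By way of illustration only, the general idea of a node control application that runs as an ordinary program on an unmodified operating system might be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the invention: the JSON-over-TCP message format, the task table, and the names TASKS, NodeHandler, start_node, submit and run_job are all hypothetical and do not appear in this specification. Each node accepts a job message from a control computer over a standard network socket, performs the computation using the processor of the device on which it runs, and returns the result; no kernel or operating system replacement is involved.

```python
import json
import socket
import socketserver
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical task table; the task name and function are illustrative only.
TASKS = {
    "sum_squares": lambda start, stop: sum(i * i for i in range(start, stop)),
}


class NodeHandler(socketserver.StreamRequestHandler):
    """Handles one job message sent by a control computer."""

    def handle(self):
        request = json.loads(self.rfile.readline())
        result = TASKS[request["task"]](*request["args"])
        self.wfile.write((json.dumps({"result": result}) + "\n").encode())


def start_node(host="127.0.0.1", port=0):
    """Start a node agent in user space; port 0 lets the OS pick a free port."""
    server = socketserver.ThreadingTCPServer((host, port), NodeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def submit(address, task, args):
    """Send one job message to a node and wait for its reply."""
    with socket.create_connection(address) as conn:
        conn.sendall((json.dumps({"task": task, "args": args}) + "\n").encode())
        with conn.makefile("r") as reader:
            return json.loads(reader.readline())["result"]


def run_job(nodes, task, start, stop):
    """Split the range [start, stop) across the nodes and add the partial results."""
    step = (stop - start + len(nodes) - 1) // len(nodes)
    chunks = []
    for k, address in enumerate(nodes):
        lo = start + k * step
        hi = min(lo + step, stop)
        if lo < hi:
            chunks.append((address, task, [lo, hi]))
    if not chunks:
        return 0
    # One thread per chunk so that the nodes work at the same time.
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        return sum(pool.map(lambda chunk: submit(*chunk), chunks))
```

In this sketch a control computer would call run_job with the list of node addresses it knows about. For demonstration, two nodes can be started on one machine with start_node() and addressed through each server's server_address attribute; on a real network each node would run on a separate legacy device alongside its normal applications.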