One of the first computing models was the mainframe model. This model featured a central processing unit (“CPU”), volatile memory storage, and input/output (“I/O”) devices. It often had separate cabinets for each component: storage, I/O, RAM (random access memory), and task, program, and job management. According to this model, the host mainframe processor manages multiple processes and I/O devices, but all processes run on the host mainframe; the user interfaces are “dumb” terminals that communicate with the mainframe but do not run any of the processes. Communication between the mainframe and the user terminals is handled by a program running (usually in the background) on the mainframe. Other than user interface information, nothing is communicated between the host mainframe and the terminals. In effect, the terminal users are connected to a large computer (the mainframe) by long wires.
It is important to note that the operating systems for modern mainframes are limited in number, e.g., UNIX, Linux, VMS, z/OS, z/VM, VSE/ESA. Conventionally, a mainframe's control is limited to the processes being run within the boundaries of its computing machinery. Machines and other CPUs not associated with the mainframe's CPU and its operating system (OS) are treated as foreign systems, each foreign system conventionally consisting of a CPU, input/output, memory, and other devices. Conventionally, each mainframe CPU runs a single instance of its operating system and is dedicated to that system, and the operating system schedules CPU cycles and allocates memory only within that system.
The above mainframe model was the standard until the late 1960s and early 1970s, when several components of a mainframe were consolidated into what are now known as “microprocessors.” In conventional microprocessors, memory management, arithmetic units, and internal registers are incorporated into a single integrated circuit (commonly called a “chip”). Peripheral devices are handled by various interrupt schemes. Conventionally, these interrupts are electrical signals originating, directly or indirectly, from external peripherals for the purpose of interrupting the microprocessor's normal processing to service a peripheral device that requires immediate attention. A keystroke on a keyboard is a good example of a peripheral event that can normally create an interrupt and initiate an interrupt cycle. When a key is depressed, an interrupt is generated to the microprocessor's CPU by presenting a high or low signal on one of the microprocessor's interrupt pins. When an interrupt occurs, the CPU must relocate all information relevant to its ‘current’ job to an external location, i.e., a “stack” memory. The information stored in the stack generally includes the contents of all the CPU registers and program counter information, or the memory address of the program that the CPU was executing before the interrupt. This stack information is stored to enable the CPU to return to its pre-interrupt state after servicing the interrupt, the pre-interrupt state being defined by the contents of the stack. When the CPU services the interrupt, it starts with a blank page and a known state. After putting the ‘current’ job contents on the stack, the CPU examines a known location in memory to determine which device produced the interrupt. In the keyboard case, the CPU determines that a key was depressed, ‘jumps’ to a memory address, starts executing the program code that determines which key was depressed, and puts the results in memory, a memory cache, or a CPU register.
After ‘servicing’ the interrupt, the CPU retrieves the pre-interrupt information, restores all relevant registers and counters to their pre-interrupt states, and returns to the task (the current job) it was performing before the interrupt occurred. In conventional microprocessor-based disk operating systems, each microprocessor within the computing machine is dedicated to that operating system, disk storage, memory, program counter, and system scheduler, and the operating system manages its memory, interrupt stack, keyboard, and so forth.
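The interrupt cycle described above can be illustrated with a minimal sketch. It is a toy model written for this explanation only, not a description of any particular microprocessor: the CPU class, the two-register file, the vector table, and the handler address 0x0100 are all hypothetical names and values.

```python
from dataclasses import dataclass, field


@dataclass
class CPU:
    """Toy CPU (hypothetical): two registers, a program counter, a stack."""
    registers: dict = field(default_factory=lambda: {"A": 0, "B": 0})
    pc: int = 0
    stack: list = field(default_factory=list)
    memory: dict = field(default_factory=dict)

    def interrupt(self, device, vector_table, data):
        # 1. Save the pre-interrupt state (registers and program counter).
        self.stack.append((dict(self.registers), self.pc))
        # 2. Start from a known state and jump to the handler for the device.
        self.registers = {"A": 0, "B": 0}
        handler_address, handler = vector_table[device]
        self.pc = handler_address
        handler(self, data)
        # 3. Restore the pre-interrupt state and resume the current job.
        self.registers, self.pc = self.stack.pop()


def keyboard_handler(cpu, key):
    # Determine which key was pressed and put the result in memory.
    cpu.registers["A"] = ord(key)
    cpu.memory["last_key"] = key


# Assumed layout: one entry per device, (handler address, handler routine).
VECTOR_TABLE = {"keyboard": (0x0100, keyboard_handler)}

cpu = CPU(registers={"A": 7, "B": 9}, pc=0x2000)
cpu.interrupt("keyboard", VECTOR_TABLE, "k")
```

After the call, the registers and program counter hold their pre-interrupt values again, and the only lasting effect of the interrupt is the key stored in memory.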
Microprocessor-based systems offer large cost savings over mainframe computers. They require less maintenance, can be mass produced, and have far more software available at affordable prices. However, microprocessor systems are less reliable than mainframe systems. Microprocessors are prone to system freezes and component failure. When system maintenance is performed, microprocessor systems must be taken off-line and brought down to a single-user state, or turned off altogether. In mainframe systems, system maintenance can be performed while the machine is running, which is advantageous when the computer system must be online all the time. Each computer system's architecture and structure determine the cost, type of maintenance, and reliability of that system. In the microprocessor system each CPU is dedicated to a single operating system, so when a problem or fault occurs, the whole computing machinery stops. These faults can be hardware or software related, but each system will come to a halt if a critical system part fails or even a single CPU becomes lost or misdirected through some software error. The reliability problem has been attacked from several directions. Redundancy is the most common approach, e.g., double or even triple mirrored storage systems, dual-ported memory, and more.
Several computer architectures have addressed these problems. In the early 1990s, CRAY computers developed a microprocessor-based computer that was dubbed “Highly Available.” It featured redundant computer systems, each having its own memory, I/O, and scheduler. The redundant systems shared common data storage, on either disk or tape, and featured a time monitor, commonly called a “watchdog timer”: if one system did not respond to a signal from the other system within some selected amount of time, the monitor would assume that the non-responding system was “down” and would start up new processes on the remaining system to replace those lost on the failed system. Any process or user that was connected to the failed system needed to restart the process on the remaining system, which was available for processing the tasks that had been performed by the failed system. This redundancy makes this type of computing system highly reliable, and the ability to bring up the processes on the remaining system is what is called “highly available.” As an example, during the Sep. 11, 2001 crisis, when Highly Available systems located in New York failed, they failed over to systems in New Jersey and Wilmington, Del.
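The watchdog-timer behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the design of any actual product: the Node class, the watchdog_check function, the process names, and the five-second timeout are all hypothetical, and time is passed in explicitly rather than read from a clock.

```python
class Node:
    """One of two redundant systems (hypothetical model)."""

    def __init__(self, name, processes=()):
        self.name = name
        self.processes = list(processes)
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        # Record the last time this node answered the monitor's signal.
        self.last_heartbeat = now


def watchdog_check(monitored, survivor, now, timeout):
    """If `monitored` has been silent for more than `timeout` seconds,
    assume it is down and start its processes on `survivor`.
    Returns True when a failover happened."""
    if now - monitored.last_heartbeat <= timeout:
        return False
    # Processes are started anew; in-flight state is not carried over.
    survivor.processes.extend(monitored.processes)
    monitored.processes = []
    return True


a = Node("A", ["billing", "mail"])
b = Node("B", ["web"])
a.heartbeat(now=10.0)
b.heartbeat(now=10.0)
early = watchdog_check(a, b, now=12.0, timeout=5.0)  # A is within its window
b.heartbeat(now=20.0)                                # A stays silent
late = watchdog_check(a, b, now=20.0, timeout=5.0)   # A is presumed down
```

The sketch moves only the list of process names to the surviving node. Whatever state those processes held on the failed node is gone, which is why connected users must restart their work.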
In addition, some microprocessor-based systems are capable of reallocating CPUs. Two or more operating systems can be loaded on a machine with multiple CPUs of the same type and manufacture, and each of these CPUs can be allocated to a particular operating system by manual re-deployment. Each operating system has its own memory, I/O, scheduler, and interrupt system. However, once a CPU is committed to an operating system within the computing machine and that machine fails, the CPU will fail with that machine and cannot be reused until the operating system is restarted or manual intervention reassigns the CPU. The main advantage of this type of system is its ability to distribute processes and processing threads to several processors, which significantly decreases computing time.
Most of the world's fastest computers utilize hosts of microprocessors for parallel processing, but such systems are extremely complex. For example, the “backplane” circuit boards that connect the microprocessors together require as many as thirty-two circuit layers and are so complicated that only a single manufacturer is capable of producing them.
Clustering is another technique used to increase reliability and decrease computing time. Currently, certain vendors, such as CRAY, IBM, HP, and SUN MICROSYSTEMS, have increased the number of systems that can join a cluster from two to thirty-two, but the machines act independently of one another and do not share a common memory between operating systems. Such a cluster has no control over program counters and does not attempt to make the system more than highly available.
Another type of clustering system, available for Linux systems, is called a “Beowulf cluster.” It is a type of parallel system that utilizes PC (personal computer) style microprocessors: a cluster of desktop-type computers that are networked, with one of the computers acting as a master. Users can log in to the master node and run a script which will run their program on the slave nodes of the cluster. This program can contain specially designed code, such as MPI calls, to communicate with its siblings on the other nodes. Each slave's individual process can use this code to communicate with all of the other processes to transmit data, calculation results, etc., combining the power of all of the computers into one virtual supercomputer. The problem with the Beowulf system is that if the master system is lost, the whole cluster must be rebooted because the state of the machine, which the master held, has been lost. Without knowing the system state, the only option is to start over from scratch, and a reboot is required. In addition, each slave computer has its own set of memory and its own I/O, interrupt vectors, stacks, and schedulers. It is the unknown state of these items that requires the entire system to be rebooted.
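The master and slave division of work described above can be sketched in a few lines. This is only an analogy under stated assumptions: threads stand in for networked nodes, ordinary function calls stand in for MPI messages, and the names master and worker and the sum-of-squares job are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(node_id, chunk):
    # Each "slave node" computes a partial result on its own data.
    return node_id, sum(x * x for x in chunk)


def master(data, n_nodes):
    # The master splits the job and hands one chunk to each node.
    chunks = [data[i::n_nodes] for i in range(n_nodes)]
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(worker, range(n_nodes), chunks))
    # Only the master holds the record of which node produced what.
    return sum(value for _, value in partials), dict(partials)


total, per_node = master(list(range(10)), n_nodes=3)
```

In this sketch the record of which node handled which chunk exists only inside master. If that bookkeeping is lost, the partial results cannot be matched back to the job, which mirrors the loss-of-state problem described above.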
The Meta Mentor system of this invention is highly advantageous compared to the above-mentioned solutions because it employs novel parallel processing techniques to significantly reduce computing time, while also avoiding the loss-of-state problems that arise when one of the parallel processors or system components catastrophically fails. The system according to this invention is thus superior to even the “highly available” prior art.
Other advantages and attributes of this invention can be seen from a reading of the specification herein and a viewing of the drawings.