Computers are used to operate critical applications for millions of people every day. These critical applications may include, for example, maintaining a fair and accurate trading environment for financial markets, monitoring and controlling air traffic, operating military systems, regulating power generation facilities, and assuring the proper functioning of life-saving medical devices and machines. Because of the mission-critical nature of applications of this type, it is crucial that their host computers remain operational virtually all of the time.
Despite attempts to minimize failures in these applications, the computer systems that host them still occasionally fail. Hardware or software glitches can slow or completely halt a computer system. When such events occur on typical home or small-office computers, there are rarely life-threatening ramifications. Such is not the case with mission-critical computer systems. Lives can depend upon the constant availability of these systems, and therefore there is very little tolerance for failure.
In an attempt to address this challenge, mission-critical systems employ redundant hardware or software to guard against catastrophic failures and provide some tolerance for unexpected faults within a computer system. As an example, when one computer fails, another computer, often identical in form and function to the first, is brought online to handle the mission-critical application while the first is replaced or repaired.
Exemplary fault-tolerant systems are provided by Stratus Technologies International of Maynard, Mass. In particular, Stratus' ftServers provide better than 99.999% availability, being offline only two minutes per year of continuous operation, through the use of parallel hardware and software typically running in lockstep. During lockstep operation, the processing and data management activities are synchronized on multiple computer subsystems within an ftServer. Instructions that run on the processor of one computer subsystem generally execute in parallel on another processor in a second computer subsystem, with neither processor moving to the next instruction until the current instruction has been completed on both. In the event of a failure, the failed subsystem is brought offline while the remaining subsystem continues executing. The failed subsystem is then repaired or replaced, brought back online, and synchronized with the still-functioning subsystem. Thereafter, the two subsystems resume lockstep operation.
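The lockstep sequence described above (parallel execution, failure of one subsystem, continued execution on the survivor, resynchronization, and resumed lockstep) can be illustrated with a minimal software sketch. This is a toy model only and does not represent how an ftServer is implemented: the names `Subsystem` and `run_lockstep`, the single-accumulator machine state, and the assumption that the second subsystem is always the one that fails are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    """Toy stand-in for one computer subsystem (hypothetical model)."""
    name: str
    online: bool = True
    pc: int = 0    # index of the next instruction to execute
    acc: int = 0   # toy machine state: a single accumulator

    def step(self, instr):
        # Execute one instruction and advance the program counter.
        self.acc = instr(self.acc)
        self.pc += 1


def run_lockstep(program, a, b, fail_b_at=None, repair_b_at=None):
    """Run `program` on subsystems a and b, one instruction at a time.

    Assumption for illustration: only b ever fails, so a is always
    the surviving subsystem used as the resynchronization source.
    """
    for i, instr in enumerate(program):
        if fail_b_at == i:
            b.online = False                     # failed subsystem goes offline
        if repair_b_at == i and not b.online:
            # Repaired subsystem is synchronized with the survivor's state.
            b.pc, b.acc, b.online = a.pc, a.acc, True
        for s in (a, b):
            if s.online:
                s.step(instr)
        # Neither subsystem proceeds until both have completed the
        # current instruction with identical results.
        if a.online and b.online:
            assert (a.pc, a.acc) == (b.pc, b.acc), "lockstep divergence"
    return a, b


program = [lambda x: x + 1, lambda x: x * 2, lambda x: x + 3, lambda x: x * 5]
a, b = run_lockstep(program, Subsystem("A"), Subsystem("B"),
                    fail_b_at=1, repair_b_at=3)
print(a.acc, b.acc, b.pc)
```

In this sketch, subsystem B is offline for the second and third instructions, is resynchronized from A before the fourth, and both subsystems finish with identical state.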
Though running computer systems in lockstep does provide an extremely high degree of reliability and fault tolerance, it is typically expensive. The expense arises from the need for specialized, high-quality components and from the duplication of components. In addition, there are far fewer manufacturers of such computers than of consumer-quality computers, which prevents the economies of scale possible with consumer-quality computers. Furthermore, while 99.999% availability may be necessary for truly mission-critical applications, many users can operate perfectly well with a somewhat lower level of availability, and would happily do so if the systems could be provided at lower cost.
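To make the trade-off between availability levels concrete, the following short sketch converts an availability fraction into expected downtime per year, and back. It assumes a 365-day year of continuous operation; the function names are illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, assuming a 365-day year


def downtime_minutes_per_year(availability):
    """Expected minutes offline per year for an availability fraction (0..1)."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(minutes_offline):
    """Availability fraction implied by a given yearly downtime in minutes."""
    return 1.0 - minutes_offline / MINUTES_PER_YEAR


for label, fraction in [("99.999%", 0.99999), ("99.99%", 0.9999), ("99.9%", 0.999)]:
    print(f"{label}: {downtime_minutes_per_year(fraction):.2f} minutes/year")

print(f"2 minutes/year: {availability_from_downtime(2) * 100:.4f}% availability")
```

Under this assumption, exactly 99.999% availability corresponds to roughly 5.3 minutes of downtime per year, so two minutes per year is indeed better than 99.999%, while 99.9% availability corresponds to nearly nine hours per year.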
Virtualization technology has recently become a popular means for reducing a computer network's reliance on hardware. Since the 1960s, computer systems have created additional resources through the use of abstract or virtual computers. Through virtualization, several independent and isolated computing environments may reside on and run from a single hardware configuration, such as a server. The ability to create, maintain, and operate many computing environments on a single server can greatly reduce the cost of operations for any entity that utilizes a computer network.
What is needed, therefore, is a cost-effective, easily installable, fault-tolerant, high-availability computer network implemented through the abstraction of computing resources.