This invention relates to a method and system for executing application programs on distributed computer systems, including multiple nodes interconnected through a network, and to a method and system for protecting application programs from hardware and software failures.
Using a distributed computer system has three general advantages over using a single computer. First, it is possible to increase the system's performance by adding nodes. Second, a failure of one node does not impact the remaining nodes, which makes it possible for a software platform running on the distributed computer system to tolerate the failure. Tolerating node failures can greatly improve the overall availability of applications running on the distributed computer system. Third, it is possible to construct an inexpensive distributed computer system out of commodity parts, such as small modular servers connected by high-speed Ethernet.
A distributed computer system typically includes a sophisticated software platform for applications to harness its power. The software platform includes an execution control system that can distribute the applications over the nodes of a distributed computer system and manage their execution. The main functions of the execution control system include starting the applications' modules on the nodes of the distributed computer system; handling hardware and software failures; and balancing the applications' workload over the nodes of the system. In systems designed for continuous availability, the execution control system's functions also include coordinating the upgrades of applications and of the platform itself.
Software platforms and their execution control systems for distributed computer systems have been a subject of research and development since the advent of computer networks. We describe below the main forms of conventional systems related to software platforms for distributed computer systems and point out their limitations. The limitations are overcome by the system and method of the present invention.
The first form of system presently in use includes software platforms used in telecommunication systems, exemplified by Ericsson's “Open telecom platform” and “An ATM transport and control platform”. While these platforms achieve performance scalability and continuous availability, they suffer from several drawbacks that have prevented their use for applications outside of the telecommunication market segment. Their drawbacks include the following. Their application programming model is not standard, which inhibits adoption of the platform by application developers. Also, their application programming model is specific to the telecommunication application domain, and lacks the features required in other application domains. Further, their architecture is tightly coupled with the proprietary application programming model and cannot be easily extended to support other application programming models, such as the Java 2 Platform, Enterprise Edition (“J2EE”) model. In addition, the applications of these systems use proprietary, often complex, platform application programming interfaces (“APIs”) to achieve performance scalability and continuous availability. As each telecommunication platform uses a different API, application developers are unwilling to learn these APIs, which severely limits the adoption of the platforms. Also, some platforms require that applications be written in a proprietary programming language and rely on specialized proprietary hardware or operating system features. Additionally, the applications for some of the platforms include non-trivial execution control logic invoked by the platform's execution control system. The inclusion of such logic in the applications raises the barrier for developing applications. Furthermore, as the control logic is limited to a single application, the platform cannot easily optimize the execution of multiple applications, and the control logic of one application can contradict that of another.
Some platforms cannot optimally distribute applications across the nodes of a distributed computer system if some nodes have higher CPU capacity than other nodes, and others require that the nodes all be of the same type (for example, the nodes must be Intel processor-based computers running Linux). Lastly, some platforms, especially those for high-availability applications, dramatically restrict the network topology and sometimes require a “static” configuration (for example, an application can run only on two nodes and would not work if the system changed the number of nodes).
The second form of system presently in use includes high-availability clustering frameworks exemplified by “Sun Cluster 3.0 Concepts” and “Windows Server 2003, Server Cluster Architecture”. The main functionality of such a framework is to monitor a service and restart it on another node, should the node on which the service is running fail. The main drawbacks of these clustering frameworks include the following. The distribution model is limited to services with restartable processes. The platform starts and restarts the processes on the nodes according to some criteria associated with the service. This simplistic model does not work for applications that are composed of fine-grained distributable modules that are smaller than a process. For example, the service distribution model does not work for J2EE applications and cannot be easily extended to do so. Also, the framework does not provide a single-system view to management tools. Rather, each node is presented as an independent computer. The complexity of administering each node individually prevents the frameworks from being used in distributed computer systems with a large number of nodes. Additionally, most frameworks do not allow mixing nodes of different processor architectures, or nodes running different operating systems.
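By way of illustration only, the restart-on-failure model of such clustering frameworks can be sketched as follows. The function name `restart_failed_services`, the dictionary layout, and the least-loaded target selection are assumptions made for this sketch and are not taken from any of the cited frameworks. The sketch shows that the unit of distribution is an entire service process, which is why the model does not extend to modules smaller than a process.

```python
def restart_failed_services(assignments, live_nodes):
    """Return a new service -> node mapping after restarting orphaned services.

    assignments: dict mapping a service name to the node it currently runs on.
    live_nodes:  list of nodes reported alive (how liveness is detected,
                 e.g. by heartbeats, is outside this hypothetical sketch).
    """
    if not live_nodes:
        raise RuntimeError("no live node available for restart")

    # Count the services already running on each surviving node.
    load = {node: 0 for node in live_nodes}
    for node in assignments.values():
        if node in load:
            load[node] += 1

    result = {}
    for service, node in sorted(assignments.items()):
        if node in load:
            # The node is alive; the service stays where it is.
            result[service] = node
        else:
            # The node failed; restart the whole service process on the
            # least-loaded survivor (ties broken by node name).
            target = min(live_nodes, key=lambda n: (load[n], n))
            load[target] += 1
            result[service] = target
    return result
```

For example, with services "db" and "mail" on a failed node "n1" and "web" on "n2", calling the function with live nodes "n2" and "n3" restarts "db" on "n3" and "mail" on "n2", and leaves "web" in place.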
The third form of system presently in use includes platforms for distributed computer systems used for scientific applications such as “Sun HPC 3.1 AnswerBook Collection” from Sun Microsystems Inc. and “Beowulf Linux Clusters”. Their main drawbacks include the following. Their application programming model is limited to scientific applications. The programming model does not support other important application types, such as transaction processing applications, Web services, database services, or telecommunication applications. Also, they do not provide the application availability typically required by enterprise and telecommunication applications, such as those listed above.
The fourth form of system presently in use includes clustered servers supporting Java 2 Platform, Enterprise Edition (“J2EE”) applications such as “WebLogic 7.0, Creating and Configuring Server Domains” from BEA Systems. These clustered servers are designed to allow applications to execute on multiple identically-configured J2EE server processes and perform transactions on data in a shared database. The main limitations of this type of system include the following. The servers are limited to a single programming model (i.e., to J2EE applications). If the execution environment includes other distributed components, such as Web or database servers, the execution of these other components is controlled by a different distributed software platform. Using multiple different software platforms increases the administrative complexity of the distributed computing environment. Also, with the exception of Web session state, which can be stored in the application servers' memories, J2EE applications are stateless (meaning that all their application state is stored in a database outside of the application). Therefore, J2EE clustering platforms cannot be used generally for applications that have in-memory state spanning transaction boundaries. For example, it would be difficult or impossible to use clustered J2EE servers to run distributed non-J2EE Java applications or telecommunication call-control applications with in-memory state. Additionally, they do not provide single-system images of the computing resources. Although the J2EE servers typically include some mechanism to automatically provision applications to multiple nodes, each node appears as an independent computer for system administration purposes. This makes the system harder to use with a large number of nodes.
The fifth form of system presently in use is described in two United States patent applications. The first is titled “Method and Apparatus for Providing Application Specific Strategies to a JAVA Platform Including Start and Stop Policies” and has Ser. No. 09/812,536. The second is titled “Method and Apparatus for Providing Application Specific Strategies to a JAVA Platform Including Load Balancing Policies” and has Ser. No. 09/812,537. These references describe how control modules included in Java applications customize how the execution control system starts and stops the application service modules, recovers them from failures, and balances their load. A similar system is described in Ericsson's “A Java application platform”, which uses the term “root block” instead of “control module”. The main drawbacks of this type of system include the following. First, the mechanism seems to be limited to Java applications. Second, each application typically includes a “control module”. The control module includes the implementation of callback operations invoked by the platform during starting, stopping, failure recovery, and load-balancing of the application. The development of these operations would likely require specialized expertise that is not common among application developers. Also, a closer inspection of the description of such systems indicates that the control modules are likely to include execution control state. If the execution control state is lost as a result of a failure, it is not clear from the description of these systems how the lost state could be reconstructed. If the state could not be reconstructed, the applications associated with the failed control module must be stopped and restarted, at least in some systems, which is unacceptable for most applications that require continuous availability. Additionally, the applications are conventionally organized into a parent-child control hierarchy.
This concept is not part of the standard Java application programming model and would be foreign to application developers. Finally, the control modules are written by application developers, who are not experts in the field of execution control. Therefore, the control modules are likely to contain more bugs than the platform that is developed and well tested by experts in the field. As a software error in one of the control modules could negatively impact the operation of the entire platform, including other applications, the use of control modules may reduce the overall reliability and availability of the entire distributed computer system.
Some prior-art systems use replication as a technique for achieving tolerance to failure, thereby increasing the availability of the application to its users. When replication is used, an application is partitioned into smaller execution modules and each execution module is replicated on two or more nodes of a distributed computer system. Many of the prior-art systems suffer from one or both of the following flaws. First, some prior-art systems distribute the replicas across the nodes such that after a failure of a node, another node in the system takes over the entire workload of the failed node. The disadvantage of this approach is that the capacity of the system (and its price) is doubled in order to handle a node failure. These systems are sometimes called 2-N availability (or 2-N redundancy) systems because twice as many nodes are used in order to tolerate a single node failure. Second, some other prior-art systems distribute the replicas of execution modules across the nodes such that after a failure of any node, all the remaining nodes take over a portion of the workload of the failed node. These systems are often called N+1 availability (or N+1 redundancy) systems because only one extra node is required to tolerate a single node failure. The main problem with this approach is that if the number of nodes is high, the exposure to a double node failure could become significant. With a double node failure (when a second node fails shortly after the failure of a first node), the state of an application could be lost because the state was replicated across the two failed nodes.
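The trade-off between the two replication approaches described above can be illustrated with a small model. The following is a minimal sketch under stated assumptions, not a description of any particular prior-art system: each execution module has exactly two replicas (a primary and a backup), the function names are hypothetical, and a double node failure is modeled simply as any two nodes failing together. The sketch computes, for each placement, the largest takeover load on a surviving node after a single failure and the fraction of two-node failures that destroy both replicas of some module.

```python
from itertools import combinations


def paired_placement(num_nodes, modules_per_node):
    """2-N style (assumed model): nodes form fixed pairs that back each other up."""
    if num_nodes % 2:
        raise ValueError("paired placement needs an even number of nodes")
    placement = []  # one (primary_node, backup_node) tuple per execution module
    for node in range(num_nodes):
        partner = node ^ 1  # pairs are (0, 1), (2, 3), ...
        placement += [(node, partner)] * modules_per_node
    return placement


def spread_placement(num_nodes, modules_per_node):
    """N+1 style (assumed model): a node's backups go round-robin to all other nodes."""
    placement = []
    for node in range(num_nodes):
        others = [n for n in range(num_nodes) if n != node]
        for m in range(modules_per_node):
            placement.append((node, others[m % len(others)]))
    return placement


def max_takeover_load(placement, num_nodes):
    """Most modules any single survivor must take over after one node fails."""
    worst = 0
    for failed in range(num_nodes):
        extra = [0] * num_nodes
        for primary, backup in placement:
            if primary == failed:
                extra[backup] += 1
        worst = max(worst, max(extra))
    return worst


def fatal_pair_fraction(placement, num_nodes):
    """Fraction of two-node failures that lose both replicas of some module."""
    replica_sets = {frozenset(p) for p in placement}
    pairs = list(combinations(range(num_nodes), 2))
    fatal = sum(1 for pair in pairs if frozenset(pair) in replica_sets)
    return fatal / len(pairs)
```

With eight nodes and fourteen modules per node, the paired layout leaves one survivor taking over all 14 modules of the failed node (so each node needs spare capacity for a full extra workload), but only 4 of the 28 possible node pairs are fatal. The spread layout reduces the takeover load to 2 modules per survivor, but every one of the 28 node pairs holds both replicas of some module, which is the double-failure exposure noted above.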