1. Field of the Invention
This invention relates to the field of distributed execution of software programs and, more particularly, to a method and system for coordinating all aspects of cooperation between different copies of the same software program residing on a single machine.
2. Description of the Related Art
The use of networked computers has increased dramatically since the first affordable PC's were introduced in the early 1980's. In today's business world it is not unusual to find hundreds of computers interconnected in a Local Area Network (LAN) arrangement so that all employees of a business organization can communicate with each other, share file access, share peripherals such as printers, etc. Wide Area Networks (WAN's) increase the level of connectivity by interconnecting several LAN's (e.g., LAN's for two or more geographically diverse locations of the same company) together to form an even larger network.
In the prior art, there are various examples where multiple copies of a software program, running in a network with one copy on each node (machine) in the network, cooperate with each other to perform one or more tasks. For example, in the Open Shortest Path First (OSPF) routing protocol in the Internet, and in the computation of spanning trees in token ring bridges, each machine runs one copy of the software that performs the protocol. Coordinating the distributed (inter-machine) computation among the machines for these distributed protocols is a very complex operation. At the same time, more and more services are provided on networks, and usage of individual services is increasing rapidly due to the ease with which very large numbers of users can access them.
One common technique for increasing the number of users that can reliably and efficiently use a network service is to run multiple copies of the service on each machine, to make better use of the processing power of the machine (e.g., one with multiple central processing units (CPUs)). While such arrangements generally save time and money and increase efficiency, administration of the multiple copies can be a complex task. Examples of such administration include updating the configuration of the multiple copies of the software program, or coordinating some aspect of their execution on the machine, such as “license counting” (keeping track of the collective use of some resource, e.g., the number of connections created to backend data sources, by all copies of the software program running on the machine at any given time, to ensure compliance with license restrictions).
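By way of illustration only, the license counting described above can be sketched as follows. The names total_connections and within_license, the copy identifiers, and the numeric limits are hypothetical and do not come from any particular product; the sketch assumes each copy reports its current number of backend connections.

```python
def total_connections(connections_per_copy):
    """Sum the backend connections reported by every copy on the machine."""
    return sum(connections_per_copy.values())


def within_license(connections_per_copy, licensed_limit):
    """Return True if the collective usage does not exceed the licensed limit."""
    return total_connections(connections_per_copy) <= licensed_limit


# Hypothetical snapshot: connections currently held by each copy of the program.
usage = {"copy-1": 12, "copy-2": 7, "copy-3": 9}

print(total_connections(usage))    # 28
print(within_license(usage, 30))   # True
print(within_license(usage, 25))   # False
```

The point of the sketch is that no single copy can answer the license question alone; the per-copy counts must be gathered in one place before the limit can be checked.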
In the prior art, where multiple machines in a network, each machine running a single copy of a program, perform some cooperative task, coordination of these tasks is typically performed by “electing” a supervisor machine such that the software copy on the elected machine operates to perform the coordination, i.e., the software copy on the supervisor machine is the supervisor of the identical software programs. The supervisor has a master-slave relationship with the copies of the program running on the other machines, and coordinates the cooperation among the copies. FIG. 1 illustrates a simple example of such an arrangement.
Referring to FIG. 1, three machines, 101, 102, and 103, are interconnected via a network 100 in a typical arrangement. Each machine includes an identical copy (a clone) of a software program 104. In this example machine 102 has been designated as the supervisor machine and the software program 104 on machine 102 carries out the coordination of the cloned software programs.
Each of the machines 101, 102, and 103 has a unique identifier of some kind, typically an Internet Protocol (IP) address. This makes the coordination function of the supervisor relatively simple: to obtain task-specific information regarding the software 104 on a particular machine, the supervisor connects to the particular machine using the unique identifier and communicates with the software 104 residing thereon.
In this scenario, supervisor election typically occurs using a network level broadcast mechanism. Each copy of program 104 announces on the network its intention to become the supervisor. Some tie-breaking mechanism (e.g., the “smallest IP address wins”) is used to elect the supervisor. Once the supervisor is elected using this process, each subordinate creates a connection to the supervisor. Each subordinate uses the connection to the supervisor to perform the coordination function it is designed for. The supervisor and the subordinates also constantly perform “heartbeat/keep alive” protocols over the network. This allows all the subordinates to detect if/when the supervisor terminates (normally or abnormally), in which case the reelection process over the network is repeated.
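As a minimal sketch of the prior-art scheme just described, the following shows a “smallest IP address wins” tie-break and a simple heartbeat timeout check that triggers a new election. The function names, the example addresses, and the timeout value are assumptions made for illustration; the network broadcast and the connections themselves are not modeled.

```python
import ipaddress


def elect_supervisor(candidates):
    """Pick the candidate whose IP address is numerically smallest."""
    # Compare as parsed addresses, not strings, so that .3 sorts before .20.
    return min(candidates, key=ipaddress.ip_address)


def supervisor_alive(last_heartbeat, now, timeout):
    """Treat the supervisor as alive if a heartbeat arrived within the timeout."""
    return (now - last_heartbeat) <= timeout


# Hypothetical machines announcing their intention to become supervisor.
nodes = ["192.168.1.20", "192.168.1.3", "192.168.1.101"]
supervisor = elect_supervisor(nodes)
subordinates = [n for n in nodes if n != supervisor]
print(supervisor)  # 192.168.1.3

# If heartbeats stop (here: 10 seconds of silence against a 5 second timeout),
# the remaining machines repeat the election among themselves.
if not supervisor_alive(last_heartbeat=100.0, now=110.0, timeout=5.0):
    supervisor = elect_supervisor(subordinates)
print(supervisor)  # 192.168.1.20
```

Note that this scheme relies on each participant having a distinct network address, which is what makes the tie-break unambiguous.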
It is now becoming common to run multiple copies of the same software program on a single machine. For example, a Web application server that provides a server-side Java-based execution environment for dynamic Web page generation, e.g., using Java Server Pages (JSP) technology, might allow multiple Java Virtual Machines (JVMs) to be run on a single machine, both to allow better CPU utilization of a multiprocessor machine and to provide better fault tolerance in the event of a crash of a single JVM. In such a configuration, multiple copies of the same program (application) could be running on the JVMs of a machine. Each copy of the program performs its “main task” on an individual basis (e.g., each copy of the program running on a JVM in the example above could access a database to generate Web pages dynamically, in response to a browser request). However, these programs might occasionally need to cooperate with each other to perform various tasks, such as maintaining a registry of the available copies of the program on a machine for administration purposes, or computing the resource usage (e.g., the number of connections being used) by all the copies of the program on a particular machine.
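The registry of available copies mentioned above might, purely as an illustration, look like the following in-memory structure. The class name CopyRegistry, its methods, and the identifiers such as jvm-1 are hypothetical; how the copies would actually reach a shared registry on one machine is deliberately left out of the sketch.

```python
class CopyRegistry:
    """Tracks which copies of a program are currently available on one machine."""

    def __init__(self):
        self._copies = {}  # copy identifier -> free-form description

    def register(self, copy_id, description=""):
        """Record that a copy has started and is available."""
        self._copies[copy_id] = description

    def unregister(self, copy_id):
        """Remove a copy that has stopped; unknown identifiers are ignored."""
        self._copies.pop(copy_id, None)

    def available(self):
        """Return the identifiers of all registered copies, sorted."""
        return sorted(self._copies)


registry = CopyRegistry()
for copy_id in ("jvm-1", "jvm-2", "jvm-3"):
    registry.register(copy_id, "page-generation application")

registry.unregister("jvm-2")   # e.g., one JVM has crashed or been stopped
print(registry.available())    # ['jvm-1', 'jvm-3']
```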
WebSphere is an Internet software platform developed by IBM which, among other things, allows multiple JVM's to be run on a single machine. WebSphere includes a special “supervisor” program called the WS Admin Server which administers the JVM's, allowing them to be started and stopped. WebSphere also includes an “admin repository” which is a relational database that could reside on any node. The admin repository contains all of the configuration information for the JVM's, including a list of the JVM's residing on the machine.
The WebSphere Admin Server (supervisor) runs as a separate operating system process and, to the understanding of the applicant, no TCP connections are involved between the Admin Server process and the JVM's. Thus, WebSphere has a specialized supervisor program; the WebSphere supervisor only allows administration operations, i.e., start/stop, and does not provide a coordination function; the admin repository, where the list of the JVM's resides, is not in the supervisor process; if the WS Admin Server terminates, no further administration is possible; and there are no TCP connections between the subordinates (JVM's) and the supervisor (WS Admin Server).
WebSphere also includes a special purpose plug-in which operates to obtain requests from the Web server and pass the requests to one of the JVM clones (which are identical in that they can handle identical requests). The plug-in performs this forwarding function by using a private protocol. In passing requests from the Web server to the JVM clones, the plug-in sometimes utilizes TCP/IP connections between the plug-in and the JVM's. However, the plug-in performs no supervisory or administrative functions with respect to the JVM's, and if the plug-in terminates, no special action is taken by the JVM's to identify a new plug-in with which to establish a connection.
It would be desirable to have a method and system for enabling the coordination of multiple copies of a software program residing on a single machine which solves the problems identified above.