Distributed networks are well known to programmers and computer system architects. A distributed network may include multiple nodes, computers, or servers. As used herein, a server is defined as a software process. A node or cluster is a group of servers that may exist on a single hardware machine or share a physical resource such as a memory disk. Each server in a network usually has applications or objects that perform different functions. An application on a particular server may be initiated by another server or by the server it resides on. Distributed networks are advantageous in that several applications required to accomplish a task or a process may be distributed among several servers. The distributed applications may then be called upon when needed. Processes invoked simultaneously may be run on different servers instead of weighing down a single server and processor. This advantageously distributes processing power and contributes to a more efficient network.
Distributed transactions can span multiple servers, and servers often host resource managers (e.g., database connection pools or JMS queues) which participate in distributed transactions. As a result of distributed transaction participation, locks or other internal resources can be held in the resource managers (e.g., database locks are acquired for database records that are updated in a distributed transaction) on behalf of the distributed transaction until the distributed transaction is completed. For each distributed transaction, a particular server acts as the coordinator, which drives the participating transactional resources to commit atomically, and thus the transaction to completion, via the Two Phase Commit (2PC) protocol. In the first phase of the 2PC protocol, the coordinator logs a record of the transaction and its participants persistently in its TLOG files after all participants are prepared successfully. Once prepared, all participants hold on to the acquired locks or other internal resources for the transaction until they are told to commit or roll back by the coordinator. In the second phase of the 2PC protocol, the coordinator commits all the participants, which then make the updates durable and release the locks and other internal resources. After all participants are successfully committed, the coordinator then releases the log record from its TLOG. Thus, if a coordinator fails, the in-flight transactions that are logged in its TLOG files cannot be driven to completion, and their participants cannot release their locks or other internal resources, until the coordinator is restarted. Consequently, with systems of the prior art, transaction recovery cannot take place before a failed server restarts. This limits the availability of transaction recovery for the failed server and thus the availability of other XA resources (e.g., JMS backends).
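The two phases described above, and the effect of a coordinator failure between them, can be illustrated with a minimal in-memory sketch. It is only an illustration under stated assumptions: the names Participant, Coordinator, phase_one, phase_two, and recover are hypothetical, a Python dict stands in for the persistent TLOG files, and a set of transaction identifiers stands in for the locks held by a resource manager. It is not the implementation of any particular transaction manager.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Hypothetical resource manager holding one lock per prepared transaction."""
    name: str
    locks: set = field(default_factory=set)
    committed: set = field(default_factory=set)

    def prepare(self, txid: str) -> bool:
        self.locks.add(txid)       # lock is held until commit or rollback
        return True                # vote "prepared"

    def commit(self, txid: str) -> None:
        self.committed.add(txid)   # make the update durable
        self.locks.discard(txid)   # release the lock

    def rollback(self, txid: str) -> None:
        self.locks.discard(txid)   # abandon the update and release the lock


class Coordinator:
    """Hypothetical coordinator; `tlog` stands in for persistent TLOG files."""

    def __init__(self, tlog: dict):
        self.tlog = tlog           # maps txid -> list of participants

    def phase_one(self, txid: str, participants: list) -> bool:
        # Log the transaction only after every participant has prepared.
        if all(p.prepare(txid) for p in participants):
            self.tlog[txid] = list(participants)
            return True
        for p in participants:
            p.rollback(txid)
        return False

    def phase_two(self, txid: str) -> None:
        # Commit every participant, then release the log record.
        for p in self.tlog[txid]:
            p.commit(txid)
        del self.tlog[txid]

    def recover(self) -> None:
        # After a restart, drive every logged transaction to completion.
        for txid in list(self.tlog):
            self.phase_two(txid)


if __name__ == "__main__":
    tlog = {}
    db, jms = Participant("db"), Participant("jms")
    Coordinator(tlog).phase_one("tx1", [db, jms])
    # The coordinator "fails" here: both participants still hold their locks.
    print(db.locks, jms.locks)
    # Only a coordinator with access to the same TLOG can release them.
    Coordinator(tlog).recover()
    print(db.locks, jms.locks)
```

In this sketch the locks remain held for as long as no coordinator reads the surviving log, which mirrors the availability limitation described above.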
In addition to unexpected server failure, a server may be brought down intentionally. Application servers are often configured to run on specific machines to service client requests. These machines are brought down for periodic maintenance, machine servicing, and other reasons. As a result, the servers located on the downed machine are not able to service client requests to that machine or perform recovery of in-doubt transactions until the servers are restarted.
One approach the prior art has taken to address this problem is to migrate servers and their TLOG files to a back-up or alternate machine. This allows unfinished transactions in a TLOG to be processed, thus improving the availability of the failed server and preserving the operation and efficiency of a network. One such server migration system for use in a distributed network is included in the BEA TUXEDO application. TUXEDO supports migration of multiple servers residing on a machine. The servers to be migrated must be either a group of servers or all the servers that reside on a machine. A group of servers within the TUXEDO application is defined as a collection of servers or services on a machine, often associated with a resource manager.
An administrator manually migrates servers using the TUXEDO application. The administrator specifies a primary machine and a secondary or back-up machine for each group of servers. Once a server group has failed or been deactivated by a user, a user may manually migrate the servers from the primary machine to the secondary machine. The primary then becomes the acting secondary machine, and the secondary becomes the acting primary machine. When the group of servers is to be moved back to the original primary machine, the user shuts down the back-up machine and then migrates the server group back to the original primary machine.
Though a TLOG cannot be migrated by itself in TUXEDO, an administrator may manually migrate a TLOG file to a back-up server as a secondary step to migrating a server. The TLOG migration is a manual process performed with tmadmin commands. To migrate a TLOG in TUXEDO, a "tmadmin" session is started and all servers that write to the TLOG are manually shut down by a user. Next, the user dumps the TLOG contents into a text file, copies the text file to the back-up machine, and reads the text file into the existing TLOG for the specified back-up machine. The user then forces a warm start of the TLOG. Though a user may manually migrate a TLOG in this manner, TUXEDO does not support having multiple TLOGs per server.
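The order of the manual steps described above can be summarized in a small model. This is a hedged sketch only: Machine, migrate_tlog, and their fields are hypothetical names, a Python list stands in for a TLOG, and none of it reproduces actual tmadmin commands or the TLOG file format. It shows only the sequence: shut down the writers, dump to text, load the text into the back-up machine's existing TLOG, and warm start.

```python
class Machine:
    """Hypothetical machine with a single TLOG, modeled as a list of records."""

    def __init__(self, name, tlog=None):
        self.name = name
        self.tlog = list(tlog or [])    # in-flight transaction records
        self.running_writers = set()    # servers currently writing to the TLOG
        self.warm_started = False


def migrate_tlog(primary, backup):
    """Model the manual TLOG migration sequence; returns the dumped text."""
    # Step 1: every server that writes to the TLOG is shut down first.
    primary.running_writers.clear()
    # Step 2: dump the TLOG contents into a text form.
    text = "\n".join(primary.tlog)
    # Step 3: the text is copied to the back-up machine and read into the
    # TLOG that already exists there.
    backup.tlog.extend(line for line in text.splitlines() if line)
    # Step 4: force a warm start of the TLOG on the back-up machine.
    backup.warm_started = True
    return text


if __name__ == "__main__":
    primary = Machine("primary", ["tx1", "tx2"])
    primary.running_writers = {"server_a", "server_b"}
    backup = Machine("backup", ["tx0"])
    migrate_tlog(primary, backup)
    print(backup.tlog, backup.warm_started)
```

Note that step 1 stops every writer of the TLOG, not only the server that failed, which is the cost the following paragraph identifies.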
There are several disadvantages to the prior art such as the TUXEDO application. TUXEDO does not support the migration of anything less than a group of servers. Thus, if a single server has crashed in a system or requires maintenance, multiple servers must be shut down in order to migrate the server. TUXEDO requires that all servers that write to a particular TLOG file be shut down while the TLOG file is migrated. TUXEDO also does not support multiple TLOGs residing on a single server. In TUXEDO, there is only one TLOG for a group of servers. Once servers of a machine or group have migrated, and the corresponding TLOG is migrated thereafter, the secondary machine hosts only the migrated TLOG. Additionally, all migration steps in TUXEDO are done manually, including a complete shut-down of the secondary server when failing back to the original primary or master server. What is needed is a migration system that addresses the deficiencies of existing migration systems such as the one in TUXEDO.