The present invention relates generally to the field of software error handling, and more particularly to resource manager failure handling in a distributed transaction multi-process environment.
In computing, the eXtended Architecture (XA) standard is a specification by The Open Group for distributed transaction processing (DTP). It describes the interface between the global transaction manager and the local resource manager. The XA specification is a standard for multi-process global transactions performed across multiple resource managers. The XA standard specifies the XA interface, or switch, which is the bidirectional interface between a transaction manager (TM) and a resource manager (RM). The TM manages the connection and the transaction coordination to all the resource managers. However, all the work performed on the resources, for example, database operations such as SELECT and INSERT, is done by the application program. The XA interface is not an ordinary application programming interface (API). It is a system-level interface between DTP software components.
The goal of the XA standard is to allow multiple resources (such as databases, application servers, message queues, transactional caches, etc.) to be accessed within the same transaction, thereby preserving the atomic, consistent, isolated, and durable (ACID) properties across applications. Atomicity refers to a property in which work units must succeed or fail in an all-or-nothing manner, preventing partial updates to databases. Consistency refers to maintaining application constraints and ensuring that future transactions see the effects of past transactions. Isolation refers to how transaction integrity is visible to multiple concurrent users and systems. Durability is a property of transactions that ensures a committed transaction remains committed, and in distributed transactions it involves the coordination of participating systems.
The XA standard makes use of a two-phase commit, a type of consensus protocol, to ensure that all resources either commit or roll back any particular transaction. The XA standard specifies how a transaction manager will roll up the activities of a transaction against the different data-stores into an “atomic” transaction and execute this with the two-phase commit protocol for the transaction. Thus, the XA standard is a type of transaction coordination, often among databases or other resources. The XA standard coordination allows many resources to participate in a single, coordinated, atomic operational step of a transaction.
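The two-phase commit described above can be illustrated with a minimal sketch. The class and function names here (`ResourceManager`, `Vote`, `two_phase_commit`) are hypothetical and are not part of the XA specification; the resource managers are in-memory stand-ins, and failure handling, logging, and recovery are omitted.

```python
from enum import Enum


class Vote(Enum):
    OK = "ok"
    ABORT = "abort"


class ResourceManager:
    """Hypothetical in-memory resource manager (not a real XA binding)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote on whether the local work can be made durable.
        if self.can_commit:
            self.state = "prepared"
            return Vote.OK
        return Vote.ABORT

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "rolled_back"


def two_phase_commit(resource_managers):
    """Return True if every RM committed, False if all were rolled back."""
    # Phase 1: collect a vote from every participant.
    votes = [rm.prepare() for rm in resource_managers]
    # Phase 2: commit only on a unanimous OK; otherwise roll everything back.
    if all(vote is Vote.OK for vote in votes):
        for rm in resource_managers:
            rm.commit()
        return True
    for rm in resource_managers:
        rm.rollback()
    return False
```

In this sketch, a single participant voting to abort causes every participant to be rolled back, which is the all-or-nothing behavior the coordination is intended to provide.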
The XA specification describes what an RM must do to support transactional access, such as each RM providing a switch that gives the TM access to the RM's call routines. The switch contains the RM's name, pointers to entry points, a registration flag, and other information used by a transaction manager in connecting with RMs. Providing the information allows the set of RMs linked with an application to be changed without having to recompile the application.
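The role of the switch can be sketched as a small record of callables that the TM invokes without knowing anything RM-specific. This is a simplified illustration under stated assumptions: `XaSwitch`, `make_switch`, and `TransactionManager` are hypothetical names, only two entry points are modeled, and the entry points always report success.

```python
from dataclasses import dataclass
from typing import Callable, Dict

XA_OK = 0  # success return code used in this sketch


@dataclass(frozen=True)
class XaSwitch:
    """Simplified stand-in for an RM's switch; fields are illustrative."""
    name: str                                  # RM name
    flags: int                                 # e.g. a registration flag
    xa_open: Callable[[str, int, int], int]    # entry point: open the RM
    xa_close: Callable[[str, int, int], int]   # entry point: close the RM


def make_switch(name):
    # Hypothetical factory: builds a switch whose entry points always succeed.
    def xa_open(info, rmid, flags):
        return XA_OK

    def xa_close(info, rmid, flags):
        return XA_OK

    return XaSwitch(name=name, flags=0, xa_open=xa_open, xa_close=xa_close)


class TransactionManager:
    def __init__(self):
        self._switches: Dict[int, XaSwitch] = {}

    def register(self, rmid, switch):
        # The TM only stores the switch; it needs no RM-specific code.
        self._switches[rmid] = switch

    def open_all(self, info=""):
        # Call each RM through its switch and report the result by RM name.
        return {sw.name: sw.xa_open(info, rmid, 0)
                for rmid, sw in self._switches.items()}
```

Because the TM reaches every RM only through the registered switch, swapping one RM for another in this sketch means registering a different switch, with no change to the TM or application code.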
In a DTP environment, the transaction manager manages the connection and the transaction coordination to all the resource managers. However, all the work performed on the resources, for example, database operations such as SELECT and INSERT, is done by the application running on the resource.
The TM adopts a multi-process model to run applications concurrently, and in such an environment, middleware may cache connection handles acquired through XA open requests (an XA initialization process); these handles may be invalidated during RM failures. When an RM fails, each process detects the failure only when a subsequent RM-specific XA request issued by the TM fails. This means that the timing of failure detection, and of refreshing the connection handle after recovery from the failure, differs in each process. In this environment, the application program (AP) has to handle all the resource-specific errors gracefully, which requires an application programmer to consider how to handle errors related to resources managed by RMs (such as communication failures and planned or unplanned shutdowns) as well as other logical resource errors. The application must handle these errors because communication happens directly between the application and the resource, and the middleware processes have no way to intercept and handle the errors.
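The timing difference described above can be illustrated with a small simulation. All names here (`RestartableRM`, `WorkerProcess`, `RMFailure`) are hypothetical, the "processes" are plain objects rather than operating-system processes, and the reopen-and-retry step is only one assumed way to refresh a stale handle, not a behavior prescribed by the XA standard.

```python
class RMFailure(Exception):
    """Raised when a request uses a connection handle that is no longer valid."""


class RestartableRM:
    """Hypothetical RM whose restart invalidates all previously issued handles."""

    def __init__(self):
        self._generation = 0

    def xa_open(self):
        # In this sketch the handle is just the RM's current generation number.
        return self._generation

    def restart(self):
        # Simulates a failure followed by recovery: old handles become stale.
        self._generation += 1

    def request(self, handle):
        if handle != self._generation:
            raise RMFailure("stale connection handle")
        return "ok"


class WorkerProcess:
    """Stands in for one middleware process that caches its own handle."""

    def __init__(self, rm):
        self.rm = rm
        self.handle = rm.xa_open()
        self.failures_seen = 0

    def do_request(self):
        try:
            return self.rm.request(self.handle)
        except RMFailure:
            # The failure is noticed only here, on this process's next request.
            self.failures_seen += 1
            self.handle = self.rm.xa_open()  # refresh the cached handle
            return self.rm.request(self.handle)
```

If two workers share one RM and the RM restarts, the first worker to issue a request sees and repairs its stale handle, while the second worker remains unaware of the failure until it issues its own request.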