The present invention relates generally to transaction processing systems (TPS) and more particularly to recovery from failures during transaction processing to multiple databases which minimizes user intervention.
Human end-users are commonly exposed to many different failures and error situations in transaction processing systems (TPSs). TPSs are three-tier (client-server-database) systems which allow client applications to perform database transactions. Examples include reservation systems, such as those for airlines, hotels, and car rentals, and financial systems, such as those for banking, credit cards, and automated teller machines. In these systems, a customer or sales representative uses a client application that allows a user to query and update a database. The client interface allows the client to specify which database to add information to or to update. If a failure occurs, for example during an update, it is difficult for the client to know whether the update was actually performed or not.
As an example, for an Automated Teller Machine (ATM) failure, it is likely that the ATM customer would need to call the bank to find out whether the transaction was completed properly. This would generate more work for bank employees and would be very costly in terms of lost business, reduced productivity, and unsatisfied customers.
Essentially, the client needs to query the database and see if the update has been performed and if it has not, to reissue the update. Even this solution is, however, not failure proof.
First, the system might not know what the values were before the update, and if the update is relative to the old value, the client might not be able to determine whether the update was performed or not. For example, if the ATM transaction involved a deposit to an account, the bank employee would have to have information regarding the previous balance and any other deposits and withdrawals that may have occurred to the account around the time of the failure.
Second, another client might have done an update after the first client's query and failed update. Therefore, the first client will not be able to determine with confidence whether the update was performed or not. Thus, the first client would have to guess what to do.
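The two problems above can be illustrated with a minimal sketch. The function name, the account figures, and the deposit amounts are hypothetical and chosen only for illustration. The sketch shows that a re-queried balance can be identical whether the failed relative update was applied or whether it was lost and another client made an update in the meantime.

```python
def observed_balance(start, my_deposit, my_update_applied, other_deposit):
    """Balance the first client sees when it re-queries after the failure.

    All parameters are hypothetical illustration values.
    """
    balance = start
    if my_update_applied:
        balance += my_deposit      # the failed request was in fact committed
    balance += other_deposit       # update by another client in the meantime
    return balance


# Scenario A: the failed deposit of 50 was committed; no other client acted.
scenario_a = observed_balance(100, 50, True, 0)

# Scenario B: the failed deposit was lost; another client deposited 50.
scenario_b = observed_balance(100, 50, False, 50)

# The first client cannot tell the two scenarios apart from the balance alone.
ambiguous = scenario_a == scenario_b
```

In scenario A a reissued deposit would be applied twice, and in scenario B not reissuing would lose the deposit, so the client is left to guess.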
Thus, involving the client in the recovery of a failed request should be avoided to the largest possible extent. Unfortunately, complex commercial database systems and transaction processing systems generally fail to provide client transparency, or the ability to detect and correct problems without human intervention. Rather, it is expected that the failure recovery either be handled by the client or be coded into applications.
Embedding the failure recovery into the application code complicates the application considerably and is error prone.
Implementing error recovery logic is difficult and complex. Client applications as well as application servers currently need to implement error recovery as part of the application specific code. Further, the error recovery logic is not necessarily reusable for any application adhering to the described architecture.
In a TPS, the client application code demarcates the transaction. If the transaction fails, the client application retries the transaction. This might or might not involve action from the end user. In the worst case, the end user might need to reissue the transaction. The key problem with this approach is that there is a window in which a failure can occur and leave the client application without knowledge of the outcome of the transaction. In the worst case, the client needs to manually check whether the transaction was committed and then take the appropriate action.
Replication protocols, such as ISIS and Horus, both from Cornell University, allow a service to be implemented by multiple servers, called a server group. Each request issued by a client application will be sent to all the servers. The protocol ensures that the requests processed by the servers of a group are processed synchronously according to some synchronization criteria. The states of servers are kept consistent by ensuring this synchronization. When new servers join a group the state is transferred from a server that is up to date.
The main idea of replication protocols is that if the states are kept consistent for two or more servers, one of them can fail without impacting the delivery of the service.
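The replication idea described above can be sketched as a toy, in-process model. This is not the ISIS or Horus API; the class and method names are hypothetical, and the "same order to every server" guarantee is simulated by a simple loop rather than by a real group communication protocol.

```python
class Server:
    """One member of a server group, holding a copy of the service state."""

    def __init__(self, state=None):
        self.state = dict(state or {})
        self.alive = True

    def apply(self, key, delta):
        self.state[key] = self.state.get(key, 0) + delta


class ServerGroup:
    """Toy server group: every request goes to all live servers in one order."""

    def __init__(self, size):
        self.servers = [Server() for _ in range(size)]

    def live(self):
        return [s for s in self.servers if s.alive]

    def request(self, key, delta):
        members = self.live()
        if not members:
            raise RuntimeError("no live server in the group")
        for server in members:          # same request, same order, every server
            server.apply(key, delta)
        return members[0].state[key]    # any live server can answer

    def join(self):
        donor = self.live()[0]          # state transfer from an up-to-date server
        newcomer = Server(donor.state)
        self.servers.append(newcomer)
        return newcomer


group = ServerGroup(2)
group.request("x", 5)
group.servers[0].alive = False          # one server fails
after_failure = group.request("x", 1)   # the service is still delivered
newcomer = group.join()                 # the new member receives the current state
```

The sketch keeps all state inside the servers themselves, which is the case replication handles well; it does not model servers that write to a shared database.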
It is hard, and often impossible, to use replication solutions to solve reliability of services using database applications. The reason is that multiple servers cannot simultaneously coordinate their writes to the same database. If different databases were used, it would be difficult to keep the databases consistent. Both these problems are complicated further by the fact that databases can be accessed by other applications, causing the databases to diverge.
Thus, it is very difficult and often impossible to use replication protocols for the type of database (or state aware) applications for which the error recovery problem needs to be solved.
Even if it were simple to use replicated systems for database applications, they do not provide any support for automatically recovering from a failure during a database transaction. A client application would still need to determine the outcome of the transaction.
Basically, although replication protocols can provide error recovery to certain types of systems, they are not suitable for database applications. Secondly, it would still be necessary to determine the outcome of failed calls in some manner.
Traditional high-availability solutions for database-centric applications are typically based on clusters. A cluster consists of multiple computers, called nodes. Each node is capable of running a database, and when the database fails, it is restarted by cluster manager software. The consistency model for the database is based on the notion of "rollback", where the database is restarted in some previous, consistent state. The transactions that were being executed when the database failed are aborted.
Traditional cluster mechanisms limit database down time because they immediately restart the database. However, the failure is visible to the database clients. The clients may experience lost connections to the database, aborted transactions, and indeterminate transactions whose outcome cannot be determined.
The Microsoft Transaction Server (MTS) provides a programming model and runtime system for three-tiered, component-based systems. MTS provides transactional semantics for the components in the middle tier. These components are annotated Component Object Model (COM) components where the annotations capture transaction composition across components. From an application architecture perspective, MTS supports a programming model in which the middle tier contains stateless objects and transactions are demarcated in the middle tier.
However, MTS does not provide error-recovery logic. Client applications using MTS-controlled objects must manually and explicitly implement error recovery logic to determine the outcome of database transactions. Moreover, client applications will have to explicitly handle failures of middle-tier objects: they must catch exceptions that occur when middle tier objects fail and they must subsequently connect to a new middle tier object.
An answer has long been sought to solve the above problems, but it has also long eluded those skilled in the art.
The present invention is targeted to three-tier transaction processing systems (TPSs) built up as: one or more client applications (CAs), one or more server applications (SAs), and one or more database systems (DBSs). The client application implements an application that requires data and services that are best realized as distributed resources. An SA represents such a resource. The SA provides a service that can be shared among multiple CAs. SAs store their data in multiple databases. A client transparency mechanism (CTM) and a server transparency mechanism (STM) are added, and both can be represented as conventional state machines. A database on a clustered node is used for the database management. SAs implement transactional behavior. The STM implements the server side of the protocol so that the CAs may recover from SA and database failures. The cluster application programming interface (API) is used to determine when to retry. Information is stored in the STMs so that the outcome of the transaction can be determined. Thus, since most failures can be automatically recovered, the system provides very high availability from a client's perspective.
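One way to picture how stored outcome information lets a retry proceed without client involvement is the following minimal sketch. It is an illustration under stated assumptions, not the protocol of the invention: the request identifier, the `outcomes` table, the function names, and the simulated `ReplyLost` failure are all hypothetical, the outcome record is kept in the same database transaction as the update, and the retry is issued immediately instead of being timed through a cluster API.

```python
import sqlite3


class ReplyLost(Exception):
    """Hypothetical failure: the commit succeeded but the reply never arrived."""


def open_store():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
    db.execute("CREATE TABLE outcomes (request_id TEXT PRIMARY KEY, result INTEGER)")
    db.execute("INSERT INTO accounts VALUES ('acct', 100)")
    db.commit()
    return db


def server_deposit(db, request_id, account, amount, lose_reply=False):
    """Server side: apply the request at most once, recording its outcome."""
    row = db.execute(
        "SELECT result FROM outcomes WHERE request_id = ?", (request_id,)
    ).fetchone()
    if row is not None:
        return row[0]                  # already committed: replay stored result
    with db:                           # commits on success, rolls back on error
        db.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, account),
        )
        result = db.execute(
            "SELECT balance FROM accounts WHERE id = ?", (account,)
        ).fetchone()[0]
        db.execute("INSERT INTO outcomes VALUES (?, ?)", (request_id, result))
    if lose_reply:
        raise ReplyLost(request_id)    # simulated failure after the commit
    return result


def client_deposit(db, request_id, account, amount, fail_first=False):
    """Client side: retry with the same request id until a reply arrives."""
    attempts = 0
    while True:
        attempts += 1
        try:
            lose = fail_first and attempts == 1
            return server_deposit(db, request_id, account, amount, lose), attempts
        except ReplyLost:
            continue                   # outcome unknown to the client: retry


db = open_store()
result, attempts = client_deposit(db, "req-1", "acct", 50, fail_first=True)
final_balance = db.execute(
    "SELECT balance FROM accounts WHERE id = 'acct'"
).fetchone()[0]
```

Because the outcome record and the update commit together, the retried request finds the stored result and the deposit is applied exactly once, even though the first reply was lost.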
The present invention provides a simpler programming model for CAs and SAs where the error recovery logic is embedded in middleware.
The present invention further provides a TPS in which CAs do not need to do outcome determination after a failure and a subsequent system recovery.
The present invention further provides a TPS in which SAs do not need to explicitly reconnect to a database or request a retry from the CA.
The present invention further provides a TPS in which the solution masks communication and node failures for the CA.
The present invention further provides a highly available TPS which uses a data store with a notion of transaction.
The present invention further provides a TPS in which the CA uses stateless servers that use data-stores to save information and system state.
The present invention further provides a TPS in which an operation request will correspond to one transaction.
The present invention further provides a TPS in which data-stores are made highly available using a fault-tolerance solution, such as a cluster.
The above and additional advantages of the present invention will become apparent to those skilled in the art from a reading of the following detailed description when taken in conjunction with the accompanying drawings.