Embodiments of the present invention relate generally to methods and systems for providing high availability processes, and more particularly to providing high availability by decoupling an application session from processing of a supporting protocol.
High availability of a process, such as an application supporting a communication session, is achieved by replicating the session between nodes of a cluster so that if one node fails or otherwise becomes unavailable, another node or set of nodes can take over support of that session. An example of such replication is the group processing provided by the JGroups toolkit in JEE middleware. In this and other systems, the session and application level data are replicated on multiple nodes of a cluster using different strategies (e.g., reliable multicast, unicast, etc.) and different replication schemes (one-to-two, one-to-n, etc.). Upon failure of the node supporting the session, detected by means such as hardware failure detection or middleware monitoring (e.g., via a heartbeat), the session is switched to one of the other nodes on which the session is replicated, e.g., network resources are informed of the failure and switch to the other nodes. Since the session, including the application level data, is replicated on the other node, the session can be rebuilt and resumed on the other node. This is often referred to as service/application availability in that, as soon as a failure occurs, the service or application is again available for new transactions, sessions, calls, etc. Another approach to providing high availability is demonstrated by Oracle Coherence, which uses a distributed cache that can replicate sessions in a replica of the cache in a grid computing environment (i.e., a set of nodes). Technologies such as JGroups or Oracle Coherence thus provide high availability for a session supporting a particular protocol.
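The replication and failover behavior described above can be illustrated with a minimal sketch. It is a toy in-memory model only and does not use the JGroups or Oracle Coherence APIs. The names `Node`, `Cluster`, `create_session`, `update`, `heartbeat`, and `read` are hypothetical, as is the assumption that replicas are placed on the nodes that follow the primary in list order.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node holding copies of session data."""
    name: str
    alive: bool = True
    sessions: dict = field(default_factory=dict)  # session id -> session data copy


class Cluster:
    """Toy cluster that keeps each session on a primary plus `backups` other nodes."""

    def __init__(self, names, backups=1):
        self.nodes = {n: Node(n) for n in names}
        self.backups = backups
        self.owner = {}     # session id -> node currently serving the session
        self.replicas = {}  # session id -> nodes holding a copy of the session

    def create_session(self, sid, data, primary):
        # Assumed placement: the primary and the next `backups` nodes in list order.
        names = list(self.nodes)
        start = names.index(primary)
        holders = [names[(start + i) % len(names)] for i in range(self.backups + 1)]
        for holder in holders:
            self.nodes[holder].sessions[sid] = dict(data)
        self.owner[sid] = primary
        self.replicas[sid] = holders

    def update(self, sid, key, value):
        # Replicate every write to all live nodes holding the session.
        for holder in self.replicas[sid]:
            if self.nodes[holder].alive:
                self.nodes[holder].sessions[sid][key] = value

    def heartbeat(self):
        # Simulated monitoring pass: move sessions off owners that are down.
        for sid, owner in list(self.owner.items()):
            if not self.nodes[owner].alive:
                survivors = [h for h in self.replicas[sid] if self.nodes[h].alive]
                if not survivors:
                    raise RuntimeError(f"session {sid} lost: no live replica")
                self.owner[sid] = survivors[0]

    def read(self, sid):
        return self.nodes[self.owner[sid]].sessions[sid]


cluster = Cluster(["A", "B", "C"], backups=1)
cluster.create_session("s1", {"user": "alice"}, primary="A")
cluster.update("s1", "cart", 3)
cluster.nodes["A"].alive = False   # node A fails
cluster.heartbeat()                # failure detected, session switched
print(cluster.owner["s1"], cluster.read("s1"))
```

In this sketch the session resumes on node B with its application level data intact because every write was copied to B before node A failed.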
When multiple protocols are involved in a particular session, the protocol messages are sent from one node to another. For example, protocol specific load balancers or routers and other protocol specific mechanisms can send the protocol messages to a node supporting the session in that protocol, i.e., a node which processes the protocol messages and supports the session. The session on this first node is in turn replicated on a second node for that protocol. Upon failure of the first node, the session can be recovered on the second node, which is replicating the session. So, for example, following failure of the first node and a load balancing action in one of the protocols (e.g., Hypertext Transfer Protocol (HTTP)), the traffic of this first protocol may be sent to a new node since the load balancer or router is informed of the failure. However, the traffic of the other protocol (e.g., Session Initiation Protocol (SIP)) is not redirected, and the SIP side is not aware of what may have happened on the HTTP side if only HTTP was affected. In the case of a hardware failure, the SIP side may be aware of the failure, but the SIP traffic may be sent to a different machine from the one to which the HTTP load balancer decides to send the HTTP traffic. Therefore, the SIP traffic will continue to go to the first node or to another node that is not the same as the one to which the traffic for the first protocol was redirected. If the first node has since recovered, the SIP traffic will be processed by the now recovered first node even though the HTTP traffic is now being processed by the second node. All these variations can lead to a stalemate or “ping-pong” effect in which the session data is brought back to wherever the latest protocol message arrived and, if the messages arrive at different machines, the session goes back and forth between them. Hence, there is a need for improved methods and systems for high availability processing.
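The “ping-pong” effect described above can be made concrete with a small sketch. It assumes, purely for illustration, that the session data is moved to whichever node receives the latest protocol message. The function name `count_migrations` and the node names `node1` and `node2` are hypothetical.

```python
def count_migrations(messages, routes, start):
    """Count how often session data must move between nodes.

    messages: sequence of protocol names, in arrival order
    routes:   mapping of protocol name -> node its load balancer targets
    start:    node holding the session data before the first message
    """
    location = start
    moves = 0
    for protocol in messages:
        target = routes[protocol]
        if target != location:
            # The session data is brought to the node that received the message.
            location = target
            moves += 1
    return moves


traffic = ["HTTP", "SIP", "HTTP", "SIP", "HTTP", "SIP"]

# Before any failure, both protocols are routed to the same node.
healthy = count_migrations(traffic, {"HTTP": "node1", "SIP": "node1"}, "node1")

# After node1 fails and recovers, HTTP has been redirected to node2
# while SIP still targets node1; the session currently sits on node2.
diverged = count_migrations(traffic, {"HTTP": "node2", "SIP": "node1"}, "node2")

print(healthy, diverged)
```

With consistent routing the session never moves, whereas with diverged routing nearly every alternating message forces the session data to move to the other node.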