As demand for automation increases, enterprise applications that are highly available are better positioned to fulfill business requirements and keep up with market competition. In addition to meeting mission-critical requirements such as scalability, high availability, distribution, time sensitivity, performance, modularity, and loose coupling, enterprise applications that communicate with other internal or external enterprise systems are better equipped to fulfill business requirements. With an ever-increasing dependency on IT infrastructure to perform critical business processes, the availability of the IT infrastructure is becoming more important. Failure of an IT infrastructure may result in large financial losses, which increase with the length of the outage. Thus, careful planning may be required to ensure that the IT system is resilient to any hardware, software, local, or system-wide failure. An application environment may be considered highly available if it possesses the ability to recover automatically within a prescribed minimal outage window. Therefore, an IT infrastructure that recovers from a software or hardware failure, and continues to process existing and new requests, may be considered highly available.
Factors that can cause a system outage and reduce availability fall into two categories: planned and unplanned. Planned disruptions may be related to systems management (e.g., upgrading software or applying patches), or to data management (e.g., backup, retrieval, or reorganization of data). Conversely, unplanned disruptions may be related to system failures (e.g., hardware or software failures) or to data failures (e.g., data loss or corruption).
Enterprise Application Integration (EAI) is an integration framework composed of a collection of technologies and services which form a middleware to enable integration of systems and applications across the enterprise. EAI tools such as Message Oriented Middleware (MOM) software include features to fulfill enterprise application requirements. In some IT infrastructures, MOM technologies such as WebSphere MQ and/or other software and hardware make applications highly available. In some instances, WebSphere MQ is clustered to achieve high availability.
WebSphere MQ is a MOM product, available from IBM, which functions to transfer a datagram from one application to another on one computer system, or from one application to an application running on another computer system. When persistent messaging is used, WebSphere MQ logs messages to disk storage. Therefore, in the event of a failure, the combination of the message data on the disk plus the queue manager logs can be used to reconstruct message queues (MQs), restoring the queue manager to a consistent state at the time just before the failure occurred. MQs include message queues and mailboxes, which are software engineering components used for interprocess communication, or for inter-thread communication within the same process. Such MQs are used for messaging, that is, the passing of control or of content, and group communication systems may provide similar kinds of functionality.
In this context, a recovery involves completing normal units of work, with in-flight messages being rolled back, in-commit messages being completed, and in-doubt messages waiting for coordinator resolution. Various solutions use WebSphere MQ to improve availability: an active-passive solution using a shared disk; an active-active solution using WebSphere MQ queue manager clusters; and an active-active solution using WebSphere MQ queue manager clusters and a shared disk.
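The recovery rule stated above (roll back in-flight work, complete in-commit work, hold in-doubt work for the coordinator) can be illustrated with a minimal sketch. The names `TxState`, `UnitOfWork`, `recovery_action`, and `recover` are hypothetical and chosen for illustration only; they are not part of the WebSphere MQ API, and the sketch models only the decision, not the queue manager's actual log processing.

```python
from dataclasses import dataclass
from enum import Enum


class TxState(Enum):
    IN_FLIGHT = "in-flight"  # work started, commit not yet requested
    IN_COMMIT = "in-commit"  # commit was under way when the failure occurred
    IN_DOUBT = "in-doubt"    # outcome is decided by an external coordinator


@dataclass
class UnitOfWork:
    uow_id: str
    state: TxState


def recovery_action(uow: UnitOfWork) -> str:
    """Return the restart-time action for one unit of work."""
    if uow.state is TxState.IN_FLIGHT:
        return "roll back"
    if uow.state is TxState.IN_COMMIT:
        return "complete commit"
    # In-doubt work cannot be resolved locally.
    return "wait for coordinator"


def recover(units):
    """Map each unit of work found at restart to its recovery action."""
    return {u.uow_id: recovery_action(u) for u in units}
```

Calling `recover` on a list of units of work found at restart yields one action per unit, which makes explicit that only in-doubt work depends on a party outside the queue manager.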
In the active-passive solution, when a queue manager fails, a restart is required to make the local message queues available again. Until then, the messages stored on the queue manager will be stranded. In this solution, a second node is used as a passive node without its resources being used. The passive node becomes active when a failover is induced. In this process, the queue manager data files and logs are stored on an external shared disk that is accessible to one of the two nodes at any given time. The external disk used in this solution needs to be fail-proof to prevent the external disk from being a single point of failure. In normal operation, the shared disk is mounted on the active node, which uses the storage to run the queue manager in the same way as if the shared disk were a local disk, storing both the queues and the WebSphere MQ log files on the shared disk. When a failure is detected on the active node, the failover process is induced automatically, and the passive node then takes over the role of the active node, mounts the shared disk, and starts the queue manager. The passive node reads the logs and the queue manager's state from the shared disk to return to the correct state and resume normal operations. This failover operation can also be performed without the intervention of a server administrator, but doing so requires external clustering software to detect the failure and initiate the failover process.
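The failover sequence described above can be sketched as a small in-memory simulation. Everything here is an assumption made for illustration: `SharedDisk`, `Node`, and `failover` are hypothetical names, the "disk" is a Python list, and the `failover` function stands in for what external clustering software would do after detecting the failure.

```python
class SharedDisk:
    """Hypothetical stand-in for the external disk holding queues and logs."""

    def __init__(self):
        self.mounted_by = None
        self.messages = []  # persistent messages survive a node failure

    def mount(self, node_name):
        # Only one of the two nodes may access the disk at any given time.
        if self.mounted_by is not None:
            raise RuntimeError(f"disk already mounted by {self.mounted_by}")
        self.mounted_by = node_name

    def unmount(self):
        self.mounted_by = None


class Node:
    def __init__(self, name, disk):
        self.name = name
        self.disk = disk
        self.running = False

    def start_queue_manager(self):
        self.disk.mount(self.name)
        self.running = True

    def fail(self):
        # Simulated crash: the queue manager stops serving requests.
        self.running = False

    def put(self, message):
        if not self.running:
            raise ConnectionError(f"queue manager on {self.name} is down")
        self.disk.messages.append(message)


def failover(active, passive):
    """Move the queue manager to the passive node if the active one is down."""
    if active.running:
        return active
    active.disk.unmount()          # release the disk from the failed node
    passive.start_queue_manager()  # mount the disk and restart on the standby
    return passive
```

In this sketch a message put before the crash is still on the shared disk after `failover` returns, but any `put` attempted between `fail` and `failover` raises `ConnectionError`, which mirrors the outage window the client must tolerate.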
Clustering software is sometimes used in conjunction with the active-passive solution. High availability clustering software addresses high availability issues using a more holistic approach than individual applications. This clustering software groups applications and other hardware and software resources into groups called resource groups. High availability clusters (also known as failover clusters) are implemented primarily for the purpose of improving the availability of services that the cluster provides. High availability clusters operate by having redundant nodes, which are then used to provide service when system components fail. The most common size for a high availability cluster is two nodes, which is the minimum requirement to provide redundancy. High availability cluster implementations attempt to use redundancy of cluster components to eliminate single points of failure. When failure occurs in one of the applications in the group, the entire group is moved to a standby node. Several vendors provide clustering. Some solutions, such as Veritas Cluster Server and SteelEye LifeKeeper, are also compatible with multiple platforms to provide a similar solution in heterogeneous environments.
While the active-passive solution may be useful for messages that are delivered once and only once, and clustering software may make an existing application and its dependent resources, such as a database or message queue, highly available, there are a number of drawbacks associated with the active-passive solution. For example, the solution requires additional hardware (e.g., shared disks) and external clustering software (e.g., Veritas Cluster Server), which increase the cost of administering the components. Additionally, the resources on the idle passive node will not be utilized. Further, the queue manager is not available while the failure is being detected and until the services are restored on the passive node. Moreover, the client (i.e., the application or system that accesses a remote service on another computer system) must handle the outage during the failover, which might take from a few seconds to minutes. Finally, the application will not be available until the failover process is complete.
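One common way for a client to handle the outage during failover is to retry the send with an increasing delay. The sketch below assumes a hypothetical `send` callable that raises `ConnectionError` while the queue manager is unavailable; the function name, parameters, and default values are illustrative assumptions, not part of any messaging product's API.

```python
import time


def send_with_retry(send, message, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send(message), retrying on ConnectionError with doubling delays.

    `send` is a hypothetical callable supplied by the caller. The last
    failure is re-raised once `attempts` tries have been used up.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return send(message)
        except ConnectionError:
            if attempt == attempts:
                raise
            sleep(delay)  # wait for the failover to progress
            delay *= 2    # back off so the recovering node is not flooded
```

The retry loop hides short outages from the caller, but it does not shorten them: with the defaults above the client gives up after roughly 7.5 seconds of waiting, so a failover that takes minutes still surfaces as an error.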
An alternative to the active-passive solution is the active-active solution using WebSphere MQ queue manager clusters. A WebSphere MQ queue manager cluster is a cross-platform workload balancing solution that allows WebSphere MQ messages to be routed around a failed queue manager. The WebSphere MQ queue manager cluster allows a queue to be hosted across multiple queue managers, thus allowing an application to be duplicated across multiple machines. The cluster provides a highly available messaging service, allowing incoming messages to be forwarded to any queue manager in the cluster for application processing. Therefore, if any queue manager in the cluster fails, new incoming messages continue to be processed by the remaining queue managers. While WebSphere MQ clustering provides continuous messaging for new messages, it is not a complete solution because it is unable to handle messages that have already been delivered to a queue manager for processing. Moreover, when the local queue manager fails, the local client will not be able to send any messages until the queue manager is brought up. The active-active solution using WebSphere MQ queue manager clusters may be useful for workload balancing across distributed systems, may allow for alternative queue managers to handle the load transparently when a queue manager goes down, and may be able to scale applications linearly through the use of new queue managers added to the cluster to aid in the processing of incoming messages. However, there are a number of drawbacks associated with this solution. For example, there is no way to process the messages that have already been delivered to a queue manager that has just failed. Additionally, the application will be required to handle the outage until the local queue manager is restarted, during which time it cannot send or consume messages.
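The behavior described above, in which new messages are routed around a failed queue manager while messages already delivered to it remain stranded, can be illustrated with a minimal simulation. The classes `QueueManager` and `Cluster` and the round-robin policy are assumptions made for this sketch; they do not reproduce the actual WebSphere MQ workload balancing algorithm.

```python
from itertools import cycle


class QueueManager:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.queue = []  # messages delivered here and awaiting processing


class Cluster:
    """Hypothetical cluster that spreads messages over its queue managers."""

    def __init__(self, managers):
        self.managers = list(managers)
        self._ring = cycle(self.managers)

    def route(self, message):
        # Try each queue manager at most once, skipping any that is down.
        for _ in range(len(self.managers)):
            qm = next(self._ring)
            if qm.up:
                qm.queue.append(message)
                return qm.name
        raise ConnectionError("no queue manager available in the cluster")

    def stranded(self):
        # Messages held by failed queue managers cannot be processed
        # until those queue managers are restarted.
        return {qm.name: list(qm.queue)
                for qm in self.managers if not qm.up and qm.queue}
```

With two queue managers, marking one as down causes every later `route` call to land on the survivor, while `stranded` still reports the messages the failed queue manager had already accepted.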
A combination of the active-passive solution and the active-active solution using WebSphere MQ queue manager clusters may provide a better solution for time-sensitive messaging. Thus, WebSphere MQ clustering may be used together with recovery techniques such as shared disks with clustering software. Such a solution may be implemented by combining external clustering technology with WebSphere MQ queue manager clusters, providing combined benefits for achieving high availability. WebSphere MQ clustering with high availability software enabled shared disks may eliminate issues associated with stranded messages by processing such messages via other active nodes. However, the queue manager in such a solution is not available while the failure is being detected and until the services are restored on the passive node. Additionally, the client must handle the outage during the failover, which might take from a few seconds to minutes. Further, the application is not available until the failover process is complete. Finally, such a solution is neither ideal nor cost-effective for every enterprise application, as it incurs extra costs for additional hardware and software.
While several vendors in the market provide software or hardware to address high availability related issues, in many scenarios, vendor software requires complex installation and configuration steps. Further, during the failover, the client application must handle the broken connection and wait until the failover is complete. In a mission-critical application like visual voice mail, such a delay may not be acceptable.
Applications using existing solutions are prone to failure for a number of reasons: configuring and tuning the software is complex and requires a highly specialized administrator; due to the complexity of cluster and software configuration, incorrect configuration can cause many production outages and service interruptions; setting up shared disks and software incurs extra administrative costs, and this kind of setup needs a dedicated administrator for monitoring and support; the queue manager will not be available while the failure is being detected and until the services are restored on the passive node; the messages on the failed queue manager are stranded until the queue manager is restarted; there may be a single point of failure; the client will have to handle the outage during the failover, which might take from a few seconds to minutes; the application will not be available until the failover process is complete; there are licensing and maintenance costs associated with the software; there is no control over the failover mechanism; there is performance overhead with complicated cluster configuration; and achieving a repeatable configuration process takes a lot of time, resources, and documentation overhead.
Hence a need exists for an improved system for ensuring high availability for an enterprise application.