The present disclosure relates generally to reliable communication systems and, in particular, to methods, systems, and computer program products for continuous availability of non-persistent messages in a distributed platform.
In a real-time or near real-time computing environment, messages exchanged between system components in a distributed platform must be delivered reliably and with minimal delay. An example of such an environment is a 300 mm semiconductor fabrication (FAB) manufacturing environment. Such demanding environments cannot afford to lose critical messages or tolerate extended delays, which could result in halting operation and potentially damaging equipment or products. Critical messages may include control commands, health status, or other information necessary for proper system performance. A common technique for sending messages between a source and a target application is message queue (MQ) communication. One approach to providing reliable MQ messaging is to use persistent messages. Persistent messages may be stored by the message transmitter so that a message can be resent if a communication attempt fails. While the use of persistent messages can increase system reliability, persistent messages may be unsuitable for a real-time or near real-time environment because persistence introduces additional delay. For example, a near real-time system may require a maximum latency of 100 milliseconds to maintain system integrity; however, persistent messages may have a latency of 1 to 2 seconds. During a long delay period, critical messages may be lost or overwritten because new messages can arrive before existing messages are processed.
Another approach to providing reliable system performance for critical applications is through the use of high availability cluster multi-processing (HACMP). While HACMP may be effective at providing reliable system performance, recovery after a system failure can take longer than 2 to 3 minutes. During this delay period, other system components may time out, resulting in a larger-scale impact, including a potential loss of revenue in a manufacturing environment. HACMP systems can also be very expensive to implement, and thus may not warrant the expenditure, depending upon the revenue-generating potential of the operating environment. HACMP systems can provide a high degree of availability, but fault recovery delays and implementation expense make them unsuitable for near real-time, cost-sensitive environments.
Many existing approaches to providing a high degree of system reliability and availability are not well suited to real-time or near real-time environments. A further consideration in systems that use backup storage and recovery approaches is the indeterminacy of recovery time. That is, some systems may recover in a timely fashion upon certain failures, while recovery may take longer for different failure modes. The risk of a system failing to recover in a timely fashion across a large range of failures makes such a system unsuitable for critical applications that cannot afford downtime. Another factor that must be considered in a high availability system is the ability to perform system maintenance and upgrades without bringing the entire system offline. System components such as remote clients must still be able to communicate with other clients in a distributed platform, while maintaining a rapid response time and not losing critical messages.
What is needed, therefore, is a way to provide continuous availability of messages without relying on message persistence in a distributed platform.