The successful completion of applications depends on the fault-free operation of critical system components. In distributed computing systems, these critical system components typically include application processes, the devices (clients or servers) on which application processes execute, and the communication mechanism used to communicate between them. However, any of these components may fail during operation. Such failures may have implications for a user ranging from mere annoyance to significant financial losses. Therefore, from a user's perspective, there is a need for system reliability. Reliability is the property of a computing system that allows it to run continuously without crashing. In situations where it may not be possible to avoid all component failures, reliability from a user's perspective can be provided by masking these failures. Fault tolerance allows a system to run and offer its services to a user even in the presence of failures.
Messaging is considered a key communication mechanism in distributed systems, where it is a popular choice for applications that require a high degree of reliability, e.g., web services, remote procedure calls and e-commerce transactions. Messaging allows applications to communicate with each other via message passing, ensuring that messages are delivered according to application-specified delivery semantics, such as at-most-once, at-least-once and exactly-once.
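The three delivery semantics named above can be illustrated with a minimal sketch. Every name in it (`Receiver`, `ScriptedChannel`, `send_at_most_once`, `send_at_least_once`, `deliveries`) is hypothetical and introduced only for illustration, and the scripted channel is an assumed stand-in for a real network. The sketch assumes that exactly-once delivery is approximated by combining at-least-once retransmission with receiver-side de-duplication on a message identifier, which is one common approach and not necessarily the one used by any particular messaging system.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    """Hypothetical receiver that can optionally de-duplicate by message ID."""
    deduplicate: bool = False
    delivered: list = field(default_factory=list)
    seen_ids: set = field(default_factory=set)

    def receive(self, msg_id, payload):
        # With de-duplication enabled, a retransmitted message is
        # acknowledged but not handed to the application a second time.
        if self.deduplicate and msg_id in self.seen_ids:
            return True
        self.seen_ids.add(msg_id)
        self.delivered.append(payload)
        return True  # acknowledgement


class ScriptedChannel:
    """Hypothetical channel whose per-attempt outcome is scripted.

    Each outcome is one of:
      "ok"        message and acknowledgement both arrive
      "drop_msg"  message is lost before reaching the receiver
      "drop_ack"  message arrives but the acknowledgement is lost
    """

    def __init__(self, outcomes):
        self.outcomes = list(outcomes)

    def transmit(self, receiver, msg_id, payload):
        # Once the script is exhausted, every attempt succeeds.
        outcome = self.outcomes.pop(0) if self.outcomes else "ok"
        if outcome == "drop_msg":
            return False
        acked = receiver.receive(msg_id, payload)
        return acked and outcome != "drop_ack"


def send_at_most_once(channel, receiver, msg_id, payload):
    # A single attempt that is never retried: the message may be lost,
    # but it can never be delivered twice.
    return channel.transmit(receiver, msg_id, payload)


def send_at_least_once(channel, receiver, msg_id, payload, max_attempts=5):
    # Retransmit until an acknowledgement arrives: the message is not
    # lost (within max_attempts), but it may be delivered more than once.
    for _ in range(max_attempts):
        if channel.transmit(receiver, msg_id, payload):
            return True
    return False


def deliveries(sender, outcomes, deduplicate=False):
    """Return how many times the application saw the message."""
    receiver = Receiver(deduplicate=deduplicate)
    sender(ScriptedChannel(outcomes), receiver, 1, "order-42")
    return len(receiver.delivered)
```

For example, `deliveries(send_at_least_once, ["drop_ack", "ok"])` counts two deliveries because the lost acknowledgement triggers a retransmission, whereas the same call with `deduplicate=True` counts one.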
A message-based communication system that is fault tolerant, referred to as a reliable messaging system, ensures the reliable delivery of messages according to specified delivery semantics despite device (client or server) and network failures. This is accomplished by making the reliable messaging system tolerant of various types of failures, which may require implementing different fault tolerance schemes for fault detection or recovery. Additionally, a reliable messaging system may support asynchronous operation, which imposes no limit on the time it takes to send or receive messages over a network. Asynchronous operation allows interconnected devices to communicate with each other, using point-to-point messaging or a centralized messaging or queuing server, even if one of the devices is temporarily unavailable.
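Asynchronous operation through a centralized queuing server can be sketched as follows. The `QueueServer` class and its methods are hypothetical names chosen for illustration, and the in-memory queues are an assumption made for brevity; a fault-tolerant server would be expected to keep its queues in stable storage.

```python
from collections import defaultdict, deque


class QueueServer:
    """Hypothetical in-memory queuing server (illustration only)."""

    def __init__(self):
        # One first-in, first-out queue per recipient.
        self._queues = defaultdict(deque)

    def send(self, recipient, payload):
        # The sender returns immediately; the recipient need not be
        # connected at the time the message is sent.
        self._queues[recipient].append(payload)

    def receive_all(self, recipient):
        # Called by the recipient whenever it becomes available again.
        # Returns the pending messages in send order and empties the queue.
        queue = self._queues[recipient]
        messages = list(queue)
        queue.clear()
        return messages
```

A sender may call `send("device-b", ...)` several times while the device is disconnected; when the device reconnects, a single `receive_all("device-b")` call returns the queued messages in order.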
Fault tolerance usually requires some sort of redundancy; for example, an application may have to save its state periodically to stable storage in order to ensure that it can recover from failures. Research has shown that there is a significant trade-off between the level of fault tolerance, which includes reliability guarantees and recovery speed, and the system performance during failure-free operation. This trade-off results from the varying amounts of computing overhead associated with message logging, fault detection and recovery operations for different fault tolerance schemes. Accordingly, an application may wish to specify precise fault tolerance and performance requirements for a reliable messaging system. These requirements may vary over the course of execution of an application and may differ among applications.
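The periodic saving of state to stable storage, and the trade-off it creates, can be illustrated with a minimal sketch. The function names, the JSON file format and the `checkpoint_every` parameter are assumptions introduced for illustration only. A small `checkpoint_every` value means less work is repeated after a failure but more write overhead is incurred during failure-free operation; a large value has the opposite effect.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force the data to the storage device
        os.replace(tmp_path, path)  # atomic rename over the old checkpoint
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def process(messages, path, checkpoint_every):
    """Sum numeric messages, checkpointing every `checkpoint_every` messages.

    Resumes from the last checkpoint if one exists. Returns the final
    state and the number of checkpoint writes performed in this run.
    """
    state = load_checkpoint(path, {"next_index": 0, "total": 0})
    writes = 0
    for i in range(state["next_index"], len(messages)):
        state["total"] += messages[i]
        state["next_index"] = i + 1
        if state["next_index"] % checkpoint_every == 0:
            save_checkpoint(path, state)
            writes += 1
    return state, writes
```

If a run fails after the fifth of six messages with `checkpoint_every=2`, the next run resumes from the checkpoint taken after the fourth message and repeats only the fifth.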
In addition, reliable messaging systems will need to operate in computing environments that may exhibit great heterogeneity among the networks, applications/services and devices forming part of these environments. For example, a wireless environment may include changing networks and changing network conditions, including frequent disconnections, asymmetric networks and networks with unpredictable delay and loss characteristics. In addition, various applications executing within a wireless environment may impose changing service characteristics and service requirements for reliability and fault tolerance. Also, wireless environments may include heterogeneous devices with differing processing power, storage, memory and battery resources, as well as changing load.
Traditional techniques for implementing reliable messaging for distributed systems have primarily focused on static reliable messaging systems. These systems are unable to adapt to changing conditions in a heterogeneous environment. Known reliable messaging systems can provide only limited levels of fault tolerance and rely on fixed transport protocols, usually the Transmission Control Protocol (TCP), which may not be optimized for a heterogeneous or wireless environment.
Therefore, there is a need for an improved reliable messaging system that can provide dynamic re-configurability and fault tolerance in a heterogeneous computing environment.