1. Technical Field
This invention relates to the field of distributed messaging and more particularly to a distributed messaging system for transmitting topical data messages from data publishers to data consumers.
2. Description of the Related Art
Conventional distributed computing can require that multiple computer application processes share data over a computer communications network. Specifically, distributed computer applications can require that processes running in one computing device share data with one or more processes running in other computing devices communicatively connected to one another in a computer communications network. Communications between distributed applications can require significant coordination and control in order to ensure that received data is correct and accurate. Disruption in communications can be catastrophic if the distributed applications cannot adequately detect a break in communications and respond accordingly.
In a typical distributed system there may be hundreds of computers running many application programs. In consequence, sharing data entails not only establishing a means of communication between the application programs across a network but also providing the capability to recover from failures. These failures may be due to physical network problems, software problems, or other error conditions. Recovering from a fault or problem is a critical issue in the arena of distributed computing.
The problem of communications disruption in distributed computing has been addressed in U.S. Pat. No. 5,887,127 for Self-Healing Network Initiating Fault Restoration Activities from Nodes at Successively Delayed Instants issued on Mar. 23, 1999 to Saito et al., and in U.S. Pat. No. 5,390,326 for Local Area Network with Fault Detection and Recovery issued on Feb. 14, 1995 to Shah. Both Shah and Saito illustrate how previous work in providing resiliency in communications between processes has focused on the underlying communications network. Specifically, Shah teaches the generation and transmission of a heartbeat signal from various nodes in a network in order to monitor the network for the occurrence of a fault. In contrast, Saito is directed towards coordinating fault recovery among several nodes in a network. In particular, Saito provides for time-staggered fault recovery among the various nodes in a network. Additionally, U.S. Pat. No. 5,319,774 for Recovery Facility for Incomplete Sync Points for Distributed Application issued on Jun. 7, 1994 to Ainsworth et al. focuses on the re-synchronization of database files across disparate operating environments subsequent to the occurrence of a communications fault.
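By way of illustration only, heartbeat-based fault detection of the general kind described above can be sketched as follows. This sketch is not taken from the Shah patent; the class name `HeartbeatMonitor`, the timeout parameter, and the use of caller-supplied timestamps are all assumptions made for the example.

```python
class HeartbeatMonitor:
    """Hypothetical monitor that flags nodes whose heartbeats have stopped."""

    def __init__(self, timeout_seconds):
        # Longest silence tolerated before a node is suspected of failure (assumed parameter).
        self.timeout_seconds = timeout_seconds
        self._last_seen = {}  # node identifier -> time of most recent heartbeat

    def record_heartbeat(self, node_id, now):
        """Note that a heartbeat from node_id arrived at time `now`."""
        self._last_seen[node_id] = now

    def suspected_failures(self, now):
        """Return, in sorted order, the nodes silent for longer than the timeout."""
        return sorted(
            node_id
            for node_id, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record_heartbeat("node-a", now=0.0)
monitor.record_heartbeat("node-b", now=0.0)
monitor.record_heartbeat("node-a", now=4.0)  # node-b stays silent
failed = monitor.suspected_failures(now=6.0)
```

A monitor of this kind reports only that a node has gone silent. It carries no information about which messages were in flight when the fault occurred.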
Still, none of Shah, Saito, or Ainsworth teaches a distributed messaging system capable of recovering and re-synchronizing interprocess communications between data publishers and data consumers. Yet, distributed applications are increasingly utilizing asynchronous communications, typically in the form of messages between processes, as the means for sharing data and providing notification of events between application processes. In the event of a loss of communications, either through network failure or the failure of one of the communicating processes, messages may be lost, which can adversely affect the correct operation of the distributed system. Problems arise when attempting to restore communications between two applications in a distributed system because the processes involved in the communications must re-synchronize the message flow between them. This requires significant information to be maintained by each process involved in sending or receiving messages.
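The kind of information each process may need to maintain can be illustrated with a minimal sketch. The per-topic sequence numbers, the publisher-side message log, and the names `Publisher`, `Consumer`, `replay`, and `resync` are all assumptions made for this example; the text above does not prescribe any particular mechanism.

```python
from collections import defaultdict


class Publisher:
    """Hypothetical publisher that retains every message it has sent, per topic."""

    def __init__(self):
        # topic -> list of payloads; a payload at index i has sequence number i + 1
        self._log = defaultdict(list)

    def publish(self, topic, payload):
        """Record the payload and return the message as (topic, seq, payload)."""
        self._log[topic].append(payload)
        return (topic, len(self._log[topic]), payload)

    def replay(self, topic, after_seq):
        """Return retained messages on `topic` with sequence number > after_seq."""
        return [
            (topic, index + 1, payload)
            for index, payload in enumerate(self._log[topic])
            if index + 1 > after_seq
        ]


class Consumer:
    """Hypothetical consumer that tracks the last sequence number seen per topic."""

    def __init__(self):
        self.last_seq = defaultdict(int)
        self.received = []

    def deliver(self, message):
        """Accept an in-order message; return False if a gap is detected."""
        topic, seq, payload = message
        expected = self.last_seq[topic] + 1
        if seq < expected:
            return True   # duplicate of a message already processed; ignore it
        if seq > expected:
            return False  # at least one earlier message was lost
        self.last_seq[topic] = seq
        self.received.append(payload)
        return True

    def resync(self, publisher, topic):
        """Request everything after the last message processed on `topic`."""
        for message in publisher.replay(topic, self.last_seq[topic]):
            self.deliver(message)


publisher = Publisher()
consumer = Consumer()
m1 = publisher.publish("quotes", "a")
m2 = publisher.publish("quotes", "b")
m3 = publisher.publish("quotes", "c")

consumer.deliver(m1)
gap_free = consumer.deliver(m3)  # m2 was lost in transit
consumer.resync(publisher, "quotes")
```

Even in this simplified form, the publisher must retain its sent messages and the consumer must track its position on every topic, which suggests how the bookkeeping burden grows with the number of topics and communicating processes.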