The development of telecommunications call processing or switching systems constructed from a distributed set of general purpose computing systems is emerging as an area of particular interest in the art. See, for example, H. Blair, S. J. Caughey, H. Green and S. K. Shrivastava, "Structuring Call Control Software Using Distributed Objects," International Workshop on Trends in Distributed Computing, Aachen, Germany, 1996; T. F. LaPorta, M. Veeraraghavan, P. A. Treventi and R. Ramjee, "Distributed Call Processing for Personal Communication Services," IEEE Communications Magazine, vol. 33, no. 6, pp. 66-75, June 1995; and TINA-C, Service Architecture Version 2.0, March 1995.
As noted in a paper published by T. F. LaPorta, A. Sawkar and W. Strom, entitled "The Role of New Technologies in Wireless Access Network Evolution," that appeared in Proceedings of International Switching Symposium (ISS '97), IS-03.18, 1997, systems employing distributed call processing architectures exhibit increased system scalability, performance, and flexibility. Additionally, advances in open distributed processing, such as the Common Object Request Broker Architecture (CORBA), described in "The Common Object Request Broker: Architecture and Specification," by the Object Management Group (OMG) Rev. 2.0, July 1995, facilitate portable and interoperable implementations of distributed software architectures in a heterogeneous computing environment. As is known, systems employing such technologies advantageously leverage a rapidly increasing price/performance ratio of "off-the-shelf" computing components.
The stringent performance and availability requirements of public telecommunications systems pose particular challenges to developing highly available distributed call processing systems which incorporate these off-the-shelf computing components. Specifically, and as noted by A. R. Modarressi and R. A. Skoog in an article entitled "Signaling System No. 7: A Tutorial," which appeared in IEEE Communications Magazine, vol. 28, no. 7, pp. 19-35, July 1990, call processing software must process each call request within a few hundred milliseconds, and a switching system may not be out of service for more than a few minutes per year. As such, present day switching systems employ custom-designed fault-tolerant processors and special-purpose operating systems to meet these stringent requirements. In order for next generation switching systems to be built using general purpose computing platforms, software-based fault-tolerant methods and systems are required to achieve the same or similar performance and availability goals.
Two software methods for enhancing the level of fault tolerance in a distributed computing environment that have been described in the literature are checkpointing and message logging. See, for example, E. N. Elnozahy, D. B. Johnson and Y. M. Wang, "A Survey of Rollback-Recovery Protocols in Message-Passing Systems," Tech. Report CMU-CS-96-181, School of Computer Science, Carnegie Mellon University, October 1996, and R. E. Strom and S. Yemini, "Optimistic Recovery in Distributed Systems," ACM Transactions on Computer Systems, vol. 3, no. 3, pp. 204-226, August 1985. Briefly stated, checkpointing involves periodically taking a "snapshot" and saving the entire state of a software process, while messages sent or received by the software process are logged (message logging) between successive checkpoints. Assuming a piecewise deterministic execution model, and as described by Y. Huang and Y. M. Wang, in an article entitled "Why Optimistic Message Logging has not been used in Telecommunications Systems," that appeared in the Proceedings of the 25th International Symposium on Fault-Tolerant Computing, pp. 459-463, 1995, the state of the process can be later reconstructed during a recovery process by restoring the most recent checkpoint and replaying the logged messages in their original order. As observed by Y. Huang and C. Kintala, in "Software Fault Tolerance in the Application Layer," which appeared in Software Fault Tolerance (M. R. Lyu, Ed.), John Wiley & Sons, Chichester, England, pp. 231-248, 1995, checkpointing, message logging, and "rollback" recovery techniques can be embedded into the operating system while remaining virtually transparent to application software.
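By way of illustration only, the combination of checkpointing and message logging described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not drawn from any of the cited references: the names CallCounterProcess, RecoverableProcess, deliver, take_checkpoint, recover and demo are hypothetical, the process is assumed to be piecewise deterministic (its state changes only in response to messages), only received messages are logged, and the checkpoint and log are held in memory, whereas a practical system would write them to stable storage.

```python
import copy


class CallCounterProcess:
    """Hypothetical toy process whose state changes only when a message arrives."""

    def __init__(self):
        self.state = {"active_calls": 0, "handled": 0}

    def handle(self, message):
        # Deterministic handler: the same messages in the same order
        # always produce the same state.
        if message == "setup":
            self.state["active_calls"] += 1
        elif message == "release":
            self.state["active_calls"] -= 1
        self.state["handled"] += 1


class RecoverableProcess:
    """Hypothetical wrapper adding checkpointing and message logging."""

    def __init__(self, process):
        self.process = process
        # Assumption: an in-memory copy stands in for stable storage.
        self.checkpoint = copy.deepcopy(process.state)
        self.log = []

    def deliver(self, message):
        # Log the message before it is processed so it can be replayed later.
        self.log.append(message)
        self.process.handle(message)

    def take_checkpoint(self):
        # Snapshot the entire state; earlier log entries are no longer needed.
        self.checkpoint = copy.deepcopy(self.process.state)
        self.log.clear()

    def recover(self):
        # Rollback recovery: restore the last snapshot, then replay the
        # logged messages in their original order.
        fresh = type(self.process)()
        fresh.state = copy.deepcopy(self.checkpoint)
        for message in self.log:
            fresh.handle(message)
        self.process = fresh
        return fresh


def demo():
    wrapper = RecoverableProcess(CallCounterProcess())
    for message in ["setup", "setup", "release"]:
        wrapper.deliver(message)
    wrapper.take_checkpoint()
    for message in ["setup", "setup"]:
        wrapper.deliver(message)
    state_before_failure = copy.deepcopy(wrapper.process.state)
    wrapper.process.state = None  # simulate losing the in-memory state
    recovered = wrapper.recover()
    return state_before_failure, recovered.state, len(wrapper.log)


if __name__ == "__main__":
    print(demo())
```

In this sketch, the number of messages replayed by recover equals the number delivered since the last call to take_checkpoint, so the interval between checkpoints directly determines the amount of replay work performed after a failure.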
Unfortunately, however, there are numerous disadvantages to these approaches when applied to distributed call processing systems. First, taking a snapshot of the entire process state may create a long period of time during which the process is unable to service requests from its clients, thereby increasing end-to-end call setup latency. Second, a single call request may involve a significant number of message exchanges between functionally distributed servers. Consequently, logging every message becomes too time-consuming to meet the stringent call setup latency requirements of only a few hundred milliseconds associated with call processing. Additionally, if checkpoint intervals are made sufficiently long in an attempt to minimize checkpoint overhead, a prohibitively large number of messages may need to be replayed after a failure, thereby making recovery time unacceptably long. Consequently, a continuing need exists in the art for software-based fault-tolerant computing systems suitable for demanding telecommunications applications.