The present invention relates generally to a method and mechanism for providing a non-stop, fault-tolerant telecommunications system and, more particularly, to a method and mechanism which provide online testing, replacement and modification of improperly functioning portions of the system and upgrading of portions of the system.
Numerous telecommunications systems are currently available which employ sophisticated computer systems to provide services to customers. Many of these telecommunications systems operate in environments wherein the system cannot be down for any length of time. For example, emergency vehicle communications systems must remain operational during failure of portions of the system, or even of the system itself.
Current systems have attempted to provide this non-stop, fault-tolerant operation by employing various methods. One common method is to provide a complete backup system. If the main system malfunctions, the main system is replaced online by the backup system. However, as is apparent, a complete backup system is relatively expensive, occupies significant space and requires significant maintenance.
Another method for attempting to achieve non-stop, fault-tolerant operation is to provide for hardware replacement, such as board replacement, and for operating system upgrades in a multiprocessor environment. Unfortunately, such methods interrupt system operation and customer service.
An important feature of current communications systems is their ability to permit upgrades. Although various methods have been developed to perform system upgrades, all known methods unfortunately result in a disruption of system operation. In one method, the system is halted during an upgrade and another version of the system is started. Even though only a portion of a system is typically upgraded, the whole system is affected. Because of the various task interdependencies in current systems, eliminating (or halting) one task may result in queue overflows, expired timers, messages not being received and, ultimately, a system crash. As those skilled in the art will readily comprehend, such problems are especially unacceptable for real-time or emergency systems.
Accordingly, there is a need in the art for a method and mechanism for providing non-stop, fault-tolerant operation of a telecommunications system that permits replacements, modifications and upgrades without interrupting operation of the system.