The invention relates to fault tolerant server systems, and more particularly to fault tolerant server systems including redundant servers.
High availability of service in a telecommunication system can be achieved by means of fault tolerant computers or distributed system architectures. The use of such redundancy, however, may adversely affect other system properties. For example, the utilization of redundancy on the hardware level increases cost, physical volume, power dissipation, fault rate, and the like. This makes it impractical to use multiple levels of redundancy within a system.
For example, distributed systems can incorporate replication between computers in order to increase robustness. If each of these computers is fault tolerant, costs will multiply. Furthermore, if backup copies are kept in software for the purpose of being able to recover from software faults, the cost of the extra memory is multiplied by the cost of the fault tolerant hardware and again by the number of copies in the distributed system. Thus, in order to keep costs low, it is advisable to avoid the use of multiple levels of redundancy. Since the consequence of such a design choice is that only one level of redundancy will be utilized, that level should be selected so as to cover as many faults and other disturbances as possible.
Disturbances can be caused by hardware faults or software faults. Hardware faults may be characterized as either permanent or temporary. In either case, such faults may be covered by fault-tolerant computers. Given the rapid development of computer hardware, the total number of integrated circuits and/or devices in a system will continue to decrease, and each such integrated circuit and device will continue to improve in reliability. In total, hardware faults are not a dominant cause of system disturbances today, and will be even less so in the future. Consequently, it will be increasingly difficult to justify having a separate redundancy, namely fault tolerant computers, just to handle potential hardware faults.
The same is not true with respect to software faults. The complexity of software continues to increase, and the requirement for shorter development time prevents this increasingly complex software from being tested in all possible configurations, operation modes, and the like. Better test methods can be expected to fully debug normal cases. For faults that occur only under very special conditions, the so-called "Heisenbugs", there is no expectation that it will be either possible or economical to perform a full test. Instead, these kinds of faults need to be covered by redundancy within the system.
A loosely coupled replication of processes can cover almost all hardware and software faults, including the temporary faults. As one example, it was reported in I. Lee and R. K. Iyer, "Software Dependability in the Tandem Guardian System," IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, vol. 21, No. 5, May 1995 that checkpointing (i.e., the copying of a present state to a stand-by computer) and restarting (i.e., starting up execution from a last checkpointed state by, for example, reading a log of the transactions that have occurred since the last checkpoint and then starting to process new ones) covers somewhere between 75% and 96% of the software faults, even though the checkpointing scheme was designed into the system to cover hardware faults. The explanation given in the cited report is that software faults that are not identified during testing are subtle and are triggered by very specific conditions. These conditions (e.g., memory state, timing, race conditions) do not recur in the backup process after it takes over; consequently, the software fault does not recur.
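The checkpoint-and-restart scheme described above can be illustrated with a minimal sketch. The names Primary, Standby, apply_txn, and checkpoint_every, as well as the (account, amount) transaction format, are assumptions made purely for illustration and do not come from the cited report. The stand-by computer is modeled here as an in-process object rather than a separate machine.

```python
from dataclasses import dataclass, field


def apply_txn(state: dict, txn: tuple) -> None:
    """Apply one hypothetical (account, amount) transaction in place."""
    account, amount = txn
    state[account] = state.get(account, 0) + amount


@dataclass
class Standby:
    """Hypothetical stand-by: holds the last checkpoint plus a transaction log."""
    checkpoint: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def store_checkpoint(self, state: dict) -> None:
        # Checkpointing: copy the primary's present state. Transactions
        # logged earlier are now covered by the checkpoint, so drop them.
        self.checkpoint = dict(state)
        self.log.clear()

    def append(self, txn: tuple) -> None:
        # Record a transaction that occurred since the last checkpoint.
        self.log.append(txn)

    def restart(self) -> dict:
        # Restarting: begin from the last checkpointed state, then replay
        # the log of transactions that occurred after that checkpoint.
        state = dict(self.checkpoint)
        for txn in self.log:
            apply_txn(state, txn)
        return state


class Primary:
    """Hypothetical primary that checkpoints every few transactions."""

    def __init__(self, standby: Standby, checkpoint_every: int = 3):
        self.state: dict = {}
        self.standby = standby
        self.checkpoint_every = checkpoint_every
        self._since_checkpoint = 0

    def process(self, txn: tuple) -> None:
        # Log before applying, so a takeover can replay this transaction.
        self.standby.append(txn)
        apply_txn(self.state, txn)
        self._since_checkpoint += 1
        if self._since_checkpoint >= self.checkpoint_every:
            self.standby.store_checkpoint(self.state)
            self._since_checkpoint = 0
```

In this sketch, if the primary fails after processing five transactions with checkpoint_every set to 3, the stand-by holds a checkpoint covering the first three and a log of the last two. Calling restart() rebuilds the same state the primary had reached. Because the backup replays only the logged transactions on a fresh process, the memory state and timing that triggered a fault in the primary need not be reproduced.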
A problem with replication in a network is that there are a few services, such as arbitration of central resources, that do not lend themselves to distribution. This type of service must be implemented in one process and needs, for performance reasons, to keep its data on its stack and heap. To achieve redundancy, this type of process must then be replicated within the distributed network. In a high performance telecommunication control system this replication must be done with very low overhead and without introducing any extra delays.