The present invention relates in general to synchronization of a plurality of CPUs in a fault tolerant system and in particular to methods and systems for establishing synchronization between a primary CPU and a newly added backup CPU where the CPUs normally operate in an active-replication mode.
There are many types of fault tolerant groups of central processor units (CPUs). Two of these are designated in the art as active-replication and active-standby.
These fault tolerant systems are widely known and are discussed in various periodicals and books. A good reference on the subject is in a book entitled “Distributed Systems (Second Edition)” authored by Sape Mullender, published by Addison-Wesley Publishing Company and copyrighted in 1993, and incorporated herein by reference in its entirety. Active-replication and active-standby are among the subjects discussed. In particular, the material on pages 97-138 and 464-481 of “Distributed Systems” is believed pertinent as background material. More detail on one active-standby system may be found in a patent application having Ser. No. 09/408,619, filed Sep. 30, 1999, now U.S. Pat. No. 6,345,282, entitled “MULTI-PROCESSOR DATA SYNCHRONIZATION METHOD AND APPARATUS” to Corey Minyard, assigned to Nortel Networks Corporation, and incorporated herein by reference in its entirety. Total order systems are also discussed in both the referenced book and the referenced co-pending application.
An active-replication system, once a primary or active CPU is synchronized to all backup CPU(s), operates upon the principle that all incoming messages and data are received and manipulated in the same manner by all CPUs in the group. In other words, the backup CPUs perform exactly the same processing as the primary CPU. A problem with such prior art systems is that the applications running in them cannot be synchronized without stopping processing during the synchronization. Additionally, if no transactional messages are to be lost, such messages must be stored in very large message queues for all CPUs involved while synchronization is taking place. Thus, while an active-replication system has very desirable normal operation characteristics, the synchronization characteristics leave a great deal to be desired.
An active-standby system, once synchronized, passes check-point messages from the primary CPU to all backup CPUs to update the data in each of the backup CPU databases. Additionally, each backup CPU maintains a list of messages to be processed that are received at the same time that the primary CPU receives the message. The messages to be processed are discarded by the backup CPU(s) when the backup CPU receives a check-point message corresponding to the message to be processed. Many prior art active-standby systems having backup CPUs have required the stoppage of processing of incoming messages while data is being synchronized.
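The backup-side bookkeeping described above can be illustrated with a minimal sketch. The class name `StandbyBackup`, its method names, and the use of a message identifier to match a check-point message to its external message are all hypothetical choices made for illustration; the text does not specify how the correspondence is established.

```python
from dataclasses import dataclass, field


@dataclass
class StandbyBackup:
    """Hypothetical backup CPU in an active-standby group (illustrative only)."""
    database: dict = field(default_factory=dict)  # record key -> dict of fields
    pending: dict = field(default_factory=dict)   # message id -> unprocessed message

    def on_external_message(self, msg_id, payload):
        # The backup receives the message at the same time as the primary
        # and holds it in case it must take over processing.
        self.pending[msg_id] = payload

    def on_checkpoint(self, msg_id, record_key, fields):
        # The check-point carries the primary's result: update the backup
        # database, then discard the corresponding pending message.
        self.database.setdefault(record_key, {}).update(fields)
        self.pending.pop(msg_id, None)


backup = StandbyBackup()
backup.on_external_message(1, "call setup")
backup.on_external_message(2, "call teardown")
backup.on_checkpoint(1, "subscriber-42", {"state": "active"})
```

After the check-point for message 1 arrives, only message 2 remains in the pending list, so a failover at this point would require the backup to process message 2 alone.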
Other prior art methods of obtaining synchronization involve the transfer of all the data records of the primary CPU to the newly online backup CPU enough times to make sure that all the records that were changed during the first transfer have been properly updated in the backup CPU.
The referenced co-pending patent application operates in accordance with the idea of continuing processing by the primary CPU while it is bringing a new backup CPU into synchronization. This is accomplished by having all external messages, received by the backup CPU subsequent to the commencement of data synchronization and that are to be processed by the primary CPU, stored in a message list of the backup CPU. Check-point message data is intelligently stored by first deleting related external messages from message list storage and then creating a record if none exists and filling only those fields referenced in the check-point message. If, on the other hand, a record does exist, only the check-point message data fields are altered in that existing record. When a data synchronization record is received by the backup CPU, a check is made to see if such a record has already been created by a check-point message. If not, a record is created in the backup CPU database and all the fields are made to correspond with the received data synchronization record message. If such a record is found, only those fields not already containing check-point data are filled from the received data synchronization record message. In this manner a single pass through the primary CPU's database is sufficient to obtain data synchronization of the backup CPU.
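The record-merging rule just described can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the referenced application: the class `SyncingBackup`, its method names, the dictionary-based record layout, and the per-record set used to remember which fields hold check-point data are all hypothetical.

```python
class SyncingBackup:
    """Hypothetical backup CPU being synchronized while the primary keeps running."""

    def __init__(self):
        self.database = {}      # record key -> {field name: value}
        self.checkpointed = {}  # record key -> field names written by check-points
        self.message_list = {}  # message id -> external message awaiting a check-point

    def on_external_message(self, msg_id, payload):
        self.message_list[msg_id] = payload

    def on_checkpoint(self, msg_id, key, fields):
        # First delete the related external message from the message list.
        self.message_list.pop(msg_id, None)
        # Create the record if none exists, then alter only the fields
        # referenced in the check-point message.
        self.database.setdefault(key, {}).update(fields)
        self.checkpointed.setdefault(key, set()).update(fields)

    def on_sync_record(self, key, fields):
        record = self.database.setdefault(key, {})
        newer = self.checkpointed.get(key, set())
        # Fill only the fields that do not already contain check-point data;
        # for a record that did not exist, this fills every field.
        for name, value in fields.items():
            if name not in newer:
                record[name] = value


backup = SyncingBackup()
backup.on_external_message(7, "update location")
backup.on_checkpoint(7, "sub-1", {"cell": "B"})
backup.on_sync_record("sub-1", {"cell": "A", "plan": "basic"})
backup.on_sync_record("sub-2", {"cell": "C", "plan": "pro"})
```

In this example the `cell` field of `sub-1` keeps the newer check-point value `B` rather than the older value `A` carried by the data synchronization record, which is why one pass through the primary database suffices.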
In a cellular telephone system involving thousands of customers, the data transfer time required to synchronize a newly online backup CPU, while the system is running, may take many hours when using prior art synchronization approaches. In such a system, the large data stores, high transaction rates and low downtime requirements mandate that newly online backup CPUs be able to synchronize without special memory or queuing requirements and in a minimal time. Known prior art active-replication systems either stop processing or do message queuing during synchronization. Such fault tolerant system limitations cannot be tolerated in the environment of present day cellular telephone systems.
Since active-replication systems eliminate the requirement of passing check-point messages from the primary CPU once a backup CPU is synchronized, the primary CPU has more time available for processing data than do active-standby systems having the same theoretical processing power. It would thus be desirable for an active-replication system to be able to synchronize a backup CPU to a primary CPU without discontinuing processing and without requiring hardware to maintain an extremely large message queue while performing such a synchronization.
The present invention accordingly provides an active-replication system which can synchronize a backup CPU to a primary CPU without discontinuing processing and without requiring hardware to maintain an extremely large message queue while performing such a synchronization. To that end, the present invention comprises a fault tolerant processing system using total order, having a primary CPU normally operating in an active-replication mode, and a backup CPU interconnected to the primary CPU and that requires synchronization with the primary CPU. An “add me” request signal is sent from the backup CPU to the primary CPU to cause the primary CPU to temporarily switch to an active-standby mode. A “finished” signal is sent from the primary CPU to the backup CPU when copies of all data synchronization records have been transmitted to the backup CPU. Both the primary and the backup CPUs are caused to revert to an active-replication mode substantially immediately after transmission of the “finished” signal.
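The signal sequence summarized above may be pictured with the following simplified sketch. It models only the mode transitions and the two signals; message ordering, check-point traffic during synchronization, and concurrent processing are deliberately left out. The class names, method names, and the synchronous in-process calls standing in for inter-CPU signals are assumptions made purely for illustration.

```python
from enum import Enum


class Mode(Enum):
    ACTIVE_REPLICATION = "active-replication"
    ACTIVE_STANDBY = "active-standby"


class Backup:
    """Hypothetical newly added backup CPU."""

    def __init__(self):
        self.mode = Mode.ACTIVE_STANDBY  # not yet synchronized
        self.records = {}

    def request_add(self, primary):
        # Send the "add me" request signal to the primary.
        primary.handle_add_me(self)

    def receive_sync_record(self, key, fields):
        self.records[key] = fields

    def receive_finished(self):
        # Revert to active-replication on receipt of the "finished" signal.
        self.mode = Mode.ACTIVE_REPLICATION


class Primary:
    """Hypothetical primary CPU normally running in active-replication mode."""

    def __init__(self, records):
        self.records = records
        self.mode = Mode.ACTIVE_REPLICATION
        self.mode_history = [self.mode]

    def _set_mode(self, mode):
        self.mode = mode
        self.mode_history.append(mode)

    def handle_add_me(self, backup):
        # Temporarily switch to active-standby while records are copied.
        self._set_mode(Mode.ACTIVE_STANDBY)
        for key, fields in self.records.items():
            backup.receive_sync_record(key, dict(fields))
        # All data synchronization records sent: signal "finished" and revert.
        backup.receive_finished()
        self._set_mode(Mode.ACTIVE_REPLICATION)


primary = Primary({"sub-1": {"cell": "A"}})
backup = Backup()
backup.request_add(primary)
```

Once `request_add` returns, both objects are back in `Mode.ACTIVE_REPLICATION`, and the primary's `mode_history` shows the temporary excursion into active-standby mode.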