This invention relates generally to fault tolerant computer systems and more particularly to redundant processor computer systems used in support of telephone systems.
Modern telephone systems handle large volumes of time critical information on a routine basis. In such systems fault tolerance is a high priority and the need exists for a redundant processor system. APZ is an example of a computer system in which both processors execute in lockstep. Tandem Integrity is an example of a triple redundancy processor system. In fault tolerant systems there are at least two Central Processing Units (CPU's) that run in parallel where one of the two CPU's is always in an Executive (EX) state and the other is in a Stand By (SB) state. Both CPU's run the same microcode and execute the same instructions. The difference between the EX CPU and the SB CPU, as the two processors will be referred to, is that the only CPU whose output is actually used by the system it supports is that of the EX CPU. Of course, as is normal in fault tolerant systems, if the EX CPU should ever fail, or otherwise be taken out of operation, the output connections would be immediately switched to the SB CPU. In this manner the SB CPU could take over the processing chores of the system at any time, thus making the system fault tolerant. Examples of well known CPU's include the X86 family, Pentium and Pentium II CPU's manufactured by the Intel Corporation.
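The EX/SB arrangement described above can be illustrated with a minimal software sketch. It is only a model under stated assumptions: the class names Cpu and RedundantPair, the single accumulator register acc, and the methods execute and switch_over are hypothetical and do not correspond to any hardware named in this document.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    """Hypothetical CPU model with a single accumulator register."""
    name: str
    in_service: bool = True
    acc: int = 0

    def execute(self, operand: int) -> int:
        # Stand-in for executing one instruction: add operand to acc.
        self.acc += operand
        return self.acc


class RedundantPair:
    """Two CPUs executing the same instructions; only EX output is used."""

    def __init__(self) -> None:
        self.ex = Cpu("A")  # Executive CPU
        self.sb = Cpu("B")  # Stand By CPU

    def execute(self, operand: int) -> int:
        ex_out = self.ex.execute(operand)
        if self.sb.in_service:
            # The SB CPU runs the same instruction; its output is discarded.
            self.sb.execute(operand)
        return ex_out

    def switch_over(self) -> None:
        # The EX CPU fails or is taken out of operation: the former SB CPU
        # becomes the new EX CPU and the old EX CPU goes off line.
        self.ex.in_service = False
        self.ex, self.sb = self.sb, self.ex
```

In this sketch, any instruction executed after switch_over leaves the off-line CPU with a stale register value, which is the loss of synchronization that re-integration must later repair.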
At this point a simple distinction should be drawn between a basic fault tolerant system and a basic multiprocessor system. In general, multiprocessor systems use more than one processor to work on different parts of the same job. Usually, in multiprocessor systems, there is one "manager" processor that divides up the job into smaller tasks and assigns the tasks to the other processors in the multiprocessor system. The managing processor may then begin a task itself or oversee the entire job, trying to optimize the system's performance by ensuring that all of the processors in the system are processing an equal amount of work. Load sharing is a term often used to describe the type of work done by basic multiprocessor systems. In contrast, a basic fault tolerant system does not divide up the work load. Instead, each processor in a fault tolerant system does the entire job so that more than one processor is performing the same job. The same instructions and data are processed by each of the processors in a basic fault tolerant system. In this way, if one processor fails at any time another processor can take its place and take over the processing chore for the failed processor. A multiprocessor system would have faster results on a large problem than a fault tolerant system, but, if one of the processors in each of the above systems failed, the fault tolerant system would be the only one to complete the job without user intervention.
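The contrast between load sharing and redundancy can be sketched as follows. This is a simplified illustration, assuming a job that consists of summing a list of numbers; the function names load_shared_sum and redundant_sum and the workers_alive flag list are hypothetical.

```python
from typing import List, Optional


def load_shared_sum(data: List[int], workers_alive: List[bool]) -> Optional[int]:
    """Multiprocessor style: each worker sums only its own slice of the job."""
    n = len(workers_alive)
    slices = [data[i::n] for i in range(n)]
    if not all(workers_alive):
        # A failed worker's slice is never processed, so the job cannot
        # complete without outside intervention.
        return None
    return sum(sum(s) for s in slices)


def redundant_sum(data: List[int], workers_alive: List[bool]) -> Optional[int]:
    """Fault tolerant style: every worker performs the entire job."""
    results = [sum(data) for alive in workers_alive if alive]
    # Any surviving worker already holds the complete result.
    return results[0] if results else None
```

With two workers and one failure, load_shared_sum returns no result while redundant_sum still returns the full sum, at the cost of doing the whole job on every processor.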
There are many reasons why one of the processors in a fault tolerant system may be temporarily taken out of operation. Maintenance activities, such as repair of a faulty board or upgrading of the operating system, may force temporary "down time". Detection and subsequent correction of a fault or error are examples of other circumstances that may cause a processor in a dual processor system to be temporarily taken "off line". The terms CPU and processor are well known equivalents in the art and will be used interchangeably in this document. No matter what the reason, after either one of the processors has been off line, it will no longer be in synchronization with the processor which remained on line. Synchronization in this context refers to timing and also to having identical data in each processor. The areas of concern, with regard to the data in each processor, are the internal registers and main memory. Main memory, or just memory, refers to the random access memory or RAM associated with each CPU. Main memory may be divided into more than one portion, with each portion having defined addressing limits. Also, each CPU may have more than one "main" memory, in which case each memory would be given a different name to avoid confusion and addressing limits would not be a concern. The state of a CPU is defined by the contents of the internal registers, or hardware registers, of the CPU. It will be understood that, although the state of a CPU may include small memories such as caches and tables which may be used for branch prediction and linking purposes, the contents of the registers are generally accepted as defining the state of a CPU.
Prior to a restart, the processor which was taken off line, or faulty processor as it will be referred to, must be updated with the state of the processor which remained on line, or current processor. In other words, the contents of the current processor's internal registers must be loaded into the internal registers of the faulty processor. The memory of the faulty processor also needs to be loaded with the data in the memory of the current processor. This entire process is called updating or re-integration.
The challenge involved in re-integration is to complete the process in as little time as possible. Time is of the essence in the re-integration process because both CPU's must be involved in the re-integration process. Therefore, system application execution is temporarily stopped. As a result, overall system throughput is reduced. In dual-processor operations, degradation of system performance is directly proportional to the length of time required for re-integration. It is therefore important to provide a method by which a processor in a dual processor system may be updated in as little time as possible.
Two known methods of doing re-integration in dual processor computers can be referred to as "copy main memory" and "copy instruction execution results". In copy main memory, which is illustrated in FIG. 1, the contents of main memory (EX) 12 are copied to the main memory (SB) 22 of the SB CPU 2. The state of the EX CPU 1, which is held in registers 11, is then copied to both main memory (EX) 12 and main memory (SB) 22. A synchronous restart is then initiated by reading the data formerly held in registers 11 into both CPU's in parallel. This method is used, for example, in the IMP and Tandem Integrity fault tolerant systems. The drawback with this method is that it is slow because main memory, which may be an order of magnitude slower than registers, is intimately involved. Further, transfer of the state of the EX CPU 1 requires two main memory operations, a write and a read, since the contents of the internal registers must first be transferred to memory before they can be transferred to the SB CPU 2. The result is a long stop of application execution, which, as mentioned above, degrades system performance.
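The three steps of the copy main memory method can be sketched in software as shown below. This is an illustrative model only: the Cpu class, the register names pc and acc, and the reserved SAVE_AREA address are assumptions introduced for the example and are not taken from FIG. 1.

```python
from typing import Dict, List

SAVE_AREA = 0  # hypothetical reserved address where register state is saved


class Cpu:
    """Hypothetical CPU model: a register set plus its own main memory."""

    def __init__(self, mem_size: int) -> None:
        self.registers: Dict[str, int] = {"pc": 0, "acc": 0}
        self.memory: List[int] = [0] * mem_size


def copy_main_memory_reintegrate(ex: Cpu, sb: Cpu) -> None:
    # Step 1: copy the contents of main memory (EX) to main memory (SB).
    sb.memory[:] = ex.memory

    # Step 2: write the EX register state into BOTH main memories.
    names = sorted(ex.registers)
    for offset, name in enumerate(names):
        value = ex.registers[name]
        ex.memory[SAVE_AREA + offset] = value
        sb.memory[SAVE_AREA + offset] = value

    # Step 3: synchronous restart; each CPU reads the saved state back
    # from its own main memory into its registers.
    for cpu in (ex, sb):
        for offset, name in enumerate(names):
            cpu.registers[name] = cpu.memory[SAVE_AREA + offset]
```

Note that steps 2 and 3 route the register state through main memory twice, once as a write and once as a read, which mirrors the cost described above.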
FIG. 2 illustrates the second known re-integration method, copy instruction execution results. This method copies the results of all instructions that execute in the EX CPU 1 to the SB CPU 2. In this figure, EX CPU 1 is the current processor and SB CPU 2 is the faulty processor. Instruction pipelines 15 and 26 represent the basic functions performed in each CPU, respectively. Stages of a typical pipelined processor include: fetch, decode, execute, memory access and writeback. Writeback unit 152 of the current CPU transfers the results of each executed instruction over update bus 31 to writeback unit 262 of the faulty CPU. Data from the registers and main memory of EX CPU 1 are also transferred through the writeback units of each processor. This method requires extra hardware in the writeback unit of each processor in order to transfer all of the required data.
In the copy instruction execution results method, the microinstruction execution unit in the faulty CPU receives only an address to its control memory from the current CPU. Consequently, the microprogram in both CPU's must be the same. This means that the faulty CPU is forced to follow the current CPU regardless of the contents in the faulty CPU's control memory. Typically a microprogram is stored in a read only memory, also known as a control store, of a computer. Microprograms control the manner in which the hardware of a CPU reacts to the instructions of an application that are executed in the instruction pipeline of the CPU.
A more detailed view of the instruction execution method can be seen in FIG. 3, which shows that result bus 265 of the faulty CPU can receive data from one of two sources. During a re-integration operation, result bus 265 receives data from instruction path 101 of the current CPU, through update bus 31 and MUX 29. During normal operations, result bus 265 receives data from instruction path 201 of its own CPU. Re-integration of the two CPU's is signaled to begin when the working state of the faulty CPU is changed from Stand By/Halt (SB/HA) to Stand By/Update (SB/UP). The processors are said to be in the working state SB/UP when the result bus 265 in the faulty CPU is receiving data from the instruction path 101 of the current CPU over update bus 31. APZ212-20 is an example of a dual-processor system that uses the copy instruction execution results method.
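The source selection performed by MUX 29 can be modeled as a simple two-input selector keyed on the working state. The sketch below is a hedged illustration: the WorkingState enumeration, the NORMAL member, and the function result_bus_source are hypothetical names, and the behavior of the multiplexer while the CPU is in SB/HA is not described in the text, so the sketch simply leaves the selector on the CPU's own instruction path in that state.

```python
from enum import Enum


class WorkingState(Enum):
    SB_HALT = "SB/HA"
    SB_UPDATE = "SB/UP"
    NORMAL = "normal"  # hypothetical label for ordinary operation


def result_bus_source(state: WorkingState, own_path: int, update_bus: int) -> int:
    """Return the value driven onto the faulty CPU's result bus."""
    if state is WorkingState.SB_UPDATE:
        # Re-integration: take results from the current CPU via the update bus.
        return update_bus
    # Otherwise the selector stays on the CPU's own instruction path
    # (an assumption for SB/HA, which the text does not cover).
    return own_path
```

Changing the state argument from SB_HALT to SB_UPDATE is the software analogue of the signal that starts re-integration.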
The drawback with this system is that a large volume of temporary information is copied to the faulty CPU. Most of the information copied to the faulty CPU will be overwritten by new information almost immediately. To support the large volume of information that must be transferred, the bandwidth of update bus 31 must be made equally large, which requires extra traces on a printed circuit board and extra pins on each CPU. This leads to complicated electrical and mechanical designs of systems which use this method of re-integration.
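The overwriting effect can be made concrete with a small sketch that replays a stream of instruction results into a register file and compares the number of values transferred with the number that survive. The function name replay_updates and the register names in the usage example are hypothetical, and the figures are illustrative rather than measurements of any real system.

```python
from typing import Dict, List, Tuple


def replay_updates(updates: List[Tuple[str, int]]) -> Tuple[Dict[str, int], int]:
    """Apply every transferred result in order.

    Returns the final register contents and the number of values that had
    to cross the update bus to produce them.
    """
    registers: Dict[str, int] = {}
    for reg, value in updates:
        # A later write to the same register overwrites the earlier one.
        registers[reg] = value
    return registers, len(updates)


# Usage: four results are transferred, but only two values survive.
final_state, transferred = replay_updates(
    [("r1", 1), ("r1", 2), ("r2", 5), ("r1", 3)]
)
```

In this example half of the transferred values are overwritten before they could ever matter to the faulty CPU, yet the update bus still had to carry all of them.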