The invention relates generally to multiprocessor computer systems and deals more particularly with the detection of errors in microinstructions fetched from main memory and, in the case of an error, with reloading the proper microinstruction without interrupting operation of the other processors.
The problem of ensuring that a microinstruction word to be executed by a processing unit is valid is one of the most important considerations when designing a processing system. If the microinstruction to be executed is invalid, then the processing unit will not be able to execute it and may stop functioning. The problem is particularly acute in systems with multiple processing units such as the IBM/390 systems since if one of the processing units stops working because it is trying to execute an invalid instruction, then data may be lost and the integrity of the data required by the other processing units cannot be guaranteed.
The problem is becoming more acute as the on-chip memories in which the microinstructions are stored become smaller. Even though shielding is incorporated into such memories, the loss of data due to alpha particles changing the value of some bits in the memory (so-called "soft errors") has to be considered by the designers of processing systems.
One well known method in the prior art for ensuring the validity of a microinstruction is to add a parity bit to the microinstruction. Before the microinstruction is executed, its parity is checked and, if it is found to be incorrect, remedial action is carried out.
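As a rough illustration of the parity scheme described above, the following Python sketch stores a parity bit alongside a word and checks it before the word is used. The 8-bit word width, the even-parity convention, and the names `parity_bit` and `is_valid` are assumptions made for the example only; they are not taken from any of the cited documents.

```python
def parity_bit(word):
    """Even-parity bit: 1 if the word holds an odd number of 1 bits, else 0."""
    return bin(word).count("1") & 1


def is_valid(word, stored_parity):
    """Check a word read from memory against the parity bit stored with it."""
    return parity_bit(word) == stored_parity


# Hypothetical 8-bit microinstruction and the parity bit written with it.
WORD = 0b10110010
STORED_PARITY = parity_bit(WORD)

# A soft error flips one bit: the parity check detects it.
single_flip = WORD ^ (1 << 3)

# Two flipped bits leave the parity unchanged: the check misses it.
double_flip = WORD ^ 0b11

if __name__ == "__main__":
    print(is_valid(WORD, STORED_PARITY))
    print(is_valid(single_flip, STORED_PARITY))
    print(is_valid(double_flip, STORED_PARITY))
```

A single parity bit detects any odd number of flipped bits but neither locates nor corrects them, which is why a separate remedial action is needed once an error is found.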
U.S. Pat. No. 4,231,089 describes one type of remedial action that may be carried out. In this document a method and apparatus are disclosed which enable an erroneous microinstruction to be rewritten before it is executed. In the described system, the microinstruction is first written from memory into a control register. A parity network coupled to the control register determines whether a valid microinstruction is being executed. If this is not the case, then the clock pulses to the data paths coupled to the control register are inhibited and a clock stop recovery is carried out. The original microinstruction is fetched from secondary storage, such as tape or disc, and rewritten into the microinstruction memory. After rewriting the microinstruction memory, the signal inhibiting the clock pulses is removed, thereby allowing the rewritten microinstruction to be executed.
The processing system disclosed in U.S. Pat. No. 4,231,089 comprises a single processor, and thus stopping the clock in order to reload the microinstruction will not affect the operation of any other processors. In a multi-processor system, however, in which every processor has its own cache (or even several caches in a hierarchical system), the effect of stopping the clock of one processor could be catastrophic. The system could no longer be certain of the integrity of the data in the main memory, since data from the cache memories associated with the stopped processor will not be transferred back to the main memory and the validity bits associated with the lines of data in the cache memories will not be set or reset. There is therefore a risk that customer data will be lost, which could cause severe problems for the user of the processing system. One example in which the loss of data would have serious consequences is the use of such a processor in a bank cash dispensing system.
In addition to the loss of integrity of the data, the time required to restart the processor may be significant. If the clock in one of the processing units is stopped, the processing unit has to be restarted using a service processor. This may take several seconds, a time period which is unacceptable for many applications. Furthermore, peripheral units such as input/output subsystems and channels connected to the processor may time out if the clock in one of the processing units is stopped. The peripheral units may then also need to be restarted, further delaying the restart of the system.
One means of overcoming these problems would be to add error checking and correcting bits to the microinstruction. This would allow an error checking and recovery mechanism to be built into the processing system to check the microinstruction before it was executed and, if an error was detected, to correct it. Such a system, however, would require larger memories to be constructed to store both the microinstruction and its associated check bits. In addition, the execution of an invalid microinstruction would be delayed whilst it was corrected.
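To make the storage cost of such check bits concrete, the following Python sketch uses a single-error-correcting Hamming code as one possible error correcting scheme. The choice of a Hamming code, the word widths, and the names `check_bits_needed`, `encode`, `syndrome` and `correct` are assumptions made for illustration; the text does not prescribe a particular code.

```python
def check_bits_needed(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def syndrome(code):
    """XOR of the 1-based positions of all set bits; 0 means no error seen."""
    s = 0
    for pos in range(1, len(code)):
        if code[pos]:
            s ^= pos
    return s


def encode(data, data_bits):
    """Return the codeword as a list of bits; index 0 is unused."""
    r = check_bits_needed(data_bits)
    n = data_bits + r
    code = [0] * (n + 1)
    bit = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two: a data position
            code[pos] = (data >> bit) & 1
            bit += 1
    s = syndrome(code)
    for i in range(r):  # check bits sit at positions 1, 2, 4, ...
        code[1 << i] = (s >> i) & 1
    return code


def correct(code):
    """Return a copy of the codeword with a single flipped bit repaired."""
    fixed = list(code)
    s = syndrome(fixed)
    if s:
        fixed[s] ^= 1  # the syndrome is the position of the flipped bit
    return fixed


if __name__ == "__main__":
    for width in (4, 32, 64):  # hypothetical word widths
        print(width, "data bits need", check_bits_needed(width), "check bits")
    good = encode(0b1011, 4)
    bad = list(good)
    bad[5] ^= 1  # simulate a soft error in one stored bit
    print(correct(bad) == good)
```

Under these assumptions a 64-bit word needs 7 additional check bits where simple parity needs 1, and the correction step sits between reading the word and executing it, which corresponds to the memory-size and delay drawbacks noted above.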
U.S. Pat. No. 4,901,228 teaches a pipelined cache system which uses back-up address registers for providing error recovery while continuing pipeline processing. The cache memory includes an address back-up register which stores the physical address of the data in which an error is detected and backs up the address register. The cache memory further includes a data array address back-up register which stores the data address of the data in which errors are generated and backs up the data array address register. The cache memory also includes a request code back-up register which stores a preceding request code at the time that the errors are generated. When an error is detected by the system controller, data is read from the main memory according to the back-up registers and other registers of the system controller.
The pipelined cache system taught in this patent has the disadvantage that it requires a large number of back-up registers plus accompanying circuitry for ensuring that the registers are backed up correctly. This adds to the complexity of the system and the space required on the chip.
The pipelined cache system also has the disadvantage that it requires two cycles in order to generate the signal indicating that an error has been detected. In memory systems such as those used in the low-end IBM/390 machines, only one cycle is required for obtaining data from a cache memory and thus a signal indicating an error in the data must be generated in the same cycle. Using the teachings of U.S. Pat. No. 4,901,228, this is not possible.
EP-A-0 178 670 discloses a system for providing microinstruction read error detection and recovery by permitting the unchecked microinstruction to be executed in parallel with microinstruction read error detection. The microinstruction execution is aborted upon detection of an error prior to making any change in the state of the system which would prevent the retrying of the microinstruction after it is read again from memory. Because the reading of the next microinstruction is initiated during the execution of the current microinstruction, whose validity is checked in parallel with its execution, the system discards the next microinstruction before retrying the current one. The retry consists of rereading the current microinstruction and then re-executing it. While the current microinstruction is being re-executed, the rereading of the next microinstruction is initiated so that it will be available for execution following the re-execution of the current microinstruction. The disclosed system does not, however, teach a method for recovery should the microinstruction in the control store itself be in error.
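The difference between a read error and an error in the stored microinstruction can be sketched with a small software model. This is a simplified illustration, not the circuit of EP-A-0 178 670: the `ControlStore` class, the `run` function, the use of a single parity bit, the one-retry limit and the word values are all assumptions, and the parallel execute-and-check of the hardware is modelled here as a check before the result is committed.

```python
def parity(word):
    """Even-parity bit of a word."""
    return bin(word).count("1") & 1


class ControlStore:
    """Hypothetical control store holding (word, parity) pairs."""

    def __init__(self, words):
        self.cells = [(w, parity(w)) for w in words]
        self.transient = set()  # addresses whose next read is corrupted once

    def read(self, addr):
        word, p = self.cells[addr]
        if addr in self.transient:
            self.transient.discard(addr)
            word ^= 1  # one-off read error; the stored copy is intact
        return word, p


def run(store, max_retries=1):
    """Execute all words in order; return (committed words, failing address)."""
    committed = []
    addr = 0
    while addr < len(store.cells):
        for _attempt in range(max_retries + 1):
            word, p = store.read(addr)
            if parity(word) == p:
                committed.append(word)  # state changes only after the check
                break
            # Error detected: abort before any state change, then reread.
        else:
            return committed, addr  # rereading did not help
        addr += 1
    return committed, None


if __name__ == "__main__":
    words = [0b1010, 0b0110, 0b1111]

    store = ControlStore(words)
    store.transient.add(1)  # read error on the second word
    print(run(store))

    damaged = ControlStore(words)
    w, p = damaged.cells[1]
    damaged.cells[1] = (w ^ 1, p)  # the stored word itself is corrupted
    print(run(damaged))
```

In this model a transient read error is cleared by the retry, whereas a corrupted entry in the store fails on every reread, which is the case the cited document leaves unaddressed.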
The object of the invention is to design a system and method for correcting microinstructions which overcome the disadvantages of the prior art systems.