1. Technical Field
The present invention relates to large scale data processing systems and, more specifically, to error servicing circuitry and control logic in a data processing machine.
2. Description of Related Art
Large scale data processing systems, which include high speed digital computers, are comprised of a plurality of units such as a central processing unit (CPU), CPU storage unit, system control unit and main storage unit. These units may be physically located on one or several printed circuit boards, and each printed circuit board contains a plurality of integrated circuits. Each integrated circuit contains from hundreds to hundreds of thousands of gates, where a "gate" is defined as a collection of transistors which performs a logic function.
Approximately one third of the hardware which comprises the data processing system is dedicated to error detection and correction. This circuitry, which is transparent to a user, includes error detection devices and error latches to facilitate location of an error, and error recovery hardware and software which perform error analysis and correction, where possible. A significant portion of the error servicing circuitry operates when the system clocks are disabled.
When an error is detected, the error is latched and a signal from the latch is sent to a clock control unit (CCU) and the error servicing unit (ESU), prompting the clock control unit to disable system clocks. The stopping of the system clock freezes operation of the data processing system and permits a recovery algorithm to be implemented by the error servicing unit to process and correct, where possible, the detected error(s). Conventional data processing systems may operate at system clock speeds of 100 MHz or greater. As a result, when an error occurs, it may propagate throughout the system, causing widespread corruption of data before the system clock is disabled. Therefore, great emphasis is placed on disabling the system clock as rapidly as possible in response to the detection of an error.
Referring to FIG. 1, a timeline is shown illustrating critical time periods involved in the detection of an error, the processing of that error, and the maximum permissible time period for the system clock to be disabled. The timeline 15 represents a chronological progression in time from left to right. At the moment 10, an error is detected and a generated error signal is sent to the error servicing unit. The period after the detection of a fault but prior to disabling the system clock is designated as period A. This period is significant because, during it, the system clock continues to run and erroneous data continues to be processed. It is important that period A be as short as possible to minimize fault propagation in the data processing system.
At moment 11, the error servicing unit disables the system clock and the system clock off period, designated by letter C, begins. In conventional data processing systems, this period can be no greater than a critical period of approximately 1 second. This is because the CPU of the data processing system is in communication with input/output facilities which require responses within specific time periods not exceeding the critical period. During the period C, an error servicing algorithm interrogates the error latches to determine location of an error and begins error processing to recover incorrectly processed data and to restore the data processing system to an error free state. The period B represents the time period required by the error servicing unit to detect the initial location of an error. Period D represents the period of time required by the error servicing algorithm to recover data and restore the system to a proper state. It is very important that period B be as short as possible because rapid determination of the location of an error provides the error servicing algorithm more time within period C to recover and restore data. A reduction in period B reduces the overall period C, thus minimizing the probability that the critical period will be exceeded. How the prior art has addressed these problems will now be discussed.
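By way of illustration only, the relationship between periods B, C and D described above may be sketched as simple arithmetic. The sketch assumes that period C consists of period B followed by period D and takes the approximate 1 second critical period from the text; the function name and the sample value for period B are hypothetical.

```python
# Approximate critical period stated in the text; the exact value is system dependent.
CRITICAL_PERIOD_S = 1.0


def recovery_budget(period_b_s: float, critical_s: float = CRITICAL_PERIOD_S) -> float:
    """Return the time left for period D if period C must not exceed the critical period.

    Assumes period C = period B + period D, as suggested by the timeline description.
    """
    if period_b_s < 0:
        raise ValueError("period B cannot be negative")
    # Any time spent locating the error (period B) is unavailable for recovery (period D).
    return max(0.0, critical_s - period_b_s)


# Hypothetical figure: locating the error takes 0.25 s, leaving 0.75 s for recovery.
remaining = recovery_budget(0.25)
```

Under this assumption, every reduction in period B is returned directly to period D as additional recovery time.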
Referring to FIG. 2, a schematic view of a portion of a data processing system is shown, illustrating both a network of error detection devices and error latches, and a plurality of interconnected functional gates. The two primary purposes of FIG. 2 are to illustrate (1) the hierarchical structure of the error detection and latching network and (2) the speed at which one erroneous data bit may spread throughout the system, corrupting data at those locations to which it propagates. Discussing the hierarchical structure of the error latching network first, block 51 will be used, for purposes of illustration, to represent a memory device which outputs a signal onto line 91. Block 61 represents a parity checker. Note that a parity checker is used merely for purposes of this example, and that there are several other types of detection devices which could be substituted therefor. If the output of memory device 51 contains a parity error, then the parity checker 61 will detect this error and generate an error signal which is latched by error history latch 71. The output of error history latch 71 is connected to the input of a second stage error history latch 74 that also receives inputs from several other first stage error history latches, including error history latches 77 and 78. The output of the second stage error history latch 74 is, in turn, input to a third stage error history latch 81 to which error history latches 75 and 76 are also input. The output of the third stage error history latch 81 is connected to a fourth stage error history latch 82 (extended to the chip 90) which is further connected to a fifth stage error history latch 83 as shown. Each of the error history latches 71-78 and 81-83 is clocked by a system clock signal. This hierarchy is continued to a final latch register (not shown) for transmitting an error signal directly to the ESU. To better understand the hierarchical structure, a general overview is now presented.
The error detection and latching network, illustrated in part in FIG. 2, operates in a vast plurality of blocks. At the lowest level, a plurality of error history latches are bundled together to form an EHL group. A chip may contain several EHL groups and at a second lowest level, all the EHL groups for a particular chip are bundled together. At a third level, a plurality of chips are bundled together to form a section and at a fourth level, the plurality of sections for a particular unit are bundled together to form the error signal line for that unit. At a highest level, all of the error signal lines for a plurality of units on a unitary circuit board are bundled together to form a board level error signal which is propagated to the ESU and the CCU.
Applying this to the example of FIG. 2, an error detected by parity checker 61 will be gated through a series of error history latches on the chip 90 within which it resides, and further through a plurality of error history latches at the section, unit and board levels. In a conventional data processing system, this may entail, on average, a path of approximately 6-7 latches, thus necessitating 6-7 clock cycles before the CCU receives a generated error signal, a disadvantageously long delay. Once the error signal reaches the clock control unit, several further clock cycles are required to "early-up" the system clock to determine an appropriate time to stop all the system clocks propagating in the system.
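The latency of the hierarchical latch path may be illustrated with a minimal sketch, offered as a simplified model rather than a description of any actual circuit. It assumes that each error history latch is a sticky bit that captures the output of the preceding stage once per system clock cycle. The function names are hypothetical, and the 100 MHz figure is the clock speed mentioned earlier in the text.

```python
def cycles_to_reach_ccu(num_stages: int) -> int:
    """Count the clock cycles needed for an error to pass through a chain of latches."""
    if num_stages < 1:
        raise ValueError("the chain needs at least one latch")
    latches = [False] * num_stages
    cycles = 0
    while not latches[-1]:
        # Update from the last stage to the first so that each stage captures
        # the value its predecessor held on the previous clock edge.
        for i in range(num_stages - 1, 0, -1):
            latches[i] = latches[i] or latches[i - 1]
        latches[0] = True  # the detector's error signal is held at the first latch
        cycles += 1
    return cycles


def latency_ns(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count into nanoseconds for a given clock frequency."""
    return cycles * 1e9 / clock_hz


# A 7-latch path at an assumed 100 MHz clock: 7 cycles, or about 70 ns, during
# which the system continues to process erroneous data (part of period A).
delay = latency_ns(cycles_to_reach_ccu(7), 100e6)
```

In this model the delay grows linearly with the depth of the hierarchy, which is why a path of 6-7 latches translates directly into 6-7 cycles of continued error propagation.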
During this period (period A), between the detection of an error and disabling the system clock, erroneous data is being processed by the machine. To illustrate propagation of the error and the detection of this error by subsequent error detection devices, we will assume that the output of memory 51 is connected to a recursive adder 52 and off chip to a functional gate 54 located on chip 92. One clock cycle after the erroneous data was detected by the parity checker 61, it will be detected by parity checker 64. Detection of this error will cause the parity checker 64 to generate an error signal which is propagated to error history latch 72 and, in turn, propagated to second stage error history latch 75. The output of second stage error history latch 75 is input to error history latch 81 and propagates as explained above with reference to error history latch 74. Since the adder 52 is recursive, as are those used in random number generating and digital signal processing circuits, the output of the adder 52 is fed back as an input. Therefore, the error first detected by parity checker 61 will be continually output from the adder 52.
Continuing with our example, the output of adder 52 is connected to a multiplier 53. Two clock cycles after the error was detected by parity checker 61, it is detected by parity checker 66. Parity checker 66 generates an error signal in response thereto which is propagated to error history latch 73. On the next error clock pulse, the error history latch 73 propagates this error signal to second stage error history latch 76, from which it is propagated to error history latches 81-83 as described above. Thus, not only will an error propagate to subsequent functional gates and corrupt the data processed therein, but an error signal may also be generated at each subsequent gate. Referring to chip 92, the functional gate 54 receives data output on line 91. Therefore, the error detected by parity checker 61 will propagate to gate 54. This error will, in turn, propagate to gates 56 and 57, and then again off chip 92 to gate 58. In the small isolated portion of the system shown in FIG. 2, the error has propagated within three clock cycles to six gates, 52-58, corrupting data therein. The longer it takes for the error servicing unit to receive an error signal and generate a clocks-off signal, the further erroneous data will propagate. The further erroneous data propagates, the longer the time required during period D for the error recovery algorithm to recover the appropriate machine state. Thus, it is critical that period A be as short as possible.
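The spread of corrupted data in the example of FIG. 2 may be sketched as a breadth-first traversal in which the error advances one gate per clock cycle. The connectivity table below is an assumption reconstructed from the description. In particular, the text does not state which of gates 56 and 57 drives gate 58, so the link from 57 to 58 is chosen arbitrarily. The names `FANOUT` and `corrupted_after` are hypothetical.

```python
from collections import deque

# Assumed fan-out of each block, keyed by the reference numerals of FIG. 2.
FANOUT = {
    51: [52, 54],  # memory 51 drives adder 52 and, off chip, gate 54
    52: [52, 53],  # adder 52 is recursive and also drives multiplier 53
    53: [],
    54: [56, 57],
    56: [],
    57: [58],      # assumption: the text does not say which gate feeds gate 58
    58: [],
}


def corrupted_after(source: int, cycles: int, fanout=FANOUT) -> dict:
    """Map each corrupted gate to the clock cycle at which the error first reaches it."""
    arrival = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if arrival[node] == cycles:
            continue  # the clocks are stopped; the error spreads no further
        for nxt in fanout.get(node, []):
            if nxt not in arrival:
                arrival[nxt] = arrival[node] + 1
                queue.append(nxt)
    del arrival[source]  # report only downstream gates, not the faulty source
    return arrival


# Three cycles after the fault at memory 51, six gates hold corrupted data.
reached = corrupted_after(51, 3)
```

With this assumed connectivity, stopping the clocks after one cycle confines the error to gates 52 and 54, whereas three cycles allow it to reach all six gates.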
A second undesirable feature of the prior art is the relatively long process that determines which data location caused the generation of an error signal (period B). Once the error servicing unit has successfully turned off the system clock, a determination is made of the first location which generated an error signal. This is done by a technique known as scan out, which is well known in the art. In conventional data processing system scan out procedures, each of the error history latches is polled to determine if an error signal is present. The polling process begins at a predetermined location and polls sequentially until all of the error history latches have been scanned. Since a conventional large scale data processing system may contain approximately 10,000 error history latches, the error servicing unit must necessarily poll 10,000 locations each time an error signal is detected. This requires a significant amount of time and disadvantageously extends the period B. Furthermore, an extension of period B detracts from the amount of time available in period D for the recovery algorithm to recover data and restore the system to a correctly functioning state.
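The sequential scan out procedure described above may be sketched as a linear pass over every error history latch. This is a simplified software model, not the actual scan hardware: the latches are represented as a list of booleans, and the function name and the index of the set latch are hypothetical.

```python
def scan_out(latches):
    """Poll every latch in order and return the indices of set latches and the poll count."""
    polls = 0
    set_indices = []
    for index, is_set in enumerate(latches):
        polls += 1  # every latch is polled, whether or not it holds an error
        if is_set:
            set_indices.append(index)
    return set_indices, polls


# Approximately 10,000 latches, as in the text, with a single error latched
# at a hypothetical position.
latches = [False] * 10_000
latches[4321] = True
found, polls = scan_out(latches)
```

Because the scan does not stop at the first set latch, the cost in this model is always proportional to the total number of latches, regardless of where the error occurred.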