In large data processing systems, locating the circuits that cause errors is a difficult task. One difficulty is that the location of data changes each cycle of the machine. Once an error is made, the error tends to become propagated to different locations throughout the machine. Further, in each subsequent cycle after the error-causing cycle, the original error frequently causes many more errors. This propagation and proliferation of errors tends to mask the data location which originally caused the error.
One error checking and locating mechanism is described in U.S. Pat. No. 4,132,243, entitled "Data Processing System and Information Scan Out Employing Check-sums for Error Detection", assigned to the same assignee as the present invention.
In that patent, the data processing system includes an instruction-controlled principal apparatus and secondary apparatus for independently addressing and accessing points within the principal apparatus. A check-sum generator generates an actual check-sum dependent upon the data values of selected points accessed within the principal apparatus. The particular set of points accessed is controlled by the secondary apparatus. The secondary apparatus stores an expected check-sum for comparison with the actual check-sum. If a comparison indicates that the actual check-sum differs from the expected check-sum, a fault is indicated within the set of points used in forming the check-sum.
Once a fault has been detected through comparisons of actual and expected check-sums, it is possible to further analyze the set of points which entered into the check-sum to determine what subset of points is the source of the fault. The set of points or the subset of points accessed to form a check-sum is controlled by the secondary apparatus.
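The check-sum comparison and the narrowing to a subset of points can be illustrated with a minimal sketch. The sketch is not taken from U.S. Pat. No. 4,132,243: the XOR check-sum function, the halving search, the single-fault assumption, and the names check_sum and locate_fault are all assumptions chosen for illustration.

```python
from functools import reduce

def check_sum(values):
    """Fold data values into one check-sum (XOR is an assumed choice)."""
    return reduce(lambda a, b: a ^ b, values, 0)

def locate_fault(actual, expected, points):
    """Return the faulty point, or None if the check-sums agree.

    actual and expected map point names to data values. The search
    assumes a single faulty point and halves the set of points,
    comparing the actual and expected check-sums of each subset.
    """
    def differs(subset):
        return (check_sum(actual[p] for p in subset)
                != check_sum(expected[p] for p in subset))

    if not differs(points):
        return None
    while len(points) > 1:
        mid = len(points) // 2
        left = points[:mid]
        # Keep whichever half still shows a check-sum mismatch.
        points = left if differs(left) else points[mid:]
    return points[0]

expected = {"p0": 0x3, "p1": 0x5, "p2": 0x9, "p3": 0x6}
actual = dict(expected, p2=0x8)          # one point holds a wrong value
faulty = locate_fault(actual, expected, list(expected))
```

In this sketch an expected value is needed for every set and subset examined, which reflects the storage burden that grows with the number of error-free states to be covered.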
While the check-sum mechanism of U.S. Pat. No. 4,132,243 has proved very useful, it requires storage of a large number of expected check-sums to reflect the many error-free states of the computer. Furthermore, improvements and changes to the circuitry and operation of the system mandate that the expected check-sums change. Accordingly, keeping track of the expected check-sums is an undesirable burden.
Recent data processing systems have included diagnostic scan out capabilities which help locate errors in data processing systems. One such scan out system is described in U.S. Pat. No. 4,244,019 entitled "Data Processing System Including A Program-Executing Primary System" assigned to the same assignee as the present invention.
The 4,244,019 patent provides a mechanism for scan out of all designed locations within a data processing system, independently of the normal data paths of that system. This scan out ability is of significant value in locating errors, and each location which has an error can be examined independently. However, the ability to examine thousands of locations within a data processing system does not assist in a quick location of the errors without further information as to which locations may be the cause of the errors.
One error-tracking unit within a data processing system is described in prior U.S. patent application entitled ERROR TRACKING APPARATUS IN A DATA PROCESSING SYSTEM, Ser. No. 527,173, filed Aug. 26, 1983, now abandoned, invented by Venkatramiah Venkatesh and Robert M. Maier, which is owned by the same assignee as the present invention and was owned by the same assignee when both inventions were made. According to the prior application Ser. No. 527,173, each data location to be checked for error and to be located in the case of an error is provided with error detection circuitry. Each data location is additionally provided with an error history register for storing an error signal. When the error-detecting circuit detects an error, the error history register is enabled to store the error signal. Whenever an error is detected, the error history registers are inhibited from further change. The error detection also causes a machine check signal which, in general, prevents the data processing system from normal processing.
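The latch-and-freeze behavior of the error history registers can be sketched as follows. This is an idealized model in which every register is frozen in the same cycle as the first detected error; the class name ErrorHistory and its attributes are hypothetical and do not come from the prior application.

```python
class ErrorHistory:
    """Idealized model: one error history register per data location."""

    def __init__(self, locations):
        self.latched = {loc: False for loc in locations}
        self.frozen = False          # inhibits further register changes
        self.machine_check = False   # stops normal processing

    def clock(self, errors):
        """Advance one cycle; errors is the set of locations in error."""
        if self.frozen:
            return                   # registers no longer change
        hits = [loc for loc in errors if loc in self.latched]
        for loc in hits:
            self.latched[loc] = True
        if hits:
            # Assumed ideal case: freeze in the cycle of detection.
            self.frozen = True
            self.machine_check = True

history = ErrorHistory(["a", "b", "c"])
history.clock(set())      # no error this cycle
history.clock({"b"})      # error at location b is latched, then frozen
history.clock({"c"})      # later error at c is not latched
```

After these three cycles only the register for location b holds an error signal, so the frozen registers point at the first location that failed rather than at later, propagated errors.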
Further, in the prior application Ser. No. 527,173, the data locations to be error detected and error located are organized into a hierarchy of sets and subsets within the data processing system. In a three-level hierarchy the subsets are named sections, blocks, and units. The data locations in a section have their error detecting signal lines combined and encoded to form a section error signal. The section error signals from a plurality of sections in turn are combined to form a block error signal. A plurality of block error signals are combined to form a unit error signal. Groups of error signals from sections, blocks and units are encoded at each level to reduce the number of error signals employed.
In the error tracking system taught by the prior application Ser. No. 527,173, under the condition that a single data location causes an error, the error signal will be propagated through the subsets. For example, a data location error signal will cause a section error signal which in turn will cause a block error signal which in turn will cause a unit error signal. The error signals identify where in the system the error is located. The unit error signal identifies one of a number of units, the block error signal identifies one of a number of blocks in the unit, and the section error signal identifies one of a number of sections in a block.
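The way the three levels of error signals identify a location can be sketched as below. The nested-list layout, the index encoding, and the function names encode and track are assumptions made for illustration; the prior application's actual encoding is not reproduced here.

```python
def encode(lines):
    """Encode error lines as (error flag, index of first active line)."""
    for index, active in enumerate(lines):
        if active:
            return True, index
    return False, None

def track(system):
    """Return (unit, block, section) of the error, or None.

    system[unit][block][section] is a list of per-location error lines.
    """
    unit_lines, unit_info = [], []
    for unit in system:
        block_lines, section_codes = [], []
        for block in unit:
            # A section error line is active if any location in it erred.
            section_lines = [any(section) for section in block]
            block_err, section_code = encode(section_lines)
            block_lines.append(block_err)
            section_codes.append(section_code)
        unit_err, block_code = encode(block_lines)
        unit_lines.append(unit_err)
        unit_info.append((block_code, section_codes))
    any_err, unit_code = encode(unit_lines)
    if not any_err:
        return None
    block_code, section_codes = unit_info[unit_code]
    return unit_code, block_code, section_codes[block_code]

# 2 units x 2 blocks x 3 sections x 4 data locations, all error free.
system = [[[[False] * 4 for _ in range(3)] for _ in range(2)]
          for _ in range(2)]
system[1][0][2][3] = True            # a single data location errs
track_result = track(system)
```

With a single failing location, the three encoded values give the unit, the block within that unit, and the section within that block, so far fewer signals need to be examined than there are data locations.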
For optimum operation of the system in prior application Ser. No. 527,173, the error history registers must be frozen in the same cycle that an error is detected. In this way, propagation of errors throughout the system is minimized. The grouping and encoding of locations to be checked provides a track which allows the error location to be easily identified.
In practice, however, freezing error history registers in the same cycle that an error is detected can be difficult to implement because all error history registers have to be notified upon the occurrence of an error at any one of the error history registers. Thus, in a large system, it typically takes more than one cycle to freeze all error history registers. As a result, additional error history registers may be latched due to a single error as it propagates through the system before all registers are frozen.
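The effect of a multi-cycle freeze can be illustrated with a toy model. The model assumes a linear chain of registers in which an error moves one stage downstream per cycle; this propagation rule, the function name latched_after_freeze, and its parameters are hypothetical simplifications rather than a description of any actual machine.

```python
def latched_after_freeze(n_registers, origin, freeze_delay):
    """Return the registers latched when the freeze takes effect.

    An error starts at register origin and, in this assumed model,
    spreads to the next register each cycle until freeze_delay
    cycles have passed.
    """
    latched = {origin}
    for _ in range(freeze_delay):
        latched |= {r + 1 for r in latched if r + 1 < n_registers}
    return sorted(latched)

same_cycle = latched_after_freeze(8, 3, 0)   # freeze in the error cycle
two_cycles = latched_after_freeze(8, 3, 2)   # freeze two cycles late
```

In this model a same-cycle freeze leaves only the originating register latched, while each cycle of delay latches one further register, which blurs the identification of the original error location.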
Further, when the error history registers are frozen in one cycle, or a few cycles, the system may continue to process data for a number of additional cycles. During this window in which a number of system cycles may occur after the error history registers are frozen, any independent errors that may occur may not be latched and therefore may go undetected.