1. Field of the Invention
The invention relates generally to computer systems. The invention relates more specifically to a method and apparatus for handling error conditions in a digital computer whose data moves through a data switching unit.
2a. Cross Reference to Related Applications
The following copending U.S. patent application(s) is/are assigned to the assignee of the present application, is/are related to the present application and its/their disclosure(s) is/are incorporated herein by reference:
(A) Ser. No. 07/670,289 entitled "SCANNABLE SYSTEM WITH ADDRESSABLE SCAN RESET GROUPS", by Robert Edwards et al, which was filed Mar. 15, 1991 [Atty. Docket No. AMDH7954];
(B) Ser. No. 07/813,891 filed Dec. 23, 1991 by Christopher Y. Satterlee et al, and entitled, IMPROVED METHOD AND APPARATUS FOR LOCATING SOURCE OF ERROR IN HIGH-SPEED SYNCHRONOUS SYSTEMS [Attorney Docket No. AMDH7952].
2b. Cross Reference to Related Patents
The following U.S. Patents are assigned to the assignee of the present application and are further incorporated herein by reference:
(A) U.S. Pat. No. 3,840,861, DATA PROCESSING SYSTEM HAVING AN INSTRUCTION PIPELINE FOR CONCURRENTLY PROCESSING A PLURALITY OF INSTRUCTIONS, issued to Amdahl et al, Oct. 8, 1974;
(B) PROGRAM EVENT RECORDER AND DATA PROCESSING SYSTEM, U.S. Pat. No. 3,931,611, issued to Grant et al, Jan. 6, 1976;
(C) U.S. Pat. No. 4,244,019, DATA PROCESSING SYSTEM INCLUDING A PROGRAM-EXECUTING SECONDARY SYSTEM CONTROLLING A PROGRAM-EXECUTING PRIMARY SYSTEM, issued to Anderson et al, Jan. 6, 1981;
(D) U.S. Pat. No. 4,661,953, ERROR TRACKING APPARATUS IN A DATA PROCESSING SYSTEM, issued to Venkatesh et al, Apr. 28, 1987;
(E) U.S. Pat. No. 4,679,195, ERROR TRACKING APPARATUS IN A DATA PROCESSING SYSTEM, issued to Dewey, Jul. 7, 1987;
(F) U.S. Pat. No. 4,685,058, TWO-STAGE PIPELINED EXECUTION UNIT AND CONTROL STORES, issued to Lee et al, Aug. 4, 1987;
(G) U.S. Pat. No. 4,752,907, INTEGRATED CIRCUIT SCANNING APPARATUS HAVING SCANNING DATA LINES FOR CONNECTING SELECTED DATA LOCATIONS TO AN I/O TERMINAL, issued to Si et al, Jun. 21, 1988;
(H) U.S. Pat. No. 4,819,166, MULTI-MODE SCAN APPARATUS, issued to Si et al, Apr. 4, 1989.
3. Description of the Related Art
When an error occurs somewhere within a high-speed digital computer, it takes a finite amount of time to: (1) detect the error, (2) set an error flag at the point of detection, (3) transmit a message to a central control unit indicating that the error has occurred, (4) determine how to handle the error condition, and (5) transmit control signals from the central control unit to affected parts of the computer informing them of the error condition and directing them to take corrective action.
It is advantageous to complete all of the above steps (1)-(5), and particularly steps (4) and (5), as quickly as possible so that corrective action can take place in the error-infected parts of the computer as soon as possible. Preferably, correction in the error-infected parts should complete without slowing down operations in non-infected parts of the computer system.
A unique set of problems develops when the computer includes a central data-routing unit (e.g. a crossbar data router). Such a unit routes data selectively between data-possessing parts of the computer (data sources) and data-needing parts of the computer (data destinations). Typically, the router unit operates under the command of a central control unit.
If an error condition arises in a control path of the data-routing unit, data might be fetched from the wrong source, or delivered to a wrong destination, or it might become lost in the router and never delivered to any destination, and/or the control error may damage existing data then being transferred through the data-routing unit to a different destination.
The central control unit of the computer is conventionally charged with the task of coordinating recovery activities when an error condition is detected in one or more control signals of the routing unit.
The central control unit has to determine where each error-infected control signal is located within the router or a related component. It has to decide which of the numerous data source and destination parts of the computer are to have their most recent operations nullified because of the error. And it has to determine which specific blocks of data in the router or data destinations are infected by the error. The central control unit then has to direct the affected destination points to discard error-corrupted portions of the data they received from the router, and it has to also direct the corresponding data sources to retransmit such data through the router to the intended destinations.
The router control signals which could be infected with errors typically include some or all of the signals named in the following, non-exhaustive listing:
(1) Data-request signals sent from data-needing parts of the computer to indicate that the requesters are ready for or need new data;
(2) Source address/size signals for identifying the source of the requested data (the data-possessing part which is to send the data) and the number of data bits to be sent;
(3) Source sequence signals for indicating the sequence in which source data items will be delivered to the router (e.g., by numerical sequence or alphabetic sequence or some other sequence);
(4) Source delivery-complete signals for indicating when delivery of data to the router from each source is complete;
(5) Intermediate routing signals for identifying an intermediate route through the router which source data will take as it travels downstream from a source point to a destination point;
(6) Destination address signals for identifying the destination point or points to which each block of source data will be sent (for identifying each data-needing part that will receive the routed data);
(7) Destination sequence signals for indicating the sequence in which data items will be delivered from the router to each destination point; and
(8) Destination delivery-complete signals for indicating when delivery of data from the router to a destination point is complete.
Conventionally, when an error condition is detected in one of the control signals of the router, the central control of the computer coordinates error recovery activities by:
(a) First inspecting destination address signals held in the router to determine which destination points were to receive the error-corrupted data;
(b) Secondly, inspecting source address signals which are also held within the router to determine the size or extent of data corrupted by the error condition and to identify the source from which the error-corrupted data came;
(c) Thirdly, determining the extent of damage and figuring out how to handle the particular error condition;
(d) Fourthly, routing error-reporting signals to the affected destination points over a set of control paths (which control paths are provided separately from data routing paths) to thereby warn the destination points that recently received data is suspect and should be discarded; and
(e) Fifthly, routing error-recovery signals (retry commands) to the affected source points over a set of control paths (which are provided separately from the data routing paths) to thereby instruct the source points to halt transmission of the error-corrupted block if not yet completed and retransmit the data starting with the part which the destination points have been instructed to discard.
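By way of illustration only, conventional recovery steps (a) through (e) may be sketched in Python as set out below. The names Transfer, ControlPaths and conventional_recovery, and the message strings, are hypothetical and do not appear in any referenced system; step (c) is reduced here to an assumed fixed policy of discard and retry.

```python
from dataclasses import dataclass, field


@dataclass
class Transfer:
    """One in-flight transfer as recorded inside the router (hypothetical)."""
    source: str
    destinations: list[str]
    size_bits: int


@dataclass
class ControlPaths:
    """Stand-in for control paths kept separate from the data routing paths."""
    sent: list[tuple[str, str]] = field(default_factory=list)

    def send(self, target: str, message: str) -> None:
        self.sent.append((target, message))


def conventional_recovery(transfer: Transfer, paths: ControlPaths) -> None:
    # (a) Inspect the destination address signals held in the router.
    affected_destinations = list(transfer.destinations)
    # (b) Inspect the source address/size signals held in the router.
    source, extent = transfer.source, transfer.size_bits
    # (c) Decide how to handle the error; assumed here: always discard and retry.
    # (d) Warn each affected destination that received data is suspect.
    for destination in affected_destinations:
        paths.send(destination, f"DISCARD {extent} bits")
    # (e) Command the source to halt and retransmit the discarded part.
    paths.send(source, f"RETRY {extent} bits")


paths = ControlPaths()
conventional_recovery(Transfer("S0", ["D1", "D2"], 64), paths)
for target, message in paths.sent:
    print(target, message)
```

In this sketch every step is performed serially by the central control, so the retry command of step (e) is issued only after steps (a) through (d) have completed.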
There are several drawbacks to this approach.
A first drawback is that an excessive amount of time may be consumed by step (a), in which the central control determines which destination points are to be informed.
A second drawback arises from the characteristics of data flow in high-speed computers and requires somewhat more explanation. Signals are continuously flowing from one part to the next. By the time an error condition is detected and acted upon, destination address signals will have already moved downstream, from a source point into an interior part of the router. Often, the address signals are at that time held exclusively in the router. It may take the central control unit a substantial amount of time to retrieve them out of the router. While this is disadvantageous in itself, it should be recognized that even when they are finally retrieved, the destination address signals may be of little value if a problem resides within address-carrying circuits of the router that are located downstream of the point from which the destination address signals were retrieved. If the more downstream version of the address signals (e.g., the version flowing closer to the destination end of the router) is corrupted by an error condition (e.g. by line noise), the retrieved upstream version of the destination address signals may fail to correctly point to the downstream destination which actually receives the data.
A third drawback of the above approach is that an excessive amount of wiring might be required for supporting above steps (c), (d) and (e). Because special wires extend from the central control unit for determining the extent of damage in other parts of the machine, for sending error reporting signals to the destination points and for sending retry signals to corresponding source points, the number of wires grows as the number of source, destination and intermediate points in the system grows.
Also, because it takes time to determine the extent of damage and/or because wire lengths are not always the same, it may take more time for error-reporting signals to reach all the affected destination points than it does for the initial error-corrupted data to move completely out of a source unit and completely into a destination unit. If data runs ahead of error reports, the computer is burdened with the job of correcting a wider propagation of erroneous data. The data source continues to transmit new blocks of data until it is commanded to stop and retry from an earlier point. The data destination unit (which unit is not necessarily the desired destination point) continues to receive and perhaps act on the error-infected data until it is told to stop. After source and destination units halt their respective error-affected activities, the computer's central control has to determine for each separately-transmitted or received item of data whether it is affected by the error condition and it then has to indicate in some way that the item of data is part of an error-infected block.
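The effect of data running ahead of error reports may be illustrated with a small calculation. The sketch below assumes, purely for illustration, that a source starts a new block every fixed number of cycles and that the halt command reaches the source a fixed number of cycles after the error occurs; the function name and the cycle counts are hypothetical.

```python
def blocks_started_before_halt(report_latency_cycles: int, cycles_per_block: int) -> int:
    """Count blocks the source starts sending before the halt command arrives.

    The error is assumed to occur at cycle 0, as the corrupted block starts.
    Blocks start at cycles 0, cycles_per_block, 2 * cycles_per_block, and so on.
    Every block that starts strictly before the halt arrives is counted,
    including the corrupted block itself.
    """
    if cycles_per_block <= 0:
        raise ValueError("cycles_per_block must be positive")
    if report_latency_cycles < 0:
        raise ValueError("report_latency_cycles must not be negative")
    # Ceiling division: number of start times k * cycles_per_block below the latency.
    return -(-report_latency_cycles // cycles_per_block)


# Hypothetical figures: halt arrives 10 cycles after the error, 4 cycles per block.
print(blocks_started_before_halt(10, 4))
```

Under these assumed figures, blocks starting at cycles 0, 4 and 8 are all sent before the halt arrives, and each must afterwards be examined by the central control to determine whether it belongs to the error-infected data.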
A fourth drawback of the above approach is that operations may have to be halted for excessively long time periods throughout the computer, even in parts not affected by the error.
A fifth drawback is that system size and costs increase substantially when upward scaling of the system is carried out. In such scaling, the computational capabilities of the system are enhanced by increasing the number of source/destination points tied to the router. This inherently increases the number of wires needed for separately transmitting error-reports/retry-commands to destination/source points in the case of a control error. As the number of service ports on the router is increased, additional wires have to be added between the router and the central control for enabling the central control to inspect the conditions at the additional service ports when an error condition is detected. Also, system speed decreases with upward scaling because, as the number of source/destination points increases, the central control unit is burdened more and more with the job of overseeing recovery activities in an increasing number of source and destination points.
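The scaling problem may be illustrated with a rough count of dedicated control wires. The sketch below assumes, for illustration only, one retry wire per source point, one error-report wire per destination point, and a fixed number of inspection wires per router service port; actual wire counts are not given in the text above and would depend on the particular design.

```python
def dedicated_control_wires(sources: int, destinations: int,
                            router_ports: int, inspect_wires_per_port: int = 1) -> int:
    """Estimate wires running to the central control under the stated assumptions."""
    retry_wires = sources            # assumed: one retry-command wire per source
    report_wires = destinations      # assumed: one error-report wire per destination
    inspection_wires = router_ports * inspect_wires_per_port
    return retry_wires + report_wires + inspection_wires


# Hypothetical system sizes: doubling the system doubles the wire count.
print(dedicated_control_wires(4, 4, 8))
print(dedicated_control_wires(8, 8, 16))
```

Even under these simple assumptions the wire count grows in direct proportion to the number of source points, destination points and service ports, which is the growth described for the fifth drawback.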
A better approach is needed.