The present invention relates to a distributed processing system, and more particularly to a distributed processing system that includes mutually related subsystems, in which a fault in one subsystem may adversely affect other subsystems, and in which a system-down condition is prevented and the degree or magnitude of the failure is diagnosed so as to improve system reliability.
FIG. 1 shows a table comparing fault detection, diagnosis, and recovery processing based on the result of the diagnosis in previously proposed distributed processing systems with the recovery processing scheme according to the present invention. The ordinate in FIG. 1 represents the object to be detected and diagnosed, and the abscissa represents the object in need of recovery as a result of the detection and diagnosis.
In FIG. 1, an apparatus which detects a fault in its own subsystem and, based on the detection of the fault, recovers from the fault in its own subsystem or from faults in another subsystem is called a self-diagnostic tester. In a self-diagnostic tester, each subsystem is configured on the presumption that "another subsystem perfectly detects and diagnoses all faults and recovers from the faults based on the result of the diagnosis". Accordingly, if a subsystem fails to detect, diagnose or recover from a fault, or makes a mistake in doing so, the other subsystems are influenced and a system-down condition is caused. An apparatus which detects and diagnoses a fault in another subsystem and recovers from the fault in the other subsystem based on the result of the diagnosis is called a centralized tester. A failure or a mistake in the detection, diagnosis or recovery in a centralized tester also directly affects other subsystems and causes the system-down condition.
In accordance with the present invention, on the other hand, an autonomous tester detects and diagnoses faults in other subsystems and protects its own subsystem from the faults of the other subsystems based on the result of the diagnosis. The functions of these three types of testers are illustrated in FIGS. 2 and 3.
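The classification implied by the two axes of FIG. 1 can be sketched as a small lookup. This is only an illustrative sketch: the names `Target` and `classify_tester` are hypothetical and do not appear in the specification. It assumes that a tester is characterized solely by which subsystem it diagnoses and which subsystem its recovery processing acts upon.

```python
from enum import Enum


class Target(Enum):
    """Hypothetical labels for the two axes of FIG. 1."""
    OWN = "own subsystem"
    OTHER = "other subsystem"


def classify_tester(detects: Target, recovers: Target) -> str:
    """Name the tester type for a (diagnosed object, recovered object) pair."""
    if detects is Target.OWN:
        # Diagnoses its own faults; recovery may act on itself or on others.
        return "self-diagnostic"
    if recovers is Target.OTHER:
        # Diagnoses other subsystems and recovers those subsystems.
        return "centralized"
    # Diagnoses other subsystems but acts only to protect itself.
    return "autonomous"
```

Under these assumptions, only the combination of diagnosing other subsystems while recovering the tester's own subsystem corresponds to the autonomous tester of the present invention.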
Self-diagnostic testers are of the two types shown in FIGS. 2A and 2B. A tester in a subsystem 1 controls a fault recovery processing operation 21 in its own subsystem 1 or a fault recovery processing operation 22 or 22' in other subsystems 2 or 3 based on the detection and diagnosis 11 or 12 of a fault in its own subsystem. If the fault recovery processing operation is not correctly effected (symbol x in FIGS. 3A and 3B), the other subsystems 2 and 3 are affected (31, 32 or 32') and faults are caused therein.
In a centralized tester (FIG. 2D), the tester in the subsystem 1 detects and diagnoses (14, 14') faults in the other subsystems 2 and 3 and controls fault recovery processing operations 24 and 24' in the subsystems 2 and 3 based on the diagnosis. Accordingly, a fault in the subsystem 1 directly affects (34, 34') the other subsystems 2 and 3.
On the other hand, in an autonomous tester (FIG. 2C), the tester in the subsystem 1 detects and diagnoses (13, 13') faults in other subsystems 2 and 3 and controls a fault recovery processing operation 23 to protect its own subsystem from being affected by faults in other subsystems based on the diagnosis. Accordingly, a failure or a mistake in the detection, diagnosis or recovery processing operation for the fault in the subsystem 1 does not affect the other subsystems 2 and 3.
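The difference in how a tester failure propagates under the three schemes can be summarized in a minimal sketch. The function name `others_affected` and the string labels are hypothetical; the model assumes, as the description above states, that a failed or mistaken self-diagnostic or centralized tester exposes the other subsystems, whereas a failed autonomous tester does not.

```python
def others_affected(tester_type: str, others: set) -> set:
    """Return the other subsystems exposed when the tester in subsystem 1
    fails or errs in detection, diagnosis or recovery (illustrative model)."""
    if tester_type == "self-diagnostic":
        # An unrecovered fault in subsystem 1 spreads to the others (31, 32, 32').
        return set(others)
    if tester_type == "centralized":
        # Recovery of subsystems 2 and 3 depends on subsystem 1 (34, 34').
        return set(others)
    if tester_type == "autonomous":
        # The tester acts only on its own subsystem, so the others are untouched.
        return set()
    raise ValueError(f"unknown tester type: {tester_type}")
```

For example, `others_affected("autonomous", {2, 3})` yields an empty set, while the same call with `"centralized"` yields both subsystems 2 and 3.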
In the prior art distributed processing system, particularly in a system having a plurality of processors coupled through transmission lines, when a transient fault occurs, there is no means to determine whether it is indeed only a temporary fault or whether it is highly probable that the fault will become permanent.