There are many systems which include a plurality of modules distributed along a common link, such as a bus or a plurality of buses. Often, time division multiplexing is utilized to provide efficient information transfer between the distributed modules and to achieve maximum system capacity. In systems of this kind, the bus or buses are divided into a plurality of time slots. Each module is assigned a predetermined time slot into which it can insert information onto the bus, and is provided with means for receiving information from any one of the time slots. In this manner, any one module is capable of transferring information to any other module and, in turn, of receiving information from any other module.
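By way of illustration only, the slot arrangement described above may be sketched as follows. The sketch is a toy model and not a description of any particular system: the class name `TdmBus`, the module names, and the use of one frame held in a list are all assumptions introduced for the example.

```python
from typing import Dict, List, Optional


class TdmBus:
    """Toy model of one TDM frame: one write slot per module, every slot readable."""

    def __init__(self, slot_of_module: Dict[str, int], num_slots: int) -> None:
        slots = list(slot_of_module.values())
        if len(set(slots)) != len(slots):
            raise ValueError("each module needs its own predetermined time slot")
        if any(not 0 <= s < num_slots for s in slots):
            raise ValueError("slot index out of range")
        self.slot_of_module = dict(slot_of_module)
        self.frame: List[Optional[bytes]] = [None] * num_slots

    def insert(self, module: str, payload: bytes) -> None:
        # A module may write only into the slot assigned to it.
        self.frame[self.slot_of_module[module]] = payload

    def receive(self, source_module: str) -> Optional[bytes]:
        # Any module may read any slot; the reader selects it by source module.
        return self.frame[self.slot_of_module[source_module]]


# Hypothetical three-module system on a four-slot bus.
bus = TdmBus({"A": 0, "B": 1, "C": 2}, num_slots=4)
bus.insert("A", b"hello from A")
bus.insert("C", b"hello from C")
```

Because writes are confined to a module's own slot while reads may address any slot, any module can reach any other without contention within a frame.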
In addition to the foregoing, one of the time slots of one of the buses or a separate bus can be dedicated to permit each module to address or send data to any one of the other modules. Further, a central common control of the data bus or slot is typically provided to control the overall operation of the system. The central common control can provide, for example, system clocks, data bus or slot arbitration, and guard tone generation. Because of the importance of the central common control to the system, it is typically provided in a redundant manner so that if one central common control develops a fault, the system can be switched over to the other redundant central common control.
A problem which can arise in such a system is the location of the intelligent fault detecting module of the system. If it resides in a central, stand-alone location, system reliability is compromised should the central fault detecting module or node fail.
Protection against misuse of the buses is another problem. Failure, within a module, of any unshared circuitry associated with communicating on the buses could render the buses either totally or partially inoperative.
Prior art systems have addressed this problem by switching to redundant buses and bus devices. While such approaches can be generally successful, they exhibit certain undesirable effects. First, they increase system cost, because all bus drivers and related circuitry must be duplicated. Second, total system capacity may not be realized with just one time division multiplexed (TDM) bus. As a result, a plurality of redundant buses may be required.
Prior art systems are generally arranged so that if a module failure renders one of the buses inoperative, either all the buses are switched over to redundant buses or just the failed bus is switched to its redundant bus. Neither of these arrangements is totally satisfactory, and in the latter, additional input-output or decoding circuitry must be provided for every TDM switch user to selectively and properly make the switch. This approach both adds cost to the system and adversely affects system reliability.
One improvement to prior art systems of this type is fully disclosed and claimed in copending U.S. application Ser. No. 511,701 filed July 7, 1983 for Method and Apparatus For The Selection of Redundant System Modules, which application is assigned to the assignee of the present invention and incorporated herein by reference. The system there disclosed includes a redundant central common control referred to as MUX Common. However, the switching between the main MUX Common and the redundant MUX Common is not initiated by a centrally located fault detecting module. Instead, a plurality of modules associated with the buses are active fault detecting modules or nodes, each continuously checking the system in parallel for faults. When a fault is detected by one of these active modules, it places a vote indicating that a fault has been detected. If a predetermined number, for example, a majority, of the active modules vote, the system then switches from the then active MUX Common to the other MUX Common. Hence, the switching to the redundant module is not commanded by a single fault detecting module, but instead, by a majority of a plurality of fault detecting modules distributed throughout the system. As a result, since a single fault detecting node is not relied upon, system reliability is greatly improved.
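By way of illustration only, the voting arrangement described above may be sketched as follows. The function names, the module names, and the representation of votes as a dictionary are assumptions made for the example; the threshold is shown as a simple majority, which the text gives only as one example of the predetermined number.

```python
from typing import Dict


def majority(num_modules: int) -> int:
    """Smallest vote count that is more than half of the active modules."""
    return num_modules // 2 + 1


def should_switch(votes: Dict[str, bool], threshold: int) -> bool:
    """True when at least `threshold` active modules have voted that a fault exists."""
    return sum(1 for voted in votes.values() if voted) >= threshold


def next_mux_common(active: str) -> str:
    """Toggle between the two MUX Commons (labels are hypothetical)."""
    return "redundant" if active == "main" else "main"


# Hypothetical five-module system: three modules have detected a fault.
votes = {"m1": True, "m2": False, "m3": True, "m4": True, "m5": False}
active = "main"
if should_switch(votes, majority(len(votes))):
    active = next_mux_common(active)
```

A single module voting, whether correctly or because it has itself failed, cannot force a switchover under this rule, which is the reliability benefit attributed above to the distributed arrangement.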
Even though the foregoing system exhibits many advantages over prior systems for detecting faults, the switching to a redundant MUX Common may not always rectify the fault or problem with the system. The present invention provides a further improvement thereto in that not only is the fault detection distributed throughout the system, but the fault isolation and recovery are distributed throughout the system as well. As a result, a single node is not relied upon for fault isolation and recovery. Instead, this function is distributed throughout the system so that if one fault isolation and recovery node fails, another one immediately takes its place to restore the system to optimized operation.
It is therefore a general object of the present invention to provide a new and improved distributed fault isolation and recovery system and method for recovering, in an optimized configuration, a system of the type including a plurality of modules or nodes which experiences a fault.
It is a further object of the present invention to provide such a system and method wherein the fault isolation and recovery process is initiated based upon a distributed detection of a fault.
It is a still further object of the present invention to provide such a system and method wherein any one of a plurality of modules or nodes includes means for performing the testing required for recovering the system.
It is still another object of the present invention to provide such a system and method wherein the node or module performing the testing of the system must pass internal testing prior to proceeding with the fault isolation and system recovery.
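By way of illustration only, the requirement that a node pass internal testing before proceeding, together with the takeover by another node when one fails, may be sketched as follows. The function name, the node names, the fixed candidate ordering, and the modelling of a self-test as a callable returning a boolean are all assumptions made for the example; the text above does not specify how the next node is chosen.

```python
from typing import Callable, Optional, Sequence, Tuple

# A candidate node is modelled as (name, self_test); both are hypothetical.
Node = Tuple[str, Callable[[], bool]]


def select_recovery_node(candidates: Sequence[Node]) -> Optional[str]:
    """Return the first candidate that passes its own internal test.

    A node that fails its self-test is skipped, so the next candidate
    takes its place. None is returned if no candidate qualifies.
    """
    for name, self_test in candidates:
        if self_test():
            return name
    return None


# Hypothetical example: the first node is itself faulty, so the second takes over.
candidates = [
    ("node1", lambda: False),
    ("node2", lambda: True),
    ("node3", lambda: True),
]
chosen = select_recovery_node(candidates)
```

Gating on the self-test keeps a node that is itself faulty from directing the isolation and recovery of the rest of the system.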