1. Field of the Invention
The invention relates in general to a data processing device, and more particularly, to an ultra high availability clock chip for use by multiple, synchronized card assemblies.
2. Description of Related Art
High-availability file servers and control units of redundant (RAID) disk drive sets, as well as massively parallel processor arrays, are running up against the practical barrier of cost control. Multiple card assemblies that must work together have often been built with an independent clock source on each card. In some configurations, synchronous operation of the card set is an advantage to the assembled function, so inventive solutions have been found to provide independent clock sources that also work synchronously. However, a fault in any of the clock channels may seriously impair the synchronization of the other clock sources, and thus undermine the operation of the entire system.
A clock channel fault may comprise an intermittent connection, a shift in the frequency of one of the clock channels due to environmental effects, or a component failure in the circuitry of one of the clock channels. In the worst case, one of the clock channels may fail completely, effectively terminating operation of the system to which it is connected as a time base.
Nevertheless, it is possible to partition the functions on a single ASIC in a way that makes the function it performs immune to virtually all forms of single failure, and to many forms of double failure that can occur. The concepts used are redundant connections and majority logic.
Redundant connections means that each signal, as well as power and ground, is connected to the chip through several pins. The power and ground connections on the chip are made through a matrix of wires. The output signals are delivered to each customer card through several independent wires.
Majority logic is the concept of using at least three signals, any two of which must agree to satisfy the function. The simplest form of majority logic is a 2×3 AND-OR combination, that is, three two-way ANDs that are ORed together. Logically, if any two signals are 'true' at the same time, the function is satisfied and a 'true' is propagated. For example, a crystal oscillator input on the ASIC would be received through three pins, two of which would have to agree, or the signal would be ignored.
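By way of illustration only, the 2-of-3 AND-OR combination described above may be modeled in software as follows. This is a minimal sketch, not the circuit of the invention; the function name majority_2_of_3 and the use of Boolean values to stand for the three pin inputs are assumptions made for the example.

```python
from itertools import product


def majority_2_of_3(a: bool, b: bool, c: bool) -> bool:
    """Three two-way ANDs ORed together (hypothetical helper name)."""
    return (a and b) or (a and c) or (b and c)


# Exhaustively compare the AND-OR form with the plain definition
# "at least two of the three inputs are true".
for a, b, c in product([False, True], repeat=3):
    assert majority_2_of_3(a, b, c) == (a + b + c >= 2)
```

In this model a single disagreeing input, whether stuck 'true' or stuck 'false', never changes the result produced by the two inputs that agree.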
Clearly, it is desirable to provide a redundant clock system that is able to tolerate a limited number of faults without loss of synchronization of the clock channels that continue to operate properly.
One technique for achieving fault tolerance is modular redundancy. Redundancy at the component level, i.e., interdependent multiple clock channels within a circuit, rather than at the system level, is needed to obtain the required reliability. In a fault tolerant computing system that comprises multiple processors operating in lockstep using redundant clocks, the clocks must be synchronized in order for the computing system to be able to effectively compare data and mask out faults. A fault tolerant clock must be extremely reliable to meet the reliability requirement for its host fault tolerant computer. To maintain the synchronization and reliability of the clock, it is important that the design be simple and require a minimal number of components.
Using the redundancy principles discussed above, a fault tolerant clock typically has three or more clock channels each comprising an oscillator having a feedback path that contains a majority voter to tolerate a single fault. The majority voter receives the outputs of all channels and provides a clock output signal that reflects the state of the majority of the channel outputs.
In the most common form of modular redundancy, three identical processors or machines are employed in a triple modular redundancy (TMR) configuration in which the processors work synchronously on the same task and their outputs are voted by hardware or software to provide a majority answer. For reliability and efficiency, real time clocking of the processors is preferably provided by employing a fault-tolerant hardware clock system comprising three redundant synchronized clock circuits and a majority voter to permit continued correct system operation with the loss of less than a majority of the clock circuits. This is possible because of the masking action of the majority voter. However, in a triple modular redundancy (TMR) system, if one clock circuit fails the system cannot tolerate a second failure.
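The masking action of the majority voter, and its limit in a TMR configuration, may be illustrated with a short software model. This is a sketch under stated assumptions: each clock circuit is reduced to a single Boolean output, a fault is modeled as a channel stuck at a fixed value, and the names vote, tmr_output and masks_all_faults are hypothetical helpers introduced only for this example.

```python
def vote(channels):
    """Return the majority value of an odd number of Boolean outputs."""
    return sum(channels) * 2 > len(channels)


def tmr_output(ideal, stuck):
    """Voted output of three channels.

    ideal: the value every healthy channel produces.
    stuck: dict mapping a faulty channel index to its stuck value.
    """
    channels = [stuck.get(i, ideal) for i in range(3)]
    return vote(channels)


def masks_all_faults(faulty_indices):
    """True if the voter hides every stuck-at fault on the given channels."""
    for ideal in (False, True):
        for stuck_value in (False, True):
            stuck = {i: stuck_value for i in faulty_indices}
            if tmr_output(ideal, stuck) != ideal:
                return False
    return True


# Any single faulty channel is outvoted by the two healthy ones.
assert all(masks_all_faults([i]) for i in range(3))
# Two faulty channels can outvote the one healthy channel.
assert not masks_all_faults([0, 1])
```

Under this simplified model, the voted output follows the healthy channels for any single fault, but two channels stuck at the same wrong value determine the output, which corresponds to the second failure that a TMR system cannot tolerate.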
The failure rate of a single ASIC (Application Specific Integrated Circuit) is normally calculated for the case of any single failure of any element that is used to create it. If anything breaks, it is considered a total failure event. Each chip I/O circuit, wire, module pin, etc. could be the source of the failure. About the best that any commercial ASIC has achieved is on the order of 10^6 failures per 1000 power-on hours.
It can be seen, then, that there is a need for a single noninterruptible clock source to reduce system cost without reducing system availability.
It can also be seen that there is a need for an ultra high availability clock chip for use by multiple, synchronized card assemblies.