In many information-processing systems, particularly systems used for making critical decisions in real time, a high degree of reliability is required, such that the allowable probability of a failure per unit time is extremely low. Such systems, as may be used in highly critical applications, such as in aircraft, space vehicles, medical applications and the like, demand a substantially high level of processing performance. Such performance includes not only the provision of a high data throughput and a large memory capability but also the ability to satisfy whatever unique requirements are imposed by the real-time operational environment. Thus, the processing architecture must be designed to be capable of adapting itself to the various requirements of the task being performed in real time.
Conventional redundant processing systems, which can normally be used for many applications, often do not have a sufficient degree of reliability to be safely used in highly critical applications. It is therefore desirable to provide more effective approaches to the problem of fault tolerance, particularly where more than one fault may have to be tolerated, e.g., in systems in which a single fault which arises cannot be corrected before another fault arises.
Further, such systems should be designed to handle failures which may arise because of unpredictably arbitrary behavior on the part of one or more failed components. Such failures are often referred to as Byzantine faults, or as giving rise to "malicious" errors.
One such system which has been proposed is disclosed in our co-pending U.S. patent application, Ser. No. 07/187,474, filed on Apr. 28, 1988 concurrently with this application, now U.S. Pat. No. 4,907,232, issued Mar. 6, 1990, and entitled "Fault-Tolerant Parallel Processing System", such patent being incorporated herein by reference.
One of the requirements for such systems is that redundant processing sites must be synchronized to within a known time skew in order to guarantee the most efficient operation thereof, particularly in the face of faulty behavior, as well as to allow the detection of a faulty, or an excessively slow, processing element in the system. In most such systems, constraints are imposed upon the processing elements by the need for such synchronization. It is desirable that the synchronization mechanism used be effectively transparent to the applications program being implemented by the system and that the synchronization be suitable for use in a distributed system having multiple redundant processing sites.
One technique that can be used for such systems is often referred to as a "tight" synchronization process, e.g., a process using hardware synchronization mechanisms which constrain the operations of the redundant processing sites by requiring that such sites be deterministically related to the passage of time as measured by a hardware clock.
Such a clock-determinism constraint does not permit a wide range of acceptable behavior on the part of the processing sites. For example, if each site has error-correcting memory and the system is designed to synchronize the processing-site channels by using a hardware-determined clock, each channel must wait for the worst-case error-correction time period on each memory access, because a given channel does not know when another channel might encounter a memory error and have to correct it. If each channel did not wait for such a worst-case time period, then the error-correction process, which could lengthen the number of clock cycles taken by one processor to execute a memory access, would cause the processors to lose synchronization. Further, such a clock-deterministic approach excludes the possibility of using relatively diverse hardware and software designs as a technique for tolerating common-mode faults.
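The cost of this worst-case padding can be illustrated with a minimal sketch. The cycle counts and error rates below are purely hypothetical (the application gives no figures); the sketch only shows why lockstep channels must charge every memory access at the worst-case error-correction latency, while asynchronous channels would pay that latency only on the rare accesses that actually need correction.

```python
# Hypothetical illustration of the worst-case-wait penalty in
# clock-deterministic (lockstep) synchronization. All numbers are invented.

NORMAL_CYCLES = 4        # memory access when no ECC correction is needed
WORST_CASE_CYCLES = 10   # memory access when a correctable error occurs

ACCESSES = 1000          # total memory accesses per channel
CORRECTIONS = 3          # accesses that actually hit a correctable error

# Lockstep: no channel knows when another channel is correcting an error,
# so every access on every channel is padded to the worst case.
lockstep_total = ACCESSES * WORST_CASE_CYCLES

# Asynchronous: each channel pays the correction latency only when it
# actually encounters an error.
async_total = (ACCESSES - CORRECTIONS) * NORMAL_CYCLES \
              + CORRECTIONS * WORST_CASE_CYCLES

print(lockstep_total)    # cycles consumed under lockstep
print(async_total)       # cycles consumed if channels ran independently
```

Because corrections are rare, the lockstep total (10,000 cycles in this invented example) dwarfs the asynchronous total (4,018 cycles); nearly all of the difference is idle padding imposed solely to keep the channels clock-deterministic.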
Another approach is found in software-based synchronization techniques rather than in the use of a hardware-determined clock. Such software-based synchronization techniques tend to constrain applications programmers by forcing application code segments to fit within an arbitrarily imposed time frame, which must in turn be sized to accommodate the worst-case execution time of all of the processing tasks being executed in the frame. The need for software overhead, coupled with the fact that frame-synchronous systems tend to utilize an excessive proportion of processor capability in executing synchronization and scheduling activities as opposed to processing activities, makes such software/frame-synchronous systems relatively inefficient and difficult to program. It is desirable to develop a synchronization system which depends neither on a hardware-clock-deterministic approach nor on a software-based synchronization technique as in presently used systems.
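The frame-sizing constraint described above can also be sketched briefly. The task names, execution times, and overhead figure below are invented for illustration only; the point is that the frame length is dictated by the sum of worst-case execution times plus synchronization overhead, so on a typical pass the processor sits idle for most of the frame.

```python
# Hypothetical frame-synchronous schedule. Each task has a typical and a
# worst-case execution time (milliseconds); all values are invented.

tasks = {                  # task name -> (typical_ms, worst_case_ms)
    "sensor_read": (1.0, 3.0),
    "control_law": (2.0, 6.0),
    "actuate":     (0.5, 2.0),
}

SYNC_OVERHEAD_MS = 1.0     # per-frame synchronization/scheduling cost

# The frame must accommodate the worst case of every task, plus overhead.
frame_ms = sum(wcet for _, wcet in tasks.values()) + SYNC_OVERHEAD_MS

# On a typical pass, only the typical execution times are actually used.
typical_busy_ms = sum(typ for typ, _ in tasks.values())
utilization = typical_busy_ms / frame_ms

print(frame_ms)                  # frame length forced by worst cases
print(round(utilization, 2))     # fraction of the frame doing useful work
```

In this invented example the frame must be 12.0 ms long even though a typical pass needs only 3.5 ms of processing, for a utilization of about 29%; the remainder is consumed by worst-case margin and synchronization overhead, which is the inefficiency the passage attributes to frame-synchronous systems.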