Dependable systems are characterized by, among other attributes, their level of fault tolerance, safety, and reliability. While those terms are often used interchangeably in the literature, there are profound differences among them. For example, reliability deals with component failures and their impact on system availability. Failed components could conceivably cause unsafe system conditions, not necessarily because of the failure itself, but because of a lack of sufficient constraints around the conditions the failure creates.
Safety, on the other hand, is a control- and/or data-related issue. Unsafe conditions occur because of a lack of sufficient control within the system; insufficient or invalid data could also create unsafe conditions.
The question of how to deal with an unsafe condition or a fault condition depends on the application and the type of fault. In some control applications, a failsafe state is reached by shutting down the system, or part of it, when a critical fault occurs. Determining the root cause of the problem may not be the primary goal, because as long as the system safely shuts down following a fault condition, that may be sufficient for failsafe operation. In other applications, system shutdown is not a feasible solution; instead, a set of actions ensuring a transition to a safe operational condition is required.
Fault tolerance has traditionally been associated with defining a level of redundancy for the system components and/or the connectivity between those components. While that has proved essential for safety-critical systems, it is a solution tailored mostly towards overcoming reliability problems and, in some cases, data-related issues. Redundancy can be applied at both the hardware and software levels. Exemplary software failsafe techniques include complement data read/write, checksum comparison, redundant coding, and orthogonal coding. The effectiveness of redundancy against problems originating from a lack of sufficient control over system operation is quite debatable, because any flaw in the control laws is necessarily replicated in both the primary and secondary systems. Furthermore, redundant solutions can only be provided for those sources of problems that are known in advance.
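To make one of the software-level techniques concrete, the following Python sketch illustrates complement data read/write: a value is stored together with its bitwise complement, and a disagreement between the two copies on read-back signals corruption. The function and key names are illustrative, not taken from any cited system.

```python
def write_with_complement(store, key, value):
    """Store a 16-bit value together with its bitwise complement
    (the complement data read/write technique; names are illustrative)."""
    store[key] = value & 0xFFFF
    store[key + "_comp"] = ~value & 0xFFFF

def read_with_complement(store, key):
    """Read a value back and verify it against its complement copy;
    a disagreement (e.g. a bit flip in one copy) raises an error."""
    value, comp = store[key], store[key + "_comp"]
    if (~value & 0xFFFF) != comp:
        raise ValueError("redundancy check failed for %r" % key)
    return value

store = {}
write_with_complement(store, "setpoint", 0x1234)
ok_value = read_with_complement(store, "setpoint")

store["setpoint"] ^= 0x0100        # simulate a single-bit corruption
try:
    read_with_complement(store, "setpoint")
    detected = False
except ValueError:
    detected = True
```

Note that, consistent with the limitation discussed above, this check only detects corruption of stored data; it cannot detect a flaw in the control logic that computed the value in the first place.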
Fault Analysis at Design & System Testing Stage
A common approach to developing fault-tolerant or safe systems has been for the design team to spend an extensive amount of time and effort at the control system design stage, followed by extensive testing and validation after the system is built. A number of semi-formal techniques, mainly borrowed from reliability engineering, can facilitate that analysis. Examples include FTA (Fault Tree Analysis) and FMEA (Failure Modes and Effects Analysis). FTA is a deductive technique used to identify the specific causes of potential undesirable conditions. The top event in the tree is a previously known potential condition. In principle, FTA is similar to backward diagnosis, which utilizes backward-chaining logic to determine the sequence of events and conditions leading to a given failure. Such techniques are described, for example, in the following: D. Swaroop, “String Stability of Interconnected Systems,” PhD thesis, University of California, Berkeley, 1994; S. Lapp and G. Powers, “Computer-aided Synthesis of Fault Trees,” IEEE Transactions on Reliability, Vol. 26, No. 1, pp. 2-13, April 1977; A. Misra, “Sensor-Based Diagnosis of Dynamical Systems,” PhD thesis, Vanderbilt University, 1994; R. De Vries, “An Automated Methodology for Generating a Fault Tree,” IEEE Transactions on Reliability, Vol. 39, pp. 76-86, 1990; and M. Sampath, “A Discrete Event Systems Approach to Failure Diagnosis,” PhD thesis, University of Michigan, 1995.
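As a minimal illustration of the deductive character of fault tree analysis, the sketch below evaluates whether a hypothetical top event occurs, given a set of basic events that have failed. The tree structure, gate encoding, and event names are all invented for illustration.

```python
# A fault tree node is either a basic event (a string) or a gate:
# ("AND", [children]) or ("OR", [children]).  Structure is illustrative.
def top_event_occurs(node, failed_events):
    """Deductively evaluate whether the top event occurs given the set
    of basic events that have actually failed."""
    if isinstance(node, str):                      # leaf: basic event
        return node in failed_events
    gate, children = node
    results = [top_event_occurs(c, failed_events) for c in children]
    return all(results) if gate == "AND" else any(results)

# Hypothetical tree: the top event requires a power loss OR
# (a stuck valve AND a sensor fault).
tree = ("OR", ["power_loss",
               ("AND", ["valve_stuck", "sensor_fault"])])

cut_set_found = top_event_occurs(tree, {"valve_stuck", "sensor_fault"})
```

Backward chaining over such a tree, starting from the top event and descending through the gates, recovers the combinations of basic events (the cut sets) that can lead to the failure.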
FMEA is an inductive analysis technique used to identify and evaluate potential undesirable conditions, and to determine control actions that eliminate or reduce the risk of those conditions. Another approach is forward diagnosis, in which state hypotheses are made in the model and updated according to current events and sensor readings until a conclusion is reached. That approach is described by M. Steinder and A. S. Sethi, “Probabilistic fault diagnosis in communication systems through incremental hypothesis updating,” Computer Networks 45, 2004; A. Misra, “Sensor-Based Diagnosis of Dynamical Systems,” PhD thesis, Vanderbilt University, 1994; and P. Zhao, A. Amini, M. A. Jafari, “A Failure/Fault Diagnoser Model for Autonomous Agents under Presence of Disturbances,” IEEE International Conference on Systems, Man, and Cybernetics, 2006.
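The incremental hypothesis-updating idea behind forward diagnosis can be sketched as a simple normalized belief update over competing fault hypotheses. The hypotheses and likelihood values below are assumed purely for illustration and are not taken from the cited work.

```python
def update_hypotheses(priors, likelihoods):
    """One incremental update: scale each hypothesis's belief by the
    likelihood of the new observation under it, then renormalize."""
    posterior = {h: p * likelihoods.get(h, 0.0) for h, p in priors.items()}
    total = sum(posterior.values())
    if total == 0:
        raise ValueError("observation inconsistent with every hypothesis")
    return {h: p / total for h, p in posterior.items()}

# Hypothetical fault hypotheses and assumed observation likelihoods.
beliefs = {"ok": 0.90, "sensor_drift": 0.07, "valve_leak": 0.03}
high_reading = {"ok": 0.05, "sensor_drift": 0.60, "valve_leak": 0.70}

beliefs = update_hypotheses(beliefs, high_reading)   # first high reading
beliefs = update_hypotheses(beliefs, high_reading)   # second high reading
best = max(beliefs, key=beliefs.get)                 # leading hypothesis
```

After repeated observations that are unlikely under normal operation, the belief mass shifts from the nominal hypothesis toward a fault hypothesis, at which point a conclusion can be drawn.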
In the last decade there has been considerable R&D work on establishing more formal approaches to safety, fault, and hazard analysis, but from a control perspective. One example is the work performed by an MIT group, described, for example, in N. Leveson, “A Systems-Theoretic Approach to Safety in Software-Intensive Systems,” IEEE Trans. on Dependable and Secure Computing (2005). The main thrust of that work is to minimize software faults and hazards at the design stage by formalizing software requirements, and by simulating and analyzing the design prior to implementation. In the STAMP (Systems-Theoretic Accident Model and Processes) methodology developed by that team, the cause of an accident, instead of being understood in terms of a series of failure events, is viewed as the result of a lack of constraints imposed on system design and operations. STAMP uses concrete models of the underlying processes. It is assumed that any controller, human or automated, must contain a model of the system being controlled. Whether the model is embedded in the control logic of an automated controller or in the mental model of a human controller, it must contain the same types of information: the required relationships among the system variables (the control laws), the current state (the current values of the system variables), and the ways the process can change state. Accidents, particularly system accidents, are then viewed as inconsistencies between the model of the process used by the controllers (both human and automated) and the actual process state. The methodology is now in use in some application areas, such as aerospace and aeronautics.
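The notion that a controller carries a model of the controlled process, and that an accident precursor is an inconsistency between that model and the actual process state, can be sketched in a few lines. The valve example and all class and state names here are illustrative inventions, not part of the STAMP tooling.

```python
class ProcessModel:
    """The controller's internal model of the controlled process: the
    current state plus the ways the process can change state.
    Illustrative example only."""
    def __init__(self, state):
        self.state = state
        self.transitions = {("closed", "open_valve"): "open",
                            ("open", "close_valve"): "closed"}

    def apply(self, command):
        # update the model's belief about the process after a command
        self.state = self.transitions.get((self.state, command), self.state)

def inconsistent(model, observed_state):
    """Flag the condition STAMP points at: the controller's process
    model disagreeing with the actual process state."""
    return model.state != observed_state

model = ProcessModel("closed")
model.apply("open_valve")          # controller believes the valve opened
# The actuator failed silently, so the physical valve is still closed:
hazard = inconsistent(model, "closed")
```

Here the hazard arises not from any single failure event, but from the controller acting on a model that no longer matches reality, which is precisely the framing described above.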
A similar paradigm has also been promoted by a different research community working on event-driven systems modeling, such as Petri nets, NCES, temporal logic, and other state/transition systems. Generally speaking, a single model of the controlled, closed-loop behavior of the system is used, containing both the underlying process and the control actions. Through the analysis and evaluation of that model, one can identify, a priori, the undesirable states potentially reachable by the system. For instance, using Petri net theory, one can mathematically establish such conditions as unboundedness, different levels of liveness, some deadlock conditions, etc. A comprehensive review of those methodologies is presented in L. E. Pinzon, H. M. Hanisch, M. A. Jafari and T. O. Boucher, “A Comparative Study of Synthesis Methods for Discrete Event Controllers,” Journal of Formal Methods in System Design, Vol. 15, No. 2, 1999.
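As a rough illustration of the kind of analysis such state/transition models support, the sketch below explores the reachable markings of a small Petri net, reporting dead markings and a heuristic unboundedness symptom. Real tools build coverability trees for a sound unboundedness test; the nets and the exploration bound here are invented for illustration.

```python
from collections import deque

def analyze(transitions, initial, bound=1000):
    """Breadth-first reachability over the markings of a small Petri net.
    `transitions` maps a name to a (consume, produce) pair of per-place
    token counts.  Returns (reachable markings, dead markings, possibly
    unbounded); visiting more than `bound` distinct markings is taken as
    an unboundedness symptom (a heuristic, not a proof)."""
    seen, deadlocks, queue = set(), set(), deque([initial])
    while queue:
        m = queue.popleft()
        if m in seen:
            continue
        seen.add(m)
        if len(seen) > bound:
            return seen, deadlocks, True
        enabled = False
        for consume, produce in transitions.values():
            if all(t >= c for t, c in zip(m, consume)):
                enabled = True
                queue.append(tuple(t - c + p
                                   for t, c, p in zip(m, consume, produce)))
        if not enabled:
            deadlocks.add(m)       # no transition enabled: dead marking
    return seen, deadlocks, False

# Invented example 1: a two-place net that dies in marking (0, 0).
_, deadlocks1, unbounded1 = analyze(
    {"t1": ((1, 0), (0, 1)), "t2": ((0, 1), (0, 0))}, (1, 0))

# Invented example 2: a transition minting tokens faster than it consumes.
_, _, unbounded2 = analyze({"t": ((1,), (2,))}, (1,), bound=20)
```

Such a priori exploration is exactly how the undesirable reachable states mentioned above are identified before the system is deployed.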
While each stand-alone agent (or system component or controller) is simpler to design and control, developing a safe and dependable distributed multi-agent system is far more challenging than developing its centralized counterpart, due to the distributed nature of information and control. For distributed fault analysis, one must deal with a number of major issues: a) Which agent should initiate the fault detection/diagnosis? b) What information should be exchanged, how should the exchange proceed, and which agents should be involved? c) How should the same fault be avoided in the future? d) What information must be kept for future avoidance, prevention, or recovery?
The extension of failure analysis to distributed systems has been pursued in the literature in two ways. In the first, centralized control is directly extended to distributed systems: each agent has its own local analysis engine, while a global coordinator communicates with the local engines when necessary. Each agent is assigned a set of observable events; it locally processes its own observations and generates its diagnostic information. Some of that information is then communicated to the global coordinator as needed. The type of information communicated is determined by the communication rules used by the agents. The task of the coordinator is to process the messages received from each agent, and to make proper inferences on the occurrence of failures, according to some prescribed decision rules. An example of the first approach is presented in R. Debouk, “Failure Diagnosis of Decentralized Discrete-Event Systems,” PhD thesis, University of Michigan, 2000.
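The coordinator-based scheme can be sketched as follows: each local diagnoser forwards only its symptom events (its communication rule), and the coordinator applies a prescribed decision rule, here, hypothetically, that symptoms from two distinct agents imply a failure. All agent names, events, and rules are illustrative.

```python
class LocalDiagnoser:
    """Local analysis engine: watches its own observable events and
    reports only symptom events to the coordinator (its communication
    rule).  Illustrative structure only."""
    def __init__(self, name, symptom_events):
        self.name = name
        self.symptom_events = set(symptom_events)

    def observe(self, event):
        # local processing: produce a message only for symptom events
        return (self.name, event) if event in self.symptom_events else None

class Coordinator:
    """Fuses agents' messages under a prescribed decision rule:
    declare a failure once symptoms arrive from two distinct agents."""
    def __init__(self):
        self.reports = set()

    def receive(self, message):
        if message:
            self.reports.add(message[0])
        return len(self.reports) >= 2      # failure inferred?

coord = Coordinator()
a1 = LocalDiagnoser("agent1", {"overcurrent"})
a2 = LocalDiagnoser("agent2", {"overheat"})

failure = False
for agent, event in [(a1, "tick"), (a1, "overcurrent"),
                     (a2, "tick"), (a2, "overheat")]:
    failure = coord.receive(agent.observe(event)) or failure
```

The design choice visible here is the one noted in the text: only the coordinator sees the fused picture, so its decision rule, not any single agent, determines when a failure is declared.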
In the second approach, each agent receives and processes information both locally and from its neighbors. That approach is more suitable for distributed systems, but it can hardly outperform the centralized one, since none of the agents can have more information than the coordinator in the first approach. Examples of that approach include Y. Pencole, “Decentralized diagnoser approach: Application to telecommunication network,” in Proc. of DX 2000, Eleventh International Workshop on Principles of Diagnosis, pp. 185-193, June 2000; and R. Sengupta, “Diagnosis and Communication in Distributed Systems,” in Proc. of WODES 1998, International Workshop on Discrete Event Systems, pp. 144-151, Published by IEE, London, England, August 1998.
Online, Run Time Fault Analysis
The above approaches, though proactive, assume that fault or undesirable conditions are determined in advance, either at the design or the testing stage. The problem is that, in practice, the balance between economics and safety or fault tolerance very often becomes important; hence, design and pre-programming almost certainly conclude without all the possible fault conditions having been included. It is arguably true that for a reasonably complex system, the problem of identifying all the fault conditions is undecidable. Furthermore, in many real-life applications, and especially in complex distributed systems, the occurrence of faults is often time-dependent, with their likelihood changing with the system's operational schedule. Faults may also occur due to inadequate coordination among different controllers or decision makers, including unexpected side effects of decisions or conflicting control actions. Communication flaws and invalid data could also play an important role in the occurrence of faults, and many of those circumstances are quite difficult to predict in advance.
A recent alternative to the above approach has been to address safety issues on an “online” basis, as faults occur. The majority of the work in that area has been in computing and computer networking. Examples include the concept of autonomic computing by IBM and the work done by researchers at the Berkeley/Stanford Recovery Oriented Computing (ROC) Laboratory. The common underlying assumption in that work is that there is no way to prevent obscure and unexpected faults from happening, and that not all such faults can be identified at the design stage. Autonomic computing refers to building computer hardware/software and networks that can adapt to changes in their environment, strive to improve their performance, heal when damaged, defend themselves against attackers, anticipate users' actions, etc. The work by the ROC lab centers on a similar concept, but for distributed systems: the system is made more adaptable to its environment, so that faults can be promptly detected, efficiently contained, and recovered from without requiring the whole system to shut down.
The online approaches improve system dependability by embedding a high level of intelligence into the system control. The potential non-determinism, however, in terms of response time and outcome, could arguably make those approaches unworkable for safety-critical applications. On the other hand, relying only on pre-programmed automation, with some level of fault tolerance determined at the design or testing stage, will not make systems immune to unexpected faults and hazards, as many real-life incidents have shown. Therefore, it is the inventors' belief that in order to maximize system safety and dependability, one must take a hybrid approach: i) pro-actively attempt to identify as many fault and undesirable conditions as possible at the design and system testing stages; and ii) embed some reasonable level of intelligence into the control system, so that when a fault occurs, it is effectively detected and appropriate remedies are applied at run time.
There is one more reason to justify the online approach. Suppose it is technologically possible to build a simulation of the normative and disruptive dynamic behavior of the underlying “plant” (to the extent known in advance). In the assumed distributed system, the plant model will be a network of individual agent models. Now suppose that the simulation includes an initial control design (call it a “first level of approximation”) working in closed loop with the simulated plant model. Assuming that the simulation runs for a sufficient amount of time, it should be possible to observe some unknown fault conditions. If the controller has embedded in it the intelligence proposed by the inventors herein, it should then be possible to detect, diagnose, and avoid the observed fault conditions. With every newly discovered major fault condition, the control system is upgraded to a higher level of approximation.
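The closed-loop simulation idea can be sketched as follows: a first-approximation controller runs against a simulated plant, and any state escaping the anticipated safe envelope is logged as a newly discovered fault condition. The dynamics, the envelope, and the injected disturbance are all invented for illustration; they stand in for the unknown disruptive behavior described above.

```python
def simulate(controller, plant_step, steps=1000, envelope=10.0):
    """Run a first-approximation controller in closed loop with a plant
    model and record every state outside the anticipated safe envelope:
    fault conditions discovered only by running the simulation."""
    state, discovered = 0.0, []
    for k in range(steps):
        action = controller(state)
        state = plant_step(state, action, k)
        if abs(state) > envelope:
            discovered.append((k, state))
            state = 0.0            # reset and continue searching for more
    return discovered

def naive_controller(x):
    return -0.5 * x                # the "first level of approximation"

def plant_step(x, u, k):
    # nominal dynamics plus a disruptive disturbance the designers did
    # not anticipate (injected at step 500 purely for illustration)
    return x + u + (12.0 if k == 500 else 0.0)

faults = simulate(naive_controller, plant_step)
# Each entry in `faults` would drive an upgrade of the control system
# to a higher level of approximation that detects and avoids it.
```

Running the loop long enough surfaces the unanticipated excursion, which is the event that triggers the controller upgrade described in the text.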
There is therefore presently a need to provide a highly effective framework for fault detection and fault avoidance in multi-controller industrial control systems. To the inventors' knowledge, no such technique is currently available.