This invention relates to system analysis methods regarding verification and validation procedures, and more particularly to a system analysis method which has the ability to detect latent hardware and software design defects that could cause unanticipated critical failures of digitally controlled systems.
The following references provide an overview of the precedent software fault-tree analysis and hardware fault-tree analysis technologies which are pertinent to the present system procedures as will be described.
Harvey, Peter Randall. "Fault-tree Analysis of Software". University of California, Thesis 1982.
Leveson, Nancy G., and Janice L. Stolzy. "Safety Analysis of Ada Programs Using Fault Trees." IEEE Transactions on Reliability vol. R-32, No. 5 (December 1983), 470-579.
McIntee, Capt. James W. "Soft Tree: Fault-tree Technique as Applied to Software." Air Force Systems Command, Eglin Air Force Base, Fla. October 1983.
Taylor, J. R. "Status of Software Safety Verification and Validation Techniques." Lafayette, Indiana, U.S.A. Notes prepared for SAFECOMP 82, October 1982.
Taylor, J. R. "An Algorithm For Fault-Tree Construction." IEEE Transactions on Reliability vol. R-31, No. 2, (June 1982), 137-146.
Taylor, J. R. "Fault Tree and Cause Consequence Analysis for Control Software Validation." Roskilde, Denmark: Riso National Laboratory, January 1982.
Taylor, J. R. "Logical Validation of Safety and Control Systems Specifications Against Plant Models." Roskilde, Denmark: Riso National Laboratory, May 1981.
As one can ascertain and as will be further described, there have been widespread investigations of system analysis techniques.
As one can also ascertain, the prior art is replete with examples of digitally controlled systems. Such systems extensively employ microprocessors, microcomputers, minicomputers and so on, which may be distributed throughout the system and which contain extensive programs that control system operation. Examples of such systems are telephone switching systems, as well as many systems utilized by the military, such as digital automatic flight control systems, and large scale digital communication systems in general.
As one can ascertain, system reliability is a major risk area in the development of highly integrated and complex electronic systems such as the above-noted systems, as well as many systems which are presently employed or will be employed in defense electronics. The problems inherent in the design and utilization of such systems can be best understood if they are segmented into system specification and software reliability, hardware reliability, and the interaction of hardware and software failures.
Errors in system specification and software are major current impediments to the achievement of system reliability. Latent specification and software failures present a serious risk to the successful development of such advanced and complicated systems. Many people who are familiar with such systems believe the problem to be so severe that it rules out the practicality of large complex software driven systems. According to this view, such systems as digital automatic flight control systems, digital telephone exchanges and digital communication systems will inevitably be released with innumerable software problems which will escape detection during test and integration. These problems will only manifest themselves during system operation and only when a particular aspect of the system is accessed. These possible critical errors in system specification or software coding cannot be detected until the particular threat scenarios which exercise them are encountered. Thus, as one can ascertain, such software problems or "bugs" can result in catastrophic system failure.
This problem applies not only to future generation advanced systems but also to present systems. It is clear that software reliability is a major risk item with virtually all complicated digital system acquisitions. Over the past years, several important techniques such as structured programming, top-down design, and self-documenting code have been developed to improve software reliability. These improvements, however, have not kept pace with the complexity of even current system designs.
It is clear that, even when all of the currently available software design verification and validation methods are utilized, systems are released to customers containing numerous software specification and coding errors. It is therefore quite common for the test and debugging phase of the system to consume more time and resources than the actual initial development and design.
A related concern, which will be discussed further, is that none of the current reliability analysis techniques can determine when the debugging phase is truly over. That is, with present reliability analysis techniques one cannot be assured that no undetected mission critical specification or coding errors remain in the design. The fundamental difficulty with software reliability is that it is not possible to prove the correctness of most useful real time application software. There are simply too many possible permutations and combinations of input conditions to allow one to trace the results of each of the many sets of conditions through the software to determine system operation. In this respect, a simple guidance and control or electronic warfare system has at least 10^18 possible paths through its software routines. Even if each test could be generated, run and evaluated in a microsecond, it would take about 30,000 years to test every conceivable path. There is no assurance that an untested software route, either alone or in combination with a hardware failure, will not result in a critical system failure.
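The exhaustive-testing estimate above can be checked with a few lines of arithmetic. The following is only a rough sketch: the path count and the one-microsecond test time are the figures quoted in the text, while the year length (a 365.25-day year) and the function name `years_to_test_all` are assumptions introduced for illustration.

```python
# Back-of-the-envelope check of the exhaustive-testing estimate.
PATHS = 10**18                          # path count quoted in the text
SECONDS_PER_TEST = 1e-6                 # one microsecond per test, as in the text
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: 365.25-day year


def years_to_test_all(paths=PATHS, seconds_per_test=SECONDS_PER_TEST):
    """Return the years needed to run one test per path, back to back."""
    return paths * seconds_per_test / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{years_to_test_all():,.0f} years")
```

Under these assumptions the result is on the order of 30,000 years, which agrees with the figure given above.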
Although very small elementary programs can be mathematically proven, the proof of any realistic system application software is theoretically and practically infeasible. Since the correctness of realistic system software cannot be proven, the only alternative has been to test software in an effort to discover latent errors. While the state of the art in static and dynamic program testing can discover many software errors, only a tiny fraction of the possible sets of input conditions can be tested, so these techniques give no assurance that a considerable number of "bugs" does not remain.
Attempts have been made in the prior art to utilize the statistical approach to develop various predictive models for software error which produce a software error "MTBF" (Mean Time Between Failure). However, due to the very great differences between the causes and the mechanisms of hardware and software failures, these attempts to carry over concepts developed for predicting hardware failure rates to the software domain are necessarily highly contentious. None of the statistical models which have been proposed for prediction of software failure rates has demonstrated general applicability.
In addition, these statistical models only attempt to derive the probability that no more than a set number of errors remain in the design. They give no assurance as to the criticality of these residual errors. Essentially, as one can ascertain, the fundamental problem with current software verification and validation (V & V) techniques is that all possible inputs cannot be tested for all possible paths. Unfortunately many current software verification techniques focus on the detection of coding errors in small program modules. There is considerable evidence however that the majority of software "bugs" are not coding errors but specification errors to which these testing techniques are completely insensitive.
This fact also impacts the value of N-version programming as a software reliability tool since it increases the likelihood of common mode failures. As systems become more highly integrated, it is the specification type of software errors which will increasingly predominate. In regard to the above and in the area of hardware reliability, fault tolerance achieved through system reconfiguration upon real time detection of failure is a promising technical approach. One difficulty with system reconfiguration is that there must be an appropriate selection of which hardware resources will be protected from failure. It is generally not possible, desirable, or cost feasible to implement sufficient backup sparing so that the effects of every possible hardware failure can be masked through system reconfiguration.
The most obvious candidates for protection might not be the ones that best ensure survivability. A great deal of effort and expense might be expended in application of subassemblies or other major system resources only to have the potential for critical failure remain. This could easily result from a simple but critical element required for successful reconfiguration. For example, a power supply, a switching element or an interconnecting bus may be overlooked. If such structures are overlooked and not protected from a single point failure, they can result in complete system failure.
Alternatively, a transient failure of what was assumed to be a noncritical component could cause the software to exercise a completely unanticipated path which results in a catastrophic system output. Clearly, some methodology is required that will exhaustively search for all of these possibilities and ensure that the mission critical functions are protected by the system reconfiguration scheme.
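The kind of exhaustive search called for above can be illustrated on a toy model. The sketch below is not the method of the invention; it is a brute-force enumeration of minimal cut sets over a hypothetical five-component system. The component names, the structure function `system_works`, and the helper `minimal_cut_sets` are all assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical system: a primary processor, a spare processor reached
# through a switching element, and a shared power supply and bus.
COMPONENTS = ["cpu_a", "cpu_b", "switch", "power", "bus"]


def system_works(up):
    """Assumed structure function: True if the system still operates."""
    processing = up["cpu_a"] or (up["cpu_b"] and up["switch"])
    return processing and up["power"] and up["bus"]


def minimal_cut_sets(works, components, max_order=2):
    """Enumerate minimal sets of failed components that bring the system down."""
    cuts = []
    for order in range(1, max_order + 1):
        for failed in combinations(components, order):
            if any(set(c) <= set(failed) for c in cuts):
                continue  # contains a smaller cut set, so it is not minimal
            state = {name: name not in failed for name in components}
            if not works(state):
                cuts.append(failed)
    return cuts


if __name__ == "__main__":
    for cut in minimal_cut_sets(system_works, COMPONENTS):
        print(cut)
```

In this toy model the first-order cut sets are the power supply and the bus, that is, exactly the kind of overlooked single point failures described above, even though both processors are protected by sparing. The enumeration grows combinatorially with the number of components and the failure order, so it is practical only for very small models.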
Finally, an additional important difficulty with currently employed techniques for determining software and hardware reliability is that the system software and hardware are analyzed independently of each other. During a hardware reliability analysis (e.g., FMECA, Failure Mode, Effects and Criticality Analysis), it is falsely assumed that all software will function perfectly. During software verification and validation, it is falsely assumed that the software will be executed in a hardware environment which is flawless. In actuality, a hardware failure can cause totally unanticipated inputs to the software which might have disastrous consequences.
Conversely, a software error in the BIT (Built-In-Test) or FDI (Fault Detection and Isolation) system or an error copied into all versions of redundant hardware can make system reconfiguration useless. The false disassociation of software and hardware reliability analysis which currently prevails prevents the optimization of system reliability in those functions where failure could have the most critical effect. A much more accurate and fruitful approach would entail simultaneous analysis of the combined hardware and software operation of the system. Hence, as one can ascertain and in summation, current design and reliability analysis techniques permit the deployment of systems with undetermined potential for critical failure.
This is particularly the case with regard to specification and software reliability. The intrinsic limitations of current techniques which result in the above-described situations can be summarized as follows: first, the inability to cope with the combinatorics of system failure potential in software driven systems; second, the extreme difficulty in analyzing the effects of multiple failures; third, the inability to analyze hardware and software failure interactions; and fourth, the inability to adequately determine system reconfiguration coverage requirements.
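The limitation concerning hardware and software failure interactions can be made concrete with a deliberately small example. The sketch below is an assumption-laden illustration, not the analysis method of the invention: it models one hypothetical sensor and one hypothetical BIT routine, and the names `critical_failure` and `failing_states` are introduced here only for the example.

```python
from itertools import product


def critical_failure(sensor_ok, bit_ok):
    """Assumed top event: the control law consumes a bad sensor value.

    A faulty sensor is harmless if the BIT software detects it and the
    system reconfigures to a backup; a BIT defect is harmless while the
    sensor is healthy.
    """
    if sensor_ok:
        return False
    return not bit_ok


def failing_states(sensor_states, bit_states):
    """List every (sensor_ok, bit_ok) combination that causes the top event."""
    return [
        (s, b)
        for s, b in product(sensor_states, bit_states)
        if critical_failure(s, b)
    ]


# Hardware-only analysis: software is assumed to function perfectly.
hardware_only = failing_states([True, False], [True])
# Software-only analysis: hardware is assumed to be flawless.
software_only = failing_states([True], [True, False])
# Combined analysis: both may fail.
combined = failing_states([True, False], [True, False])

if __name__ == "__main__":
    print(hardware_only, software_only, combined)
```

In this model each separate analysis reports no critical failure, while the combined analysis exposes the one state in which a sensor fault coincides with a latent BIT defect.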
It is therefore an object of the present invention to employ a system analysis method which method has the unique ability to detect latent hardware and software design defects that could otherwise cause the unanticipated critical failure of a digitally controlled system.
It is a further object to provide system integrated fault-tree analysis (SIFTAN) methods which operate to modify and integrate existing system analysis techniques such as hardware fault-tree analysis (HFTA) and software fault-tree analysis (SFTA) to provide an integrated technique for verifying complicated digital system operation prior to releasing the system.