1. Field of Invention
The present disclosure relates generally to the field of failure-avoiding computer systems. It relates more specifically to the sub-fields of failure analysis, fault identification, and failure avoidance.
2. Cross Reference to Issued Patents
The disclosure of the following U.S. patent is incorporated herein by reference:
(A) U.S. Pat. No. 5,522,036 issued May 28, 1996 to Benjamin V. Shapiro, and entitled, METHOD AND APPARATUS FOR THE AUTOMATIC ANALYSIS OF COMPUTER SOFTWARE.
3. Description of Related Art
In the art of computer systems, a failure event is one where a computer system produces a wrong result. By way of example, a computer system may be processing the personal records of a person who was born in the year 1980 and may be trying to determine what the age of that person will be in the year 2010. Such an age determination might be necessary in the example because the computer system is trying to amortize insurance premiums for the person. Because the computer system of our example is infected with a so-called ‘Y2K’ bug, the computer incorrectly determines that the age of the person in the year 2010 will be negative seventy instead of correctly determining that the person's age will be positive thirty.
The failure event in this example is the production of the −70 result for the person's age. The underlying cause of the failure result is known as a fault event. The fault event in our example might be a section of computer software that only considers the last two digits of a decimal representation of the year rather than considering more such digits.
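The arithmetic behind this example can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from any actual system; the sketch merely assumes that the fault consists of keeping only the last two decimal digits of each year before subtracting.

```python
def age_two_digit(birth_year: int, target_year: int) -> int:
    """Faulty version (the fault event): keeps only the last two digits of each year."""
    return (target_year % 100) - (birth_year % 100)


def age_full_year(birth_year: int, target_year: int) -> int:
    """Corrected version: uses all digits of each year."""
    return target_year - birth_year


# The failure event is the wrong output produced by the faulty version.
print(age_two_digit(1980, 2010))   # -70
print(age_full_year(1980, 2010))   # 30
```

Note that the faulty version returns a plausible result for many inputs (for example, 1950 and 1990), which is why the fault can remain hidden until a particular input is encountered.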
The above is just an example. There are many possible computer operations that may be characterized as a fault event where the latter eventually causes a failure event. Faults can be hardware-based or software-based. An example of a faulty piece of hardware is a Boolean logic circuit that produces an output signal having a small noise spike. Generally, this noise spike does not affect output signals of the computer system. However, if conditions are just right, (e.g., other noises add up with this small noise spike), the spike may cause a wrong output state to occur in one of the signals of the computer. The production of such a wrong output state is a failure. The above-described Y2K problem is an example of a software-based fault and consequential failure.
It is desirable to build computer systems that consistently output correct results. This generally means that each of the operational hardware modules and executing software modules needs to be free of faults.
In general, producing fault-free software is more difficult than producing fault-free hardware. Techniques are not available for proving that a given piece of computer software is totally fault-free. Software can be said to be fault-free only to the extent that it has been tested by a testing process that is itself fault-free. In real life applications, exhaustive testing is not feasible. Even a single, numerical input to a program may create a requirement for testing numerous possibilities in the range from minus infinity to plus infinity. If there are two such inputs, they may create a need for a two dimensional input testing space of infinite range. Three variables may call for a three dimensional input space, and so on. If one attempts to exhaustively run all the input combinations it will take so much time that the utility and need for the application program may be already gone.
In the mechanical arts, it is possible to make a mechanical system more reliable or robust by designing various components with more strength and/or material than is deemed necessary for the predicted, statistically-normal environment. For example, a mechanical bridge may be made stronger than necessary for its normal operation by designing it with more and/or thicker metal cables and more concrete. The added materials might help the bridge to sustain extraordinary circumstances such as unusually strong hurricanes, unusually powerful earthquakes, etc.
If there is a hidden fault within a mechanical structure, say for example that internal chemical corrosion creates an over-stressed point within one cable of a cable-supported bridge, the corresponding failure (e.g., snapped cable) will usually occur in close spatial and/or temporal proximity to the fault. The cause of the mechanical failure, namely the chemical corrosion inside the one cable, will be readily identifiable (in general). Once the fault mechanism is identified, the replacement cable and/or the next bridge design can be structured to avoid the fault and thereby provide a more reliable mechanical bridge.
Computer software failures are generally different from mechanical system failures in that the software failures do not obey the same simplified rules of proximity between the cause (the underlying fault) and effect (the failure). The erroneous output of a computer software process (the failure) does not necessarily have to appear close in either time or physical proximity to the underlying cause (fault).
A number of so-called fault-tolerant techniques exist in the conventional art. A first of these techniques applies only to hardware-based faults and may be referred to as ‘checkpoint re-processing’. Under this technique, a single piece of hardware moves forward from one operational state to the next. Every so often, at a checkpoint, the current state of the hardware is stored into a snapshot-retaining memory. In other words, a retrievable snapshot of the complete machine state is made. The machine then continues to operate. If a hardware failure is later encountered, the machine is returned to the state of its most recent checkpoint snapshot and then allowed to continue running from that point forward. If the hardware failure was due to random noise or an intermittent circuit fault, these faults will generally not be present the second time around and thus the computer hardware should be able to continue processing without encountering the same failure again. Of course, if the fault is within the software rather than the hardware, then re-running the same software will not avoid the fault, but rather will merely repeat the same fault and will typically manifest its consequential failure.
A second of the so-called fault-tolerant techniques may be referred to as ‘majority voting’. Here, an odd number of hardware circuits and/or software processes each processes the same input in parallel and produces a respective result. In the case of the software processes, it may be that different groups of programmers worked independently to encode solutions for a given task. Thus, each of the software programming groups may have come up with a completely different software algorithm for reaching what should be the same result if done correctly.
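The checkpoint re-processing technique just described may be sketched in software as follows. This is only an illustrative simulation under stated assumptions: the machine state is modeled as a dictionary, a failure is modeled as a RuntimeError, and the names run_with_checkpoints, FlakyStep and add_one are hypothetical.

```python
import copy


def run_with_checkpoints(steps, state, checkpoint_every=2, max_retries=3):
    """Run steps in order, snapshotting state periodically and rolling back on failure."""
    saved_state, saved_index = copy.deepcopy(state), 0
    retries = 0
    i = 0
    while i < len(steps):
        # Take a new snapshot each time a checkpoint is reached for the first time.
        if i % checkpoint_every == 0 and i > saved_index:
            saved_state, saved_index = copy.deepcopy(state), i
            retries = 0
        try:
            steps[i](state)
            i += 1
        except RuntimeError:
            retries += 1
            if retries > max_retries:
                raise  # a repeatable fault is not cured by re-running
            # Return to the most recent snapshot and continue from there.
            state = copy.deepcopy(saved_state)
            i = saved_index
    return state


class FlakyStep:
    """Models an intermittent fault: fails on its first call only."""

    def __init__(self):
        self.calls = 0

    def __call__(self, state):
        self.calls += 1
        if self.calls == 1:
            raise RuntimeError("transient noise")
        state["total"] += 10


def add_one(state):
    state["total"] += 1


flaky = FlakyStep()
result = run_with_checkpoints([add_one, add_one, flaky, add_one], {"total": 0})
print(result)  # {'total': 13}
```

A step that fails on every call, as a deterministic software fault would, exhausts max_retries and the exception propagates, which mirrors the limitation noted above.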
When the different hardware and/or software processes complete their operations, their results are compared. If the results are different, then a vote is taken and either the majority or greatest plurality with a same result is used as the valid result. This, however, does not guarantee that the correct result is picked. It could be that the majority or winning plurality is wrong, despite their numerical supremacy. The voting process itself may be the underlying cause for a later-manifested failure. This is an example showing that adding more software (e.g., coding and executing different versions of software) to software does not necessarily lead to more reliable and fault-free operation.
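The weakness of voting can be shown with a short sketch. The three versions below are hypothetical stand-ins for independently written implementations of the same age calculation; two of them are assumed to share the two-digit-year fault even though they are coded differently.

```python
from collections import Counter


def majority_vote(results):
    """Return the most common result and the number of versions that produced it."""
    value, count = Counter(results).most_common(1)[0]
    return value, count


def version_a(birth, target):
    # Faulty: arithmetic on the last two digits only.
    return (target % 100) - (birth % 100)


def version_b(birth, target):
    # Faulty in the same way, but written with string slicing.
    return int(str(target)[-2:]) - int(str(birth)[-2:])


def version_c(birth, target):
    # Correct: uses the full year.
    return target - birth


results = [v(1980, 2010) for v in (version_a, version_b, version_c)]
winner, votes = majority_vote(results)
print(results)        # [-70, -70, 30]
print(winner, votes)  # -70 2
```

Here the majority outvotes the one correct version, so the voted result is itself a failure.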
Software systems are often asked to operate in input space which has not been previously encountered. A crude analogy is that of an automated spaceship moving forward in space towards uncharted regions. The spaceship encounters a new situation that was not previously anticipated and tested for. The question is then raised, are we going to return the spaceship to Earth to reprogram it? And if so, what are we going to reprogram it to deal with? We have not allowed it to operate into the unknown future yet and thus we have not yet experienced the future set of inputs with which we want to deal. It is only by actually going forward that we can observe and analyze the spaceship's behavior or the behavior of the ship's software systems. But are we going to risk malfunctioning of the ship's software systems or the destruction of the ship?
In view of the above, it is seen that significant problems exist in the software arts. There is a need for computer structures, systems and methods which can better avoid failures during execution.
A computer system in accordance with the invention includes an Advanced Software Processor (ASPr) and a Trailing Software Processor (TSPr). The ASPr is allowed to move forward along a stream of process events ahead of the TSPr. Process events can include an execution of either a statement in a source code file, or an execution of an opcode (assembly-level statement) within an object code file, or an execution of a SUM-Object code segment as the latter is defined in the above-referenced U.S. Pat. No. 5,522,036.
In accordance with the invention, a so-called “Target Software Process” (TSP) is replicated within a computer system to define an “Advanced Software Process” (ASP). The ASP generally executes ahead of the TSP on a common stream of process events. The TSP is permitted to continue its executions, while trailing behind the ASP by a safe distance. As long as the ASP does not encounter a failure event, the TSP is permitted to continue moving forward as well.
In one embodiment, each time the ASP passes through a predefined one of plural filters, the ASP signals that a failure has not yet been encountered. In response, the state of a previous safety-stoppoint is flipped from that instructing the TSP (Trailing Software Process) to stop to one that permits the TSP to proceed through. The TSP thereby moves forward from behind one safety-stoppoint to a next with the confidence that the ASP has already passed through to a future filter without experiencing a failure.
One or more failure-recognizing filters are provided and coupled to a corresponding one or more outputs of the ASP for recognizing failure events of the ASP. If a failure is recognized to have occurred in the ASP, then the TSP is preferably instructed to immediately pause its operations. Alternatively or additionally, permission is withheld from the trailing TSP to proceed forward through a next of its safety-stoppoints.
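The interplay of the ASP, the failure-recognizing filters, and the safety-stoppoints of the TSP may be pictured with the following single-threaded simulation. It is a simplified sketch and not the disclosed implementation: the event stream is a list, a filter is assumed to sit at the end of every fixed-length segment, and the names run_with_scout, process and is_failure are hypothetical.

```python
def run_with_scout(events, process, is_failure, segment_len):
    """Let a scout (ASP) run each segment first; release it to the trailer (TSP) only if clean.

    Returns (released_outputs, failed_segment_start). The second value is None
    when no failure was recognized.
    """
    released = []  # outputs the TSP was permitted to produce
    for start in range(0, len(events), segment_len):
        segment = events[start:start + segment_len]
        # ASP executes ahead of the TSP on the same events.
        scout_results = [process(e) for e in segment]
        # Filter at the end of the segment inspects the ASP outputs.
        if any(is_failure(r) for r in scout_results):
            return released, start  # stoppoint stays closed; TSP is held
        # No failure recognized: the stoppoint opens and the TSP proceeds.
        released.extend(process(e) for e in segment)
    return released, None


def two_digit_age(event):
    """Process under test, assumed to contain the two-digit-year fault."""
    birth, target = event
    return (target % 100) - (birth % 100)


events = [(1950, 1990), (1960, 1995), (1980, 2010), (1970, 1999)]
released, failed_at = run_with_scout(
    events, two_digit_age, lambda age: age < 0, segment_len=2
)
print(released)   # [40, 35]
print(failed_at)  # 2
```

In this run the negative age is seen only by the scout; the trailing process never emits it because its stoppoint before the second segment is never opened.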
In response to the recognition of a failure event within the ASP, a knowledge-deduction (KD) process is initiated for identifying the probable point of fault within the past processing of the ASP which led to the manifestation of the recognized failure. The knowledge-deduction process may be carried out as described in the above-referenced U.S. Pat. No. 5,522,036.
If the knowledge-deduction process locates an area of correctness-uncertainty (a possible fault), then the ASP is returned to a previously-saved, checkpoint state that occurs prior to the possible-fault event or alternatively to a point of process origin.
If the identified fault is of a type which is known, and a predefined solution exists for this type of fault, then the predefined solution is applied to the ASP. The ASP is then allowed to proceed forward from the point it was returned to. In the meantime, the TSP (Trailing Software Process) should be stopped until the ASP (Advanced Software Process) succeeds in moving through the process, this time without detection of a failure.
If a predefined solution to the identified fault is either not known or there is an uncertainty about a proposed solution, then trial and error may be performed where the ASP proceeds through applied test solutions one or more times until failure is no longer encountered.
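The trial-and-error approach may be sketched as a loop that replays the same inputs from the returned-to point under each candidate solution until the failure filter no longer fires. The candidate list, the replay inputs, and the function names below are assumptions made for illustration only.

```python
def first_passing_fix(candidates, replay_inputs, is_failure):
    """Replay the inputs under each candidate; return the first that shows no failure."""
    for name, candidate in candidates:
        outputs = [candidate(*args) for args in replay_inputs]
        if not any(is_failure(o) for o in outputs):
            return name, outputs
    return None, None  # no candidate removed the failure


def two_digit_age(birth, target):
    # Original behavior, retained as the first candidate for comparison.
    return (target % 100) - (birth % 100)


def four_digit_age(birth, target):
    # Test solution: consider all digits of the year.
    return target - birth


candidates = [("two_digit", two_digit_age), ("four_digit", four_digit_age)]
replay_inputs = [(1960, 1995), (1980, 2010)]
chosen, outputs = first_passing_fix(candidates, replay_inputs, lambda age: age < 0)
print(chosen)   # four_digit
print(outputs)  # [35, 30]
```

Passing the filter shows only that the recognized failure no longer appears on the replayed inputs; it does not prove that the chosen candidate is free of other faults.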
Thus the ASP acts as an advanced scout which experiences the future, and its possible failures, while generally protecting the TSP (Trailing Software Process) from experiencing the same failures. Output devices are attached to the TSP and thus do not exhibit failures caught by the ASP.
Other aspects of the invention will become apparent from the below detailed description.