Reliable embedded computing systems are vital to every sector of our economy and daily lives. They are frequently relied upon in mission-critical applications where safety of human life and material assets are at risk. Meanwhile, the worldwide market for programmable logic devices is about $3.5 billion and is forecasted to grow to approximately $4.5 billion in 2006 and $5.2 billion in 2007. Furthermore, field programmable gate arrays have also been displacing microcontrollers via processor softcores that configure only the necessary arithmetic/logic functional units within an embedded application.
Autonomous repair of field programmable gate arrays (FPGAs) is of particular interest in aerospace applications for both in-flight and Ground Support Equipment devices. SRAM-based FPGAs are of significant importance due to their high density and increasing use in mission-critical/safety-impacting applications. Meanwhile, they offer unlimited reprogrammability that can enable autonomous repair.
For in-flight applications, FPGA devices encounter harsh environments of mechanical and acoustical stress during launch, high doses of ionizing radiation, and thermal stress. Simultaneously, they are required to operate reliably for long mission durations with limited or absent capabilities for diagnosis/replacement and little onboard capacity for spares. Hence, recent research has focused on employing the reconfigurability inherent in field programmable devices to increase reliability and autonomy as described in D. Keymeulen, A. Stoica, and R. Zebulum, “Fault-Tolerant Evolvable Hardware using Field Programmable Transistor Arrays,” IEEE Transactions on Reliability, (September 2000) Vol. 49, No. 3; S. Vigander, Evolutionary Fault Repair of Electronics in Space Applications, Dissertation, Norwegian University Sci. Tech., (Feb. 28, 2001), Trondheim, Norway; M. Abramovici, J. M. Emmert, and C. E. Stroud, “Roving STARs: An Integrated Approach To On-Line Testing, Diagnosis, and Fault Tolerance For FPGAs in Adaptive Computing Systems,” NASA/DoD Workshop on Evolvable Hardware, (2001); J. D. Lohn, G. Larchev, and R. F. DeMara, “A Genetic Representation for Evolutionary Fault Recovery in Virtex FPGAs,” In Proceedings of the 5th International Conference on Evolvable Systems (ICES), Trondheim, Norway, March 17-20, 2003; and J. D. Lohn, G. Larchev, and R. F. DeMara, “Evolutionary Fault Recovery in a Virtex FPGA Using a Representation That Incorporates Routing,” In Proceedings of 17th International Parallel and Distributed Processing Symposium, Nice, France, Apr. 22-26, 2003.
Ideally, recovery would be performed with the faulty device remaining online whenever possible, but few approaches attempt this. Using Roving Self-Test Areas (STARs), the testing and diagnostic process takes place in the FPGA without disturbing normal system operation. The entire chip is tested by roving the STARs across the FPGA. The STARs multi-level fault-tolerant technique allows partially defective logic and routing resources to be used for normal operation, providing longer mission life in the presence of faults. In addition, the dynamic fault-tolerant method ensures that spare resources are always present in the neighborhood of the located fault, thus simplifying fault-bypassing. However, effective use of STARs requires spare resources to be available as substitutes when faults are detected. A further problem encountered with STARs is that the quality of recovery is restricted by a fixed routing scheme that cannot adapt, and the detection latency for faults can be large.
Vigander's and Lohn's methods exhibit a likelihood of recovery that is related to the FPGA's design complexity. In other words, they attempt to design an original repair where only a single failed configuration is available for adaptation. While the quality of recovery under evolutionary approaches cannot be guaranteed, devices under static redundancy approaches such as Lach's are either completely recovered or completely beyond recovery.
Evolutionary mechanisms can actively restore mission-critical functionality in SRAM-based reprogrammable devices. They provide an alternative to device redundancy for dealing with permanent degradation due to radiation-induced stuck-at-faults, thermal fatigue, oxide breakdown, electromigration, and other local permanent damage. Potential benefits include recovery without the increased weight and size normally associated with spares. Also, failures need not be precisely diagnosed due to intrinsic evaluation of the FPGA's residual functionality through assessment of the Genetic Algorithm (GA) fitness function.
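The idea that a fitness function can assess residual functionality without locating the fault can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the cited work: the device model (two 3-input lookup tables plus a select bit), the target function (a 3-input majority), and the names target, device_output, and fitness are all hypothetical. Fitness is simply the number of input rows for which the possibly damaged device still matches the target.

```python
from itertools import product

N_INPUTS = 3
ROWS = list(product((0, 1), repeat=N_INPUTS))  # all 8 input combinations


def target(a, b, c):
    """Hypothetical mission function: 3-input majority."""
    return int(a + b + c >= 2)


def device_output(config, inputs, stuck_at_zero=None):
    """Evaluate a hypothetical device made of two 3-input LUTs and a select bit.

    config = (select, lut0, lut1), where each LUT is a tuple of 8 bits.
    stuck_at_zero = (lut_index, row_index) models one damaged LUT cell
    that always reads 0, regardless of the configured value.
    """
    select, lut0, lut1 = config
    luts = [list(lut0), list(lut1)]
    if stuck_at_zero is not None:
        lut_index, row_index = stuck_at_zero
        luts[lut_index][row_index] = 0
    row = inputs[0] * 4 + inputs[1] * 2 + inputs[2]
    return luts[select][row]


def fitness(config, stuck_at_zero=None):
    """Count the input rows on which the device matches the target function."""
    return sum(device_output(config, r, stuck_at_zero) == target(*r) for r in ROWS)


MAJ = tuple(target(*r) for r in ROWS)
fault = (0, 7)                       # cell 7 of LUT 0 is stuck at 0
uses_lut0 = (0, MAJ, MAJ)            # configuration that routes through the damaged LUT
uses_lut1 = (1, MAJ, MAJ)            # configuration that routes through the healthy LUT
print(fitness(uses_lut0, fault), fitness(uses_lut1, fault))
```

In this sketch a search procedure would only need to compare the two fitness scores to prefer the second configuration; it never needs to be told which cell is damaged.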
Other prior art made of record includes U.S. Patent Publication No. 2005/0154552, published on Jul. 14, 2005, which discloses an emulation system for testing FPGAs that includes testing during the manufacturing process with roving self-test areas in different configurations. A major limitation is that the device does not include self-testing and reconfiguration in real time during normal operation of the device.
U.S. Patent Publication No. 2005/0071716 published Mar. 31, 2005, describes methods and system for verifying functionality of the logic elements and the reconfigurable interconnections prior to operational use of the device.
U.S. Pat. No. 6,874,108 issued on Mar. 29, 2005, describes a method of fault tolerant operation for an FPGA. The system tests an area of the device, identifies a fault and reconfigures the FPGA, including estimating signal path delays and adjusting clock period or speed if required.
U.S. Pat. No. 6,839,873 issued on Jan. 4, 2005, describes a PLD having a built-in test function for testing the PLD, which either passes or fails. There is no redundant circuit for self-repair or reconfiguration, and the testing function is limited to manufacturing and start-up to confirm device integrity. There is no provision for self-test during normal operation.
U.S. Pat. No. 6,718,496 issued on Apr. 6, 2004, describes a semiconductor device having an internal circuit to test, a redundant circuit for repairing the internal circuit and a test and switching for conducting the test and making the repair. The system also tests and changes the operating parameters, including timing and input voltage, to test the semiconductor under different operating conditions. Testing is only performed during start-up operation and thus does not provide fault detection and reconfiguration during online operation.
U.S. Pat. No. 6,668,237 issued on Dec. 23, 2003, describes a testing system for testing PLDs. The test system, including hardware and software, is external to the device. The device is reconfigurable. Test and reconfiguration are not real-time and are not performed during normal operation of the device.
U.S. Pat. No. 6,550,030 issued on Apr. 15, 2003 and U.S. Pat. No. 6,530,049 issued on Mar. 4, 2003, describe an external self-test for FPLAs. The testing controller and memory are external. The test circuit configures a test area of the FPLA, then reconfigures the FPLA for testing a next area until the entire FPLA is tested. The prior art fails to provide apparatus, methods, systems or devices for autonomous fault handling and self repair of programmable logic devices in-situ.
As for the actual error detection circuitry, current approaches to fault-tolerant error detection can be broadly classified into coding-based approaches and redundancy-based approaches. Concurrent Error Detection (CED) schemes are a general class of fault-tolerant schemes that fall into either or both of these categories. In general, most CED schemes operate by comparing some special characteristic of the actual output that a system produces for a given input against a predicted value of that characteristic, computed independently for the same input. The special characteristic in question could be the parity of the output, a count of the 1's or 0's in the output, or conformance to pre-specified codes, such as the Berger code or a specific m-out-of-n code, as disclosed in Mitra, S. and E. J. McCluskey, “Which Concurrent Error Detection Scheme To Choose?,” Proc. International Test Conf., (2000) pp. 985-994. The comparison between the predicted output and the observed output is carried out using a comparator element, which is a hardware detector or error checker.
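The CED operating principle can be sketched in software using output parity as the special characteristic. This is a minimal illustration under stated assumptions and does not reproduce any scheme from the cited reference: the function under check (a 4-bit XOR with a constant MASK), the stuck-at-0 fault model, and all names are hypothetical. The predictor derives the expected parity directly from the input, and the checker compares it with the parity of the observed output.

```python
from typing import Optional

MASK = 0b1011  # hypothetical constant used by the example function


def parity(word: int) -> int:
    """Return 1 if the word contains an odd number of 1 bits, else 0."""
    return bin(word).count("1") % 2


def function_under_check(x: int, stuck_bit: Optional[int] = None) -> int:
    """Example 4-bit function: XOR the input with MASK.

    If stuck_bit is given, that output bit is forced to 0 to model a
    stuck-at-0 fault in the circuit.
    """
    y = (x ^ MASK) & 0xF
    if stuck_bit is not None:
        y &= ~(1 << stuck_bit)
    return y


def predicted_parity(x: int) -> int:
    """Predict the output parity from the input alone.

    For an XOR with a constant, parity(x ^ MASK) == parity(x) ^ parity(MASK).
    """
    return parity(x & 0xF) ^ parity(MASK)


def ced_error(x: int, stuck_bit: Optional[int] = None) -> bool:
    """Comparator: True when observed and predicted parity disagree."""
    return parity(function_under_check(x, stuck_bit)) != predicted_parity(x)
```

Under this model, ced_error(0b0000, stuck_bit=0) reports an error, while ced_error(0b0001, stuck_bit=0) does not, because that input already drives the faulty bit to 0 and the fault is not activated.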
The redundancy in CED schemes can be spatial, where the hardware circuit-under-test is duplicated, or temporal, where the outputs may be buffered for future comparison. The comparators used in CED schemes vary in architecture depending on requirements. For example, the hardware comparators may be capable of bitwise output comparison, such as the two-rail checker described in E. J. McCluskey, “Design Techniques for Testable Embedded Error Checkers”, IEEE Computer, July 1990.
Equality checkers or matchers compare two input words to determine whether corresponding bits from the words have the same value. The equality checker consists of a series of XOR gates, each of whose outputs should be 0 when its two inputs, one from each of the words being compared, are equal. The outputs of the XOR gates are the inputs to an OR gate, whose output should be zero as long as none of its inputs is a one. McCluskey shows that these circuits need complemented inputs and a two-rail checker to be verifiable as a fully self-testing system. A two-rail checker is a circuit that checks that each pair of inputs has complementary values, and it is used to convert n pairs of signals into one pair of signals that are complementary if and only if all of the n input pairs are complementary. The two-rail checker can be made testable with the addition of a test signal and two XOR gates.
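Both checker types can be modeled at the gate level in a few lines. The sketch below is illustrative only: the two-rail cell uses the commonly taught equations z0 = a0·b0 + a1·b1 and z1 = a0·b1 + a1·b0, which are an assumption here and are not quoted from the McCluskey reference, and the function names are hypothetical. The added test signal and the two extra XOR gates are not modeled.

```python
from functools import reduce
from itertools import product


def equality_checker(a_bits, b_bits):
    """XOR corresponding bits, then OR the results; 0 means the words match."""
    xor_outputs = [a ^ b for a, b in zip(a_bits, b_bits)]
    return int(any(xor_outputs))


def two_rail_cell(p, q):
    """Combine two signal pairs into one pair.

    The output pair is complementary if and only if both input pairs are.
    """
    (a0, a1), (b0, b1) = p, q
    z0 = (a0 & b0) | (a1 & b1)
    z1 = (a0 & b1) | (a1 & b0)
    return (z0, z1)


def two_rail_checker(pairs):
    """Reduce n signal pairs to a single pair by chaining two-rail cells."""
    return reduce(two_rail_cell, pairs)


def is_complementary(pair):
    """True when the two rails of a pair carry opposite values."""
    return pair[0] != pair[1]
```

For example, two_rail_checker([(0, 1), (1, 0), (1, 1)]) yields a non-complementary pair because the third input pair is not complementary.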
The prior art checkers described are essentially only testable error checkers. In other words, they can be tested and verified to be error-free before being utilized in a larger system. These self-testing checkers, or more precisely, testable error checkers, provide no guarantee of tolerating a fault that affects the checker if the fault occurs after the system is put into service. Another problem is that these checkers essentially check for the presence of invalid code words in the output of the circuit-under-test for at least one valid code-word input to the checker; they do not address the case where a component in the circuit-under-test may fail completely after the system is placed in service. Subsequent work on CED schemes relies upon comparators that have been designed with this philosophy, in which self-testing checkers are defined as checkers that are testable using the checker itself and some test input comprising non-code words.
Triple Modular Redundancy (TMR) systems provide fault-tolerance capability by utilizing three functional replicas and a voter that chooses the majority output. The majority output is propagated in the hope that faults, if any, do not affect more than one of the three functional modules. This works satisfactorily in the case of a single-fault assumption, unless the fault induces a Common Mode Failure.
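The masking behavior of a TMR voter, and its limit under a common-mode failure, can be shown with a small sketch. The majority expression (a AND b) OR (b AND c) OR (a AND c) is the standard bitwise form; the function names and the example words are hypothetical.

```python
def bitwise_majority(a: int, b: int, c: int) -> int:
    """Return, for each bit position, the value held by at least two inputs."""
    return (a & b) | (b & c) | (a & c)


def tmr(module_outputs):
    """Vote on the outputs of three replicated modules."""
    a, b, c = module_outputs
    return bitwise_majority(a, b, c)


correct = 0b1010

# Single fault: one module flips a bit, and the voter masks it.
single_fault = tmr((correct, correct ^ 0b0100, correct))

# Common-mode failure: two modules flip the same bit, and the voter
# propagates the wrong value.
common_mode = tmr((correct ^ 0b0001, correct ^ 0b0001, correct))

print(bin(single_fault), bin(common_mode))
```

The first vote returns the correct word, whereas the second returns the erroneous word shared by the two faulty modules.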
Voter designs include bitwise voters, which are majority voting systems that calculate the majority by comparing the outputs of the three modules in a bitwise fashion, and word-voters, which compare entire variable-length words to arrive at a result, as disclosed in Wei-Je Huang, Subhasish Mitra, and Edward J. McCluskey, “Fast Run-Time Fault Location in Dependable FPGAs”, Center for Reliable Computing, Stanford University and Mitra, S., and E. J. McCluskey, “Word-Voter: A New Voter Design for Triple Modular Redundant Systems,” 18th IEEE VLSI Test Symposium, (Apr. 30-May 4, 2000) pp. 465-470, Montreal, Canada. TMR systems with voters have a single point of failure: the voting element. If the logic elements used for the construction of the voter are subject to a failure, then the whole system would fail. Using redundant voters is a poor alternative: it improves fault tolerance but does not guarantee the reliability of the results produced.
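The difference between the two voter styles can be illustrated with a short sketch. The word-voter behavior modeled here (output a word on which at least two modules agree, otherwise raise an error flag) is an assumption made for illustration and is not a reproduction of the design in the cited paper; all names and example words are hypothetical.

```python
from typing import Optional, Tuple


def bitwise_vote(a: int, b: int, c: int) -> int:
    """Majority taken independently at each bit position."""
    return (a & b) | (b & c) | (a & c)


def word_vote(a: int, b: int, c: int) -> Tuple[Optional[int], bool]:
    """Majority taken over whole words.

    Returns (word, error). The error flag is True when no two modules
    produce the same word, in which case no word is returned.
    """
    if a == b or a == c:
        return a, False
    if b == c:
        return b, False
    return None, True


# Two modules fail in different ways; only the first word is correct.
outputs = (0b0000, 0b0011, 0b0101)

bitwise_result = bitwise_vote(*outputs)
word_result = word_vote(*outputs)
print(bin(bitwise_result), word_result)
```

With these inputs the bitwise voter silently emits a word that none of the three modules produced, while the word-level voter reports that no majority exists. Neither sketch addresses faults in the voter itself, which remains the single point of failure noted above.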