1. Field of the Invention
This invention relates to electronic computer systems, and more particularly to fault-tolerant or reliable electronic systems employing multiple processing units in order to reduce computational errors and/or determine the source of computational errors. The invention described herein may also be useful in supporting the development or investigation of improvements to components used in electronic systems employing multiple processing units.
2. Description of the Relevant Art
An electronic circuit such as a microprocessor may fail to produce a correct result due to “hard” failures or “soft” errors. Hard failures are permanent and reproducible, and typically result from design errors, fabrication errors, fabrication defects, and/or physical failures. A failure to properly implement a functional specification represents a design error. Fabrication errors are attributable to human error, and include the use of incorrect components, the incorrect installation of components, and incorrect wiring. Examples of fabrication defects, which result from imperfect manufacturing processes, include conductor opens and shorts, mask alignment errors, and improper doping profiles. Physical failures occur due to wear-out and/or environmental factors. The thinning and/or breakage of fine aluminum lead wires inside integrated circuit packages due to electromigration or corrosion are examples of physical failures. Soft errors, on the other hand, are temporary and non-reproducible. Soft errors are often the result of transient phenomena such as electrical noise (e.g., power supply “glitches” and ground “bounce”), energetic particles (e.g., alpha particles), or “marginal” circuit design.
Incorrect results cannot be tolerated in computer systems used in, for example, aircraft flight control systems, missile guidance systems, and banking transactions. Computer systems used in such critical applications must be highly reliable. One method used to increase the reliability of such computer systems is called functional redundancy checking (FRC). FRC typically employs two electronic microprocessor devices functioning as central processing units (CPUs). A first “master” microprocessor and a second “checker” microprocessor receive the same input signals and execute instructions simultaneously (i.e., in lock step). The checker microprocessor compares the output signals produced by the master microprocessor to its own internally-generated output signals. If any output signal produced by the master microprocessor does not match the respective output signal produced by the checker microprocessor, the checker microprocessor generates an error signal which initiates corrective action (i.e., “notification”).
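By way of illustration only, the lock-step comparison described above may be modeled in software as follows. This is a minimal sketch rather than a description of any actual FRC circuitry: the function names `run_lockstep`, `good_cpu`, and `faulty_cpu`, the 8-bit output width, and the stuck-at-0 fault are all assumptions introduced for the example.

```python
from typing import Callable, Iterable, Optional


def run_lockstep(master: Callable[[int], int],
                 checker: Callable[[int], int],
                 inputs: Iterable[int]) -> Optional[int]:
    """Return the index of the first cycle whose outputs differ, or None."""
    for cycle, value in enumerate(inputs):
        master_out = master(value)    # master drives the output signals
        checker_out = checker(value)  # checker computes the same result internally
        if master_out != checker_out:
            return cycle              # mismatch: the FRC error signal is asserted
    return None


def good_cpu(x: int) -> int:
    # Hypothetical 8-bit computation standing in for instruction execution.
    return (x * 3 + 1) & 0xFF


def faulty_cpu(x: int) -> int:
    # Hypothetical hard failure: bit 0 of the output is stuck at 0.
    return good_cpu(x) & 0xFE


if __name__ == "__main__":
    print(run_lockstep(good_cpu, good_cpu, range(8)))
    print(run_lockstep(good_cpu, faulty_cpu, [1, 2, 3]))
```

In this model the stuck-at fault goes unnoticed whenever the correct output already has bit 0 clear, which reflects the fact that FRC can only detect an error once it affects a compared signal.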
FIG. 1 is a block diagram of a typical electronic computer system 10 employing FRC. Electronic computer system 10 includes identical first and second CPUs 12a and 12b, a processor bus 14, chip set logic 16, a memory unit 18, a memory bus 20, a system bus 22, and a peripheral device 24. CPUs 12a and 12b are typically microprocessor integrated circuits formed upon a single monolithic semiconductor substrate. Processor bus 14 couples both CPU 12a and CPU 12b to each other and to chip set logic 16. Chip set logic 16 functions as an interface between CPUs 12a-b and system bus 22, and between CPUs 12a-b and memory unit 18. System bus 22 is adapted for coupling to one or more peripheral devices. Peripheral device 24 is coupled to system bus 22. Peripheral device 24 may be, for example, a disk drive unit, a video display unit, or a printer. Memory unit 18 stores data, and typically includes semiconductor memory devices. Chip set logic 16 is coupled to memory unit 18 via memory bus 20, and may include a memory controller.
CPUs 12a and 12b include built-in functional redundancy checking circuitry. During system initialization, either CPU 12a or CPU 12b is configured to be the master, and the other CPU is configured to be the checker CPU. The master CPU drives its output terminals, while the checker CPU changes its output terminals to function as input terminals. The respective terminals (e.g., “pins”) of CPUs 12a and 12b are coupled together. The checker CPU compares its internally-generated values to those produced by the master CPU and received at the respective terminals. If any output signal produced by the master CPU does not match the respective output signal produced by the checker CPU, the checker CPU produces an error signal. The error signal may serve as notification to external error recovery hardware (not shown). For example, the error signal may be routed to a third maintenance CPU (not shown) or an interrupt controller (not shown) which initiates an error recovery routine in response to the error signal. The error recovery routine may involve “backing up” the software program running at the time the error occurred to an established “checkpoint” at which instruction execution may be reinitiated.
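The checkpoint-based recovery mentioned above may be sketched, purely as an example, in the following way. The program state (a program counter and an accumulator), the checkpoint interval, and the function name `run_with_recovery` are hypothetical; the error recovery routine itself is not specified herein, and the FRC error is modeled simply as a one-time event at a chosen step.

```python
import copy
from typing import List, Optional, Tuple


def run_with_recovery(values: List[int],
                      checkpoint_every: int,
                      soft_error_at: Optional[int] = None) -> Tuple[int, int]:
    """Sum `values`, rolling back to the last checkpoint on an FRC error.

    Returns (final accumulator value, number of rollbacks performed).
    """
    state = {"pc": 0, "acc": 0}      # hypothetical program state
    saved = copy.deepcopy(state)     # most recent checkpoint
    rollbacks = 0
    pending_error = soft_error_at    # a soft error strikes once, then vanishes

    while state["pc"] < len(values):
        if state["pc"] % checkpoint_every == 0:
            saved = copy.deepcopy(state)      # establish a checkpoint

        pc = state["pc"]
        state["acc"] += values[pc]            # "execute" one instruction
        state["pc"] += 1

        if pending_error == pc:
            # Assumed: the FRC error signal fires here. Discard possibly
            # corrupted state and re-execute from the checkpoint.
            pending_error = None
            state = copy.deepcopy(saved)
            rollbacks += 1

    return state["acc"], rollbacks


if __name__ == "__main__":
    print(run_with_recovery([1, 2, 3, 4, 5], checkpoint_every=2))
    print(run_with_recovery([1, 2, 3, 4, 5], checkpoint_every=2, soft_error_at=3))
```

Because a soft error is non-reproducible, re-execution from the checkpoint yields the same final result as an error-free run, at the cost of repeating the instructions executed since the checkpoint.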
The master CPU initiates data read and write operations. In response to a memory read request from the master CPU, chip set logic 16 obtains data from memory unit 18 via memory bus 20 and provides the data to both CPU 12a and CPU 12b via processor bus 14. During a memory write operation, chip set logic 16 receives the data from the master CPU and stores the data within memory unit 18 via memory bus 20. In response to a read request from an address within an address range assigned to peripheral device 24, chip set logic 16 obtains data from peripheral device 24 via system bus 22 and provides the data to both CPU 12a and CPU 12b via processor bus 14. During a write operation to an address within an address range assigned to peripheral device 24, chip set logic 16 receives the data from the master CPU and provides the data to peripheral device 24 via system bus 22.
Several problems occur when implementing electronic computer system 10. Most importantly, the signals driven upon the output terminals of a CPU often do not adequately reflect the current internal execution state of the CPU. For example, there may be a time delay of many system clock cycles before an activity within the CPU results in signals being driven upon the output terminals. In addition, CPUs 12a and 12b may include relatively large internal cache memory systems 26a and 26b. Such cache memory systems are capable of holding large numbers of instructions and data. CPUs 12a and 12b are capable of operating for extended periods using instructions and data stored in respective cache memory systems 26a and 26b. During these extended periods, any computational errors produced do not propagate to the terminals of CPUs 12a and 12b, and are hence not “visible” for detection using FRC. As a result, cache memory systems 26a and 26b tend to delay error detection. Early detection of an error is key to determining the cause of the error and reducing the likelihood that valuable data is lost due to the error.
Furthermore, the maximum amount of data which may be transferred over processor bus 14 in a given amount of time (i.e., the maximum “speed” of processor bus 14) is limited by the increased electrical loading of two CPUs and signal reflections within the signal lines of processor bus 14 due to the multiple connection points (i.e., terminations). Electronic computer system 10 does not support separate “point-to-point” processor buses capable of much higher speeds.
It would be beneficial to have an electronic system and method implementing FRC by comparing “signatures” generated by each CPU. Each “signature” would include a relatively small number of bits, and would preferably be representative of the internal execution state of the CPU. Immediate comparisons of representative signatures would facilitate earlier error detection, especially when the CPUs include relatively large internal cache memory systems. In addition, comparing only such signatures would reduce processor bus loading and signal reflections caused by multiple signal line terminations, allowing the processor buses to transfer more data in a given amount of time (i.e., to be “faster”).
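The difference in detection latency between comparing output terminals and comparing signatures may be illustrated with the following simplified model. It is a sketch under stated assumptions: each CPU is reduced to a 16-bit accumulator, cached operation is modeled as the terminals being driven only once every `writeback_every` cycles, the signature is an assumed 4-bit XOR fold of the accumulator, and the soft error is a single flipped bit. None of these particulars are taken from the description above.

```python
from typing import Optional, Tuple


def signature(state: int) -> int:
    """Fold a 16-bit state into 4 bits by XORing its nibbles (assumed scheme)."""
    return (state ^ (state >> 4) ^ (state >> 8) ^ (state >> 12)) & 0xF


def detection_cycles(n_cycles: int,
                     fault_cycle: Optional[int],
                     writeback_every: int) -> Tuple[Optional[int], Optional[int]]:
    """Return (cycle detected at the terminals, cycle detected via signatures)."""
    acc_a = acc_b = 0
    pin_detect = sig_detect = None
    for cycle in range(n_cycles):
        # Both CPUs execute the same work out of their caches.
        acc_a = (acc_a + cycle) & 0xFFFF
        acc_b = (acc_b + cycle) & 0xFFFF
        if cycle == fault_cycle:
            acc_b ^= 0x0004  # hypothetical soft error: one bit flips in CPU B

        # Signatures are compared on every cycle.
        if sig_detect is None and signature(acc_a) != signature(acc_b):
            sig_detect = cycle

        # Terminals carry the state only on write-back cycles.
        drives_pins = (cycle + 1) % writeback_every == 0
        if pin_detect is None and drives_pins and acc_a != acc_b:
            pin_detect = cycle
    return pin_detect, sig_detect


if __name__ == "__main__":
    print(detection_cycles(n_cycles=100, fault_cycle=10, writeback_every=32))
```

In this model the signature comparison flags the error in the cycle in which it occurs, whereas the terminal comparison must wait for the next write-back. A folded signature can in principle alias (two different states producing the same signature), so the small signature width trades some detection coverage for fewer signals.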
The problems outlined above are in large part solved by an electronic system and method implementing functional redundancy checking (FRC) by comparing “signatures” produced by each of two electronic devices, for example central processing units (CPUs). The signatures include a relatively small number of signals which are representative of the internal state (i.e., execution state) of each CPU. The electronic system includes a first CPU and second CPU. Each CPU is configured to execute instructions and to produce output signals. The first and second CPUs are preferably identical and execute instructions simultaneously such that their internal states and produced output signals are the same at any given time. Each CPU includes a signature generator for generating a signature representative of the internal state of the CPU. The electronic system also includes a compare unit coupled to receive the signatures produced by the first and second CPUs. The compare unit compares the signatures produced by the first and second CPUs and produces an error signal if the signatures are not identical. A compare unit may be integrated into each CPU, wherein only one of the compare units would be functional in a system employing multiple CPUs.
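As a non-limiting software analogy, the arrangement of two CPUs, each with a signature generator, and a single compare unit might be modeled as shown below. The class `Cpu`, the function `compare_unit`, the 32-bit state, the state-update rule, and the 8-bit XOR-fold signature are assumptions made for the example and do not describe the actual signature generator.

```python
from dataclasses import dataclass

SIGNATURE_BITS = 8  # assumed signature width


@dataclass
class Cpu:
    state: int = 0  # stand-in for the internal execution state (32 bits)

    def execute(self, operand: int) -> None:
        # Hypothetical state update standing in for instruction execution.
        self.state = (self.state * 31 + operand) & 0xFFFFFFFF

    def signature(self) -> int:
        """Signature generator: XOR-fold the state down to SIGNATURE_BITS."""
        mask = (1 << SIGNATURE_BITS) - 1
        remaining, sig = self.state, 0
        while remaining:
            sig ^= remaining & mask
            remaining >>= SIGNATURE_BITS
        return sig


def compare_unit(sig_a: int, sig_b: int) -> bool:
    """Return True (FRC error asserted) when the two signatures differ."""
    return sig_a != sig_b


if __name__ == "__main__":
    cpu_a, cpu_b = Cpu(), Cpu()
    for operand in (7, 11, 13):
        cpu_a.execute(operand)
        cpu_b.execute(operand)
    print(compare_unit(cpu_a.signature(), cpu_b.signature()))  # lock step holds

    cpu_b.state ^= 1 << 20  # inject a single-bit error into CPU B
    print(compare_unit(cpu_a.signature(), cpu_b.signature()))
```

Only the 8-bit signatures cross between the CPUs and the compare unit in this model, rather than the full set of output signals.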
The electronic system may be, for example, a computer system, and may further include a system bus and chip set logic. The system bus may be adapted for coupling to one or more peripheral devices. The chip set logic may be coupled between the first and second CPUs and the system bus, and may function as an interface between the first and second CPUs and the system bus. The first CPU and the second CPU may be coupled to the chip set logic via separate processor buses. At least a portion of the signal lines of the separate processor buses may be “point-to-point”, enabling the processor buses to achieve higher data transfer rates than the single processor bus of the typical computer system employing FRC in FIG. 1.
Each CPU may include a number of functional units, including a bus interface unit (BIU) which handles all data transfer operations for the CPU in accordance with established protocols. The BIU produces all CPU output signals coupled to the processor bus. In several embodiments, the signature generator is located within the BIU and generates a signature having a smaller number of signals than the number of output signals. Each signature signal may be, for example, dependent upon an internal state of a functional unit of the CPU.
For example, each CPU may include integer and floating point functional units, and the signature generator of each CPU may generate a signature from current output signals produced by the integer and floating point units. In this case the signature produced by each CPU is highly representative of the internal state of the CPU, and immediate comparison of the signatures by the compare unit results in early error detection even when the CPUs include relatively large internal cache memory systems.
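One way to picture a signature derived from the integer and floating point unit outputs is the sketch below. It assumes a 32-bit integer result and a 64-bit IEEE 754 floating point result, each XOR-folded to 8 bits and concatenated into a 16-bit signature; the helper names `float_bits`, `fold`, and `cpu_signature` and the chosen widths are hypothetical and serve only to show that the signature has fewer signals than the outputs from which it is derived.

```python
import struct


def float_bits(x: float) -> int:
    """Return the 64-bit IEEE 754 bit pattern of a Python float."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def fold(value: int, width: int = 8) -> int:
    """XOR-fold a non-negative integer down to `width` bits."""
    mask = (1 << width) - 1
    sig = 0
    while value:
        sig ^= value & mask
        value >>= width
    return sig


def cpu_signature(int_out: int, fp_out: float) -> int:
    """Combine the current integer and floating point outputs into 16 bits."""
    int_sig = fold(int_out & 0xFFFFFFFF)   # 32 integer-unit bits -> 8 bits
    fp_sig = fold(float_bits(fp_out))      # 64 floating-point bits -> 8 bits
    return (int_sig << 8) | fp_sig


if __name__ == "__main__":
    reference = cpu_signature(42, 1.5)
    print(reference == cpu_signature(42, 1.5))                 # identical results
    print(reference == cpu_signature(43, 1.5))                 # integer unit differs
    print(reference == cpu_signature(42, 1.5000000000000002))  # FP unit differs
```

Here 96 bits of functional unit output are reduced to 16 signature bits, and a single-bit difference in either unit's output changes the signature.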
The present method of detecting computational errors produced within an electronic computer system includes providing the first and second CPUs according to one of the embodiments described above along with the compare unit. The compare unit is coupled to receive the signatures produced by the first and second CPUs, and simultaneous instruction execution by the first CPU and the second CPU is initiated. Any difference in the signatures produced by the first and second CPUs represents a computational error and results in the generation of an FRC error signal by the compare unit.