1. Field of the Invention
The present invention relates to systems and methods for detecting and responding to errors and failures in a memory device, and particularly, for such systems and methods in space applications.
2. Description of the Related Art
Computer memory and other semiconductor components are susceptible to environmental effects which can cause them to fail. One class of failures occurs as a result of exposure to radiation. The environmental conditions for space applications present radiation which produces this class of failures. Such radiation can be devastating to a satellite lacking adequate safeguards. When cosmic radiation passes through a sensitive semiconductor component in a satellite, one of three possible conditions may result.
The first condition, which can occur in a microprocessor or RAM chip, is a single-event upset (SEU), wherein the contents of a particular memory address or register are inverted (e.g. a bit flips from 0 to 1). As a result, sensor data can be corrupted, algorithms can fail, and the satellite firmware can be adversely affected. A corrupted program could attempt to execute random code, or data in the memory may be lost.
The second condition is a single-event latchup (SEL). In this case, the affected component latches into a state in which it draws a dangerously high current until the power to the device is reset. If the current is not limited, the system power supply may also fail, or its voltage may dip below acceptable levels for normal system operation, affecting many other major onboard systems. Also, if the device is not rated for the high current, it may be destroyed.
The third condition induced by cosmic radiation is a single-event burnout (SEB). In this case, the affected device is destroyed immediately following exposure. Unlike SEUs and SELs (where the device is not destroyed and may be reset), the only adequate response to an SEB is to invoke a redundant device.
Furthermore, different semiconductor devices have different susceptibilities to radiation-induced failures. Some device designs may reduce (or virtually eliminate) the risk of a radiation-induced failure; however, it is often not reasonable to apply such techniques to every semiconductor device. In general, the higher the capacity of a memory device, the more susceptible it is to failures, including latchup. Thus, very high capacity memory devices, e.g. 64 Mbit devices, have a relatively high susceptibility. Therefore, systems and methods to protect these devices are especially important.
FIG. 1 is a block diagram of a typical prior art system 100 for latchup detection and mitigation. The system 100 includes hardware detection and reset components entirely separate from the software and other operations of the computer system 102 which it monitors. The monitored computer system 102 includes a central processing unit (CPU) 104, one or more memory devices 106, such as silicon-based SDRAM, and input/output devices 108 which are used to monitor and control various subsystems. The CPU 104 utilizes the memory 106, comprising one or more memory devices 106A, 106B, to store program data and information which are being processed and used by the computer 102. Program data and information are transferred between the CPU 104 and memory 106 via the data bus 110 as the computer 102 operates.
The latchup detection and mitigation system 100 operates by monitoring the current consumption of the memory 106 via links 112. Harmful radiation 114 may impinge on at least one of the memory devices 106A, causing a single event latchup (SEL) in the memory device 106A. As a result, the latched up memory device 106A begins to draw an excessive amount of current from the memory power supply 116. The current measurement hardware 118 continually monitors the current drawn by the memory devices 106 from the power supply 116 and relays the information to the threshold detection hardware 120. When an unsafe threshold is reached by any of the memory devices 106, the detection hardware 120 signals a reset to the power supply for at least the affected memory device 106A. For simple processor designs in which the power supply powers both the memories and the processor, the power supply reset will shut down power to the entire processor 102.
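By way of illustration only, the threshold comparison performed by the prior art detection hardware 120 might be modeled in software as follows. The sketch is not taken from the described system: the function name, the device labels used as dictionary keys, and the 0.5 A trip level are all assumptions, since no numeric threshold is given.

```python
from typing import Dict, List

# Hypothetical trip level in amperes; no numeric threshold is stated in the text.
UNSAFE_CURRENT_A = 0.5


def devices_to_reset(current_draw_a: Dict[str, float],
                     threshold_a: float = UNSAFE_CURRENT_A) -> List[str]:
    """Return the devices whose measured current has reached the threshold.

    current_draw_a maps a device label to its measured supply current in amperes.
    """
    return sorted(name for name, amps in current_draw_a.items()
                  if amps >= threshold_a)


# Example: device 106A has latched up and draws far more current than 106B.
readings = {"106A": 0.92, "106B": 0.11}
for device in devices_to_reset(readings):
    print(f"signal power supply reset for memory device {device}")
```

A comparison of this kind reacts only to excessive current, which is why such a scheme cannot observe an upset that leaves the current draw unchanged.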
The additional hardware adds to the cost and mass of the overall computer system 100. In addition, the hardware of the described system 100 increases the complexity and reduces the reliability of the computer system 102. Furthermore, this system 100 only detects and eliminates SELs that result in an excessive current draw which could damage or destroy hardware. It does not check for SEUs or other innocuous memory failures which do not result in a high current draw. Finally, because the system is hardware based, it is not easily or inexpensively altered to meet a change in requirements or to implement improvements.
There is a need for systems and methods which can detect and respond appropriately to single event failures of any type. If a memory device latches up so that it completely fails, power needs to be removed from it in a timely manner, even if that means immediately shutting down the entire processor. On the other hand, if the memory experiences an SEU, the system and method need to correct the error(s) without interrupting the functionality of the processor. Furthermore, there is a need for such systems and methods to function without requiring additional hardware components. There is also a need for such systems and methods to be inexpensive, reliable, lightweight and easily modified. The present invention meets all of these needs.
The present invention discloses an apparatus, method and article of manufacture for detecting memory device failures. The exemplary method comprises detecting errors in data stored in a memory device from the data transacted with a processor, correcting the detected errors in the data transacted with the processor, tracking the detected errors in the memory device, determining when the memory device has failed based upon the tracked detected errors, and resetting the memory device when the memory device fails testing. Errors can be corrected such that no erroneous data is transacted with the processor.
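The tracking, failure determination and resetting steps of the exemplary method can be sketched as below. This is a minimal illustration under stated assumptions, not the claimed implementation: the sliding time window, the limit of three corrections, the class name ErrorTracker and the power_cycle callback are all hypothetical, as the text does not specify how tracked errors are evaluated.

```python
from collections import deque
from typing import Callable, Deque


class ErrorTracker:
    """Counts corrected errors inside a sliding time window (assumed policy)."""

    def __init__(self, max_errors: int = 3, window_s: float = 10.0) -> None:
        self.max_errors = max_errors      # hypothetical limit
        self.window_s = window_s          # hypothetical window length, seconds
        self._events: Deque[float] = deque()

    def record_correction(self, now_s: float) -> None:
        """Log the time at which an error correction was required."""
        self._events.append(now_s)

    def has_failed(self, now_s: float) -> bool:
        """Deem the device failed when corrections in the window exceed the limit."""
        while self._events and now_s - self._events[0] > self.window_s:
            self._events.popleft()        # discard corrections older than the window
        return len(self._events) > self.max_errors

    def clear(self) -> None:
        """Forget the error history, e.g. after the device has been reset."""
        self._events.clear()


def check_and_reset(tracker: ErrorTracker, now_s: float,
                    power_cycle: Callable[[], None]) -> bool:
    """Reset the memory device if the tracked errors indicate a failure."""
    if tracker.has_failed(now_s):
        power_cycle()                     # signal the memory power supply to cycle
        tracker.clear()
        return True
    return False
```

With these assumed settings, an isolated correction (consistent with a single SEU) leaves the device running, while a burst of corrections within the window triggers one reset.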
In one embodiment, the error detection and correction is carried out by a hardware logic device on the data bus, and the failure determination and resetting are performed by software.
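The text does not identify the code used by the hardware logic device. Purely as an example of how a single flipped bit can be detected and corrected before data reaches the processor, the following sketch uses a Hamming(7,4) code, which is an assumed stand-in chosen for brevity. The function names encode and decode are hypothetical.

```python
from typing import List, Tuple


def encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4   # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(word: List[int]) -> Tuple[List[int], bool]:
    """Return the 4 data bits and whether a single-bit correction was needed."""
    c = list(word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # 1-based position of the flipped bit, 0 if none
    corrected = position != 0
    if corrected:
        c[position - 1] ^= 1          # flip the erroneous bit back
    return [c[2], c[4], c[5], c[6]], corrected


stored = encode([1, 0, 1, 1])
stored[4] ^= 1                        # simulate an SEU flipping one stored bit
data, corrected = decode(stored)
print(data, corrected)
```

The boolean returned by decode is the kind of "correction was required" indication that software could count when judging whether the memory device has failed. A code of this size corrects only one flipped bit per word.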
The invention tracks how frequently error correction is required and uses this information to determine if the memory device has failed. Errors will appear as a result of ordinary data transactions between the processor and memory device as it operates. When a memory device failure is determined, the invention resets the memory device by signaling a power supply of the memory device to cycle. The invention also treats a latchup that is indicated soon after the memory device is powered as erroneous; in this case the indicated latchup is ignored.
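The handling of a latchup indicated soon after powering can be illustrated with a simple time comparison. The sketch below is an assumption-laden example: the text does not say how long "soon after powering" is, so the 50 ms settle time and the function name should_act_on_latchup are hypothetical.

```python
# Hypothetical settle time in seconds; "soon after powering" is not quantified.
SETTLE_TIME_S = 0.05


def should_act_on_latchup(detected_at_s: float,
                          powered_on_at_s: float,
                          settle_time_s: float = SETTLE_TIME_S) -> bool:
    """Return False for a latchup indicated within the settle time after power-on.

    Such an indication is treated as erroneous and ignored; a later one is acted on.
    """
    return (detected_at_s - powered_on_at_s) >= settle_time_s


# An indication 10 ms after power-on is ignored; one at 2 s is acted upon.
print(should_act_on_latchup(detected_at_s=0.01, powered_on_at_s=0.0))
print(should_act_on_latchup(detected_at_s=2.0, powered_on_at_s=0.0))
```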
In one embodiment, the invention also affirmatively tests the memory device, e.g. by periodically performing a write operation of test data to the memory device, followed by a read operation of the test data from the memory device. A failure of the memory device is determined based upon error correction required in response to the test (e.g. the read operation). However, errors in the test data are corrected such that no erroneous test data is transacted with the processor.
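The affirmative test can be sketched as a write of test data followed by a read, with the number of corrections reported on the reads serving as the test result. The example is illustrative only: SimulatedMemory stands in for a memory device behind error correction logic, and its interface, the 0xA5 test pattern and the function name count_test_corrections are assumptions rather than details from the text.

```python
from typing import Dict, Iterable, Set, Tuple

TEST_PATTERN = 0xA5  # hypothetical test data


class SimulatedMemory:
    """Stand-in for a memory device read through error correction logic."""

    def __init__(self, faulty_addresses: Iterable[int] = ()) -> None:
        self._cells: Dict[int, int] = {}
        self._faulty: Set[int] = set(faulty_addresses)

    def write(self, address: int, value: int) -> None:
        self._cells[address] = value

    def read(self, address: int) -> Tuple[int, bool]:
        """Return (corrected data, correction_required) for the address."""
        # Data is always returned clean; the flag reports that a fix was needed.
        return self._cells[address], address in self._faulty


def count_test_corrections(memory: SimulatedMemory,
                           addresses: Iterable[int],
                           pattern: int = TEST_PATTERN) -> int:
    """Write then read test data; count the reads that required correction."""
    corrections = 0
    for address in addresses:
        memory.write(address, pattern)
        value, corrected = memory.read(address)
        assert value == pattern       # no erroneous test data reaches the caller
        if corrected:
            corrections += 1
    return corrections


memory = SimulatedMemory(faulty_addresses={2})
print(count_test_corrections(memory, range(4)))
```

The count returned by such a test could be compared against a limit to decide whether the error correction required has become excessive.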
The present invention responds to memory device errors (e.g. SEUs) as well as failures (e.g. SELs). The error correction logic monitors the overall “health” of the data stored within the memory device. This monitoring is facilitated through periodic testing (e.g. read/write operations). When error correction for a memory device becomes excessive, indicating a failure beyond the scope of a simple SEU, a failure is deduced and a memory reset is performed.