In the field of integrated circuit memory storage, a memory storage cell includes an arrangement of semiconductor components on a wafer, the combined operation of which defines one of a logic high and a logic low memory storage cell state.
It is very desirable for memory storage cells to have a fast state change in order to provide fast memory writes and therefore fast memory access. The state of each memory storage cell is defined by electromagnetic characteristics. Electric currents, electric potentials, electric fields, magnetic fields, etc., stronger than naturally occurring ones, are employed to intentionally change and maintain memory storage cell states. A balance must be struck, as stronger electric currents, potentials, electric fields and magnetic fields reduce the speed of the integrated circuit memory and increase the expended power. Market pressures have pushed the development of integrated circuit memory storage towards high density, miniaturized, micropower integrated circuit memory storage devices operating just above reasonably shieldable average naturally occurring electric currents, potentials, electric fields and magnetic fields. The envelope of the possible and usable is constantly pushed through miniaturization.
A soft-error, also known as a single event upset, is a memory bit error in an integrated circuit memory storage device caused by unintended, uncontrollable phenomena, typically natural phenomena such as the chance incidence of radiation, high-energy neutrons or cosmic rays. Such phenomena unintentionally subject memory cells of the silicon memory storage device to a significantly stronger electric current, potential, electric field or magnetic field, so as to induce a state change of at least one memory cell, typically corrupting the bit values stored.
While such common external events have a low probability of affecting any particular integrated circuit memory storage device, in systems with large amounts of integrated circuit memory storage and/or systems which are required to have long duration up-times, such soft-errors have been found to occur several times per year, often causing service affecting problems. For example, telecommunications equipment is required to have both large memory stores and up-times measured in years. In a typical communications network employing a large number of cooperating, interconnected, interdependent telecommunications network nodes, the deleterious effect of a single soft-error experienced by a single network node will often affect the operation of multiple network nodes directly or indirectly connected thereto.
Without implying any limitations, by far the most common causes of soft-errors relate to naturally occurring radioactive discharge events and cosmic ray emissions. Alpha-particles, for example, have a limited penetration through matter and therefore soft-errors due to alpha-particle discharge events can be greatly limited by ensuring that the materials used in and about the integrated circuit memory storage device are radioactively inert. Cosmic rays, however, generate subatomic particle showers, specifically energetic neutron showers, which can penetrate matter to great depths. Shielding is, for all intents and purposes, effective only against alpha-particles. Regardless of source, soft-errors are more likely to occur under improper cooling conditions, as the electrons in the substrate of the integrated circuit memory storage device are more susceptible to being knocked off to higher orbital levels. The cost of shielding against alpha-particle discharges has to be balanced against the inseparable cost of cooling, as shielding also tends to prevent proper cooling. Nevertheless, soft-errors represent a continuing problem that needs to be addressed.
Techniques typically used in an attempt to mitigate memory errors include Error Correction Coding (ECC). Error correction coding adds extra information to data bits in a fashion that allows a bit error to be corrected or detected if one or two bits of the combination are changed. Typical error correcting codes provide for the correction of a single bit error and the detection of dual bit errors, and require an additional 8 bits to be added to a group of memory cells used for storing a 64 bit long data word. Currently known ECC techniques cannot be used to address more than two bit errors.
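The single-error-correcting, double-error-detecting behavior described above can be illustrated with a minimal sketch. It assumes an extended Hamming code laid out over 72 bit positions (64 data bits, 7 Hamming check bits and 1 overall parity bit); the layout and the names `encode` and `decode` are hypothetical choices made for illustration and are not taken from any particular memory device.

```python
from typing import List, Optional, Tuple

CODE_LEN = 72                                   # positions 0..71 (assumed layout)
CHECK_POSITIONS = [1, 2, 4, 8, 16, 32, 64]      # Hamming check bits
DATA_POSITIONS = [p for p in range(1, CODE_LEN) if p not in CHECK_POSITIONS]

def encode(word: int) -> List[int]:
    """Encode a 64-bit word into 72 bits (position 0 is the overall parity bit)."""
    bits = [0] * CODE_LEN
    for i, pos in enumerate(DATA_POSITIONS):
        bits[pos] = (word >> i) & 1
    for c in CHECK_POSITIONS:
        # Check bit c makes the parity of all positions containing bit c even.
        bits[c] = sum(bits[pos] for pos in range(1, CODE_LEN) if pos & c) & 1
    bits[0] = sum(bits) & 1                     # make the overall parity even
    return bits

def decode(received: List[int]) -> Tuple[Optional[int], str]:
    """Return (word, status); word is None when the error cannot be corrected."""
    bits = list(received)
    syndrome = 0
    for pos in range(1, CODE_LEN):
        if bits[pos]:
            syndrome ^= pos                     # XOR of the positions of set bits
    overall_odd = sum(bits) & 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        if syndrome >= CODE_LEN:
            return None, "uncorrectable"
        bits[syndrome] ^= 1                     # syndrome 0 means position 0 flipped
        status = "corrected"
    else:
        return None, "double-error"             # even number of flips, nonzero syndrome
    word = sum(bits[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return word, status

word = 0x0123456789ABCDEF
code = encode(word)

single = list(code)
single[37] ^= 1                                 # one flipped bit: corrected

double = list(code)
double[37] ^= 1
double[5] ^= 1                                  # two flipped bits: detected only
```

In this sketch, `decode(single)` recovers the original word, while `decode(double)` reports the error without attempting a correction. Three or more flipped bits may be miscorrected or missed, which is consistent with the two-bit limit noted above.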
ECC techniques are usually not implemented on large Synchronous Static Random Access Memories (SSRAM) employed in typical high-speed low-power applications, because of the already large size and increased cost of the SSRAM memory chips compared to less expensive and smaller Dynamic Random Access Memory (DRAM) chips. SSRAM is implemented using five to six gates per memory cell compared to a single gate for each DRAM memory cell, the additional ECC memory bits also employing the same number of gates. Therefore, in SSRAM applications, soft-errors which could have been mitigated had ECC technology been employed remain uncorrected.
Other current research and development relates to more sophisticated memory error detection techniques; however, implementing such techniques is subject to substantial development costs, substantial testing and validation overheads, and substantial operational overheads.
Other techniques typically used to detect bit errors include parity checking. Parity memory is used to detect memory bit errors. Each byte of data (typically 8 bits implemented as a group of 8 memory cells) is accompanied by a parity bit, the value of which is determined by the number of ones (the number of memory cells in the logic high state) stored therein. Even/odd parity ensures that the total number of energized memory cells storing the data bits and parity bit is even/odd. Parity memory is most commonly used on microcomputers employing small word sizes. Typically, parity error check monitoring has, up to now, only been performed in hardware for entire memory storage devices, with no capability to pinpoint the exact location in the memory device of the affected memory cell. Parity checking techniques can be used to detect some errors affecting more than two memory bits, namely those in which an odd number of the bits covered by a parity bit have changed.
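The per-byte parity scheme described above can be sketched as follows. This is a minimal software illustration, assuming one parity bit per 8-bit byte; the function names `parity_bit` and `parity_ok` are hypothetical, and actual parity memory performs the equivalent check in hardware.

```python
def parity_bit(byte: int, even: bool = True) -> int:
    """Return the parity bit that accompanies an 8-bit value."""
    ones = bin(byte & 0xFF).count("1")
    bit = ones & 1              # 1 when the data holds an odd number of ones
    return bit if even else bit ^ 1

def parity_ok(byte: int, stored_parity: int, even: bool = True) -> bool:
    """Recompute the parity of the data and compare it with the stored bit."""
    return parity_bit(byte, even) == stored_parity

data = 0b10110010               # four ones, so the even-parity bit is 0
stored = parity_bit(data)

one_flip = data ^ 0b00000100    # one cell changed state
two_flips = data ^ 0b00000110   # two cells changed state
three_flips = data ^ 0b00000111 # three cells changed state

detected_one = not parity_ok(one_flip, stored)
detected_two = not parity_ok(two_flips, stored)
detected_three = not parity_ok(three_flips, stored)
```

Here the one-bit and three-bit changes are detected, while the two-bit change leaves the parity unchanged and goes unnoticed. The check also reveals only that the byte is wrong, not which bit changed.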
Soft-errors manifest themselves as parity errors, inevitably incurring large maintenance overheads. Until recently, memory chips operated at high voltages and parity errors were associated with faulty hardware. Traditional approaches to addressing memory errors include:
    hardware resetting or power-cycling system/equipment, resulting in significant disruption to the availability of the system/equipment to perform its intended function and therefore a significant disruption to all provisioned services; and
    employing memory storage devices which have ECC while incurring a high cost.
Therefore, the typical mitigation of memory errors assumes that all errors experienced by integrated circuit memory storage devices are hard errors requiring replacement of the entire memory storage device.
More recently, as memory storage device operational voltages have decreased, improved understanding of soft-errors has enabled other steps to be taken. The most relevant of these steps to the present description is a solution proposed by Cisco Systems, Inc., in a white paper entitled “Increasing Network Availability,” which describes a process which scans for parity errors throughout memory storage devices without ECC. Additionally, the paper states that, as a matter of standard practice, hardware components employing memory storage devices affected by the parity errors should be replaced on the second such single event upset. Without knowledge of the cause of the parity error, this practice results in unnecessary maintenance overheads, and possibly prolonged system downtime, which could be avoided in the case when the cause of an experienced parity error is a soft-error.
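The scan-and-replace practice summarized above can be sketched in software under stated assumptions. The white paper's actual implementation is not reproduced here; the memory model, the function names `scan_for_parity_errors` and `needs_replacement`, and the threshold constant are all hypothetical and serve only to illustrate the policy.

```python
REPLACE_AFTER_EVENTS = 2        # assumption: mirrors the "second event" practice

def even_parity(byte: int) -> int:
    """Even-parity bit for an 8-bit value."""
    return bin(byte & 0xFF).count("1") & 1

def scan_for_parity_errors(memory, parity_bits):
    """Return the addresses whose stored parity bit disagrees with the data."""
    return [addr
            for addr, (value, stored) in enumerate(zip(memory, parity_bits))
            if even_parity(value) != stored]

def needs_replacement(event_count: int) -> bool:
    """Apply the replace-on-second-event policy to a running event count."""
    return event_count >= REPLACE_AFTER_EVENTS

# Hypothetical memory contents with one parity bit kept per byte.
memory = bytearray(b"routing table entry")
parity_bits = [even_parity(b) for b in memory]

memory[5] ^= 0x08               # simulated upset: one bit changes state
first_scan = scan_for_parity_errors(memory, parity_bits)
event_count = len(first_scan)
replace_now = needs_replacement(event_count)
```

After the first simulated upset the policy does not yet call for replacement. A second upset would trigger replacement even if both events were soft-errors on otherwise healthy hardware, which is the unnecessary maintenance overhead noted above.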
Prudent systems design calls for budgeting and employing integrated circuit memory storage devices larger than strictly required, mainly to delay systems obsolescence, as systems are expected to undergo upgrades post-deployment. The spare memory storage capacity employed exposes systems to a greater extent to soft-errors and therefore to greater maintenance overheads. Maintenance overheads for interconnected interdependent deployments compound, as the macro effects of soft-errors may only manifest themselves on equipment adjacent to the equipment employing the actual soft-error affected integrated circuit memory storage device.
As the importance and impact of soft-errors have just begun to be realized, further improvement in system/service availability has been found to be hampered by the occurrence of soft-errors, particularly affecting systems having large memories required to provide high reliability over prolonged periods of time. Therefore, there is a need to mitigate the deleterious effects of soft-errors experienced by high-reliability systems employing large integrated circuit memory storage devices that store data/code for extended periods of time.