Most general-purpose digital computers provide a system for detecting and handling single-bit or multiple-bit parity errors. The occurrence of parity errors is not uncommon when data signals are being read from storage devices such as static random access memories (SRAMs) and dynamic random access memories (DRAMs). This is especially true when high-density memories are employed, as is generally the case in large data processing systems.
Many factors contribute to the occurrence of parity errors. Contaminants such as dust particles are proportionally larger relative to the dimensions of the individual transistors employed within high-density SRAMs and DRAMs, and are therefore more likely to cause latent defects resulting in parity errors. Alpha particles can also cause parity errors. Alpha particles are randomly generated, positively charged nuclear particles originating from several sources, including cosmic rays, which constantly bombard the earth from outer space, and the decay of naturally occurring radioisotopes such as radon, thorium, and uranium. Concrete buildings and lead-based products such as solder, paint, ceramics, and some plastics are all well-known alpha emitters. Smaller-geometry storage devices are more susceptible to alpha-particle strikes, resulting in a higher occurrence of parity errors.
In addition to the problems associated with alpha particles and other environmental contaminants, shrinking technology sizes contribute to the occurrence of parity errors. Manufacturing tolerances decrease as geometries decrease, making latent defects more likely to occur. This is particularly true when minimum feature sizes decrease below 0.5 microns.
Because of the potential for parity errors, data processing systems must be able to detect and correct these errors efficiently. To minimize the impact on system performance, the recovery mechanism should operate without system software intervention; if such intervention is necessary, performance may suffer because additional references to memory may be required.
One approach to detecting and correcting parity errors involves storing data signals with additional signals called “check bits”. The check bits and associated data signals form a code that can be used to detect and subsequently correct a parity error so that corrected data is returned to a requester. However, this method can add latency, since an additional clock cycle may be required to generate the check bits, and another cycle may be required to perform the error detection and, if necessary, data correction.
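As an illustration of a check-bit code of this general kind (the section does not name a specific code), the classic Hamming(7,4) scheme stores three check bits with every four data bits; the decoder's syndrome both detects a single-bit error and locates the bit to correct. This is a minimal sketch, not the implementation of any particular system:

```python
def hamming74_encode(data):
    """Encode 4 data bits (list of 0/1) into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4              # check bit covering positions 1,3,5,7
    p2 = d1 ^ d3 ^ d4              # check bit covering positions 2,3,6,7
    p3 = d2 ^ d3 ^ d4              # check bit covering positions 4,5,6,7
    return [p1, p2, d1, p3, d2, d3, d4]   # codeword positions 1..7

def hamming74_decode(code):
    """Return (corrected 4 data bits, error position, or 0 if no error)."""
    c = [0] + code                 # 1-indexed for convenience
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s3 = c[4] ^ c[5] ^ c[6] ^ c[7]
    syndrome = s1 + 2 * s2 + 4 * s3    # points at the flipped position
    if syndrome:
        c[syndrome] ^= 1           # correct the single-bit error in place
    return [c[3], c[5], c[6], c[7]], syndrome
```

Flipping any single bit of a codeword yields a nonzero syndrome identifying that bit, so the corrupted position can be corrected before the data is returned to the requester. The extra encode and decode steps correspond to the additional clock cycles noted above.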
Another mechanism for handling parity errors involves detecting the error using one or more parity bits. When a parity error is detected, an uncorrupted copy of the data is read from another memory and the corrupted data is discarded. For example, when corrupted data is obtained from a cache memory, an uncorrupted copy may be retrieved from the main memory. This type of approach does not require the use of error correction logic. However, the latency associated with retrieving data from the main memory following the occurrence of a parity error may delay processing activities and decrease system throughput.
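The detect-and-refetch mechanism can be sketched as follows. The function names and the dictionary-based cache and main-memory models are purely illustrative assumptions, not part of the original description:

```python
def parity_bit(word):
    """Even parity over a word's bits: the XOR of all bits."""
    p = 0
    while word:
        p ^= word & 1
        word >>= 1
    return p

def read_with_fallback(addr, cache, main_memory):
    """Hypothetical sketch: read from cache; on a parity mismatch,
    discard the corrupted copy and re-fetch from main memory."""
    word, stored_parity = cache[addr]
    if parity_bit(word) == stored_parity:
        return word                        # clean cache hit
    # Parity error detected: the cached copy is discarded and the
    # uncorrupted copy is retrieved from main memory (added latency).
    clean = main_memory[addr]
    cache[addr] = (clean, parity_bit(clean))
    return clean
```

No correction logic is needed: the parity bit only detects the error, and the main-memory access supplies the good data, at the cost of the longer main-memory latency noted above.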
The foregoing approaches address the problem of providing corrected data to a requester, but do not necessarily address the root cause of the error. In some cases, an error occurs because a transient, or “soft”, error arose within the memory. The affected memory location is no more likely than any other location to cause a future error, and the system may continue to use it. In other cases, a permanent, or “hard”, error is present, such that any data stored to that memory location is likely to be corrupted. In this latter case, it is desirable to discontinue use of the failing memory location.
Soft errors may generally be distinguished from hard errors by running simple memory tests. However, it is impractical to halt normal system operations and initiate memory tests following every parity error. Therefore, in some prior art systems, any parity error is handled by “degrading” the failing memory location: the location is marked as unusable and is no longer employed to store data signals. Some systems degrade an entire block of memory locations when a parity error occurs at any location within that block. Once a memory location, or an entire memory block, is degraded, it is never again used to store data, effectively reducing the size of the memory.
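The degrade-on-error policy can be modeled in a few lines. This is an illustrative sketch of the general idea only; the class name, block size, and set-based bookkeeping are assumptions, not any specific system's design:

```python
class DegradableMemory:
    """Toy model of a memory that degrades failing locations."""

    def __init__(self, size, block_size=4):
        self.size = size
        self.block_size = block_size
        self.degraded = set()          # addresses marked unusable

    def report_parity_error(self, addr, degrade_block=False):
        """Mark the failing location, or its whole block, unusable."""
        if degrade_block:
            start = (addr // self.block_size) * self.block_size
            self.degraded.update(range(start, start + self.block_size))
        else:
            self.degraded.add(addr)

    def usable_size(self):
        """Every degrade shrinks the effective memory size."""
        return self.size - len(self.degraded)
```

The model makes the trade-off visible: degrading whole blocks simplifies bookkeeping but removes capacity faster than degrading single locations.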
One mechanism for handling degraded memory involves tracking the number of degraded memory locations or blocks within a physical memory chip. If a large percentage of a particular memory chip has become degraded, the physical chip may be replaced. While this is viable for systems having discrete components, this solution is not workable in highly integrated systems.
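A chip-replacement policy of this kind amounts to comparing the degraded fraction of a chip against a threshold. The function name and the particular threshold value are hypothetical, chosen only to illustrate the mechanism:

```python
def chip_needs_replacement(degraded_blocks, total_blocks, threshold=0.25):
    """Hypothetical policy: flag a discrete memory chip for replacement
    once the fraction of degraded blocks reaches the threshold."""
    return degraded_blocks / total_blocks >= threshold
```

As the section notes, this works only when the chip is a discrete, replaceable component; the policy has no useful action to take when the memory is integrated on the same die as the processor.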
Many systems integrate small, fast cache memories with instruction processor logic on a single Application-Specific Integrated Circuit (ASIC). As memory locations or blocks are degraded, the usable size of such a cache can shrink to the point where system performance suffers. This problem cannot readily be addressed by replacing components, since doing so would require replacing both the cache and the processor logic, an option that is not economically feasible.
The above-described dilemma is becoming increasingly common as systems employ higher levels of integration, and as decreasing technology sizes result in memory devices that are even more susceptible to parity errors. What is needed, therefore, is an improved system and method for handling parity errors in a manner that addresses the foregoing challenges.