Field of the Invention
The present invention is generally directed to collecting information relating to errors encountered when reading data from memory. More specifically, the present invention collects information that may be used to improve the operation of memory in computers.
Description of the Related Art
The advent of double data rate fourth generation (DDR4) memory technology, a follow-on to DDR3 technology, brings increased memory capacity and memory speed, while also lowering the operating voltage and, most importantly, shrinking the integrated circuit die size. The pressure to operate circuits under such demanding conditions has resulted in a notable increase in memory errors for memory-intensive workloads. While these errors are most frequently correctable, they may be either temporary (transient) or persistent. The current practice for managing a large installed base of dual in-line memory modules (DIMMs), which can reach into the many thousands in some data centers, is to analyze the details of corrected memory errors. Although sophisticated techniques are employed to determine whether a dynamic random-access memory (DRAM) within a DIMM is experiencing a significant failure mode, the high incidence of transient failures (memory errors that are spurious, somewhat random, and not specifically repeatable) results in a large number of predictive failure events, which in turn results in a large number of DIMM replacements. In many instances, however, erring memories do not exhibit any persistent failures that might indicate that a portion of a transiently erring memory is permanently damaged, unusable, or unreliable.
As future generations of memory increase the amount of memory included within a single memory integrated circuit, and as the dimensions of memory cells within a single memory integrated circuit shrink, computers using these memories will encounter an increased number of memory read-back errors per unit time; that is, as memory geometries shrink, memory error rates are expected to increase. Increased error rates may be associated with one or more types of events, such as defects in a memory cell, defects in wires connecting memory cells, cosmic rays hitting a memory cell, and radioactive particles impacting memory cells. In certain instances, transient errors may be caused by energy from one cell or row of cells leaking to adjacent cells or rows of cells, or they may be caused by cosmic rays or radiation impacting memory cells. This is exacerbated by a continual drive to reduce memory cell size and circuit size.
Some systems currently use error correcting memories that maintain error lists. These error lists store memory error information in small tables that record which DRAM memory locations have experienced an error at some point in time. In certain instances, error correcting memories also have the capability of reporting memory error correction events to a processor or to digital logic. Currently available error correcting memories, however, do not have the ability to distinguish between persistent memory errors and transient memory errors. This is at least because the error correcting memories are not designed to identify whether an error is transient or persistent. This is also because the tabulated data currently stored to track errors in memories does not include information relating to how frequently particular memory cells err. Instead, this currently available tabulated data is used to re-organize the memory such that erring memory locations are avoided. As such, currently available methods commonly re-organize memory to avoid using erring memory locations that are in fact still good, because memory cells that incur a transient error are frequently still functional. What is needed are systems and methods for identifying persistent memory errors as one type of memory error and transient memory errors as a different type of memory error, where only persistent errors cause portions of memory not to be used.
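By way of illustration only, the distinction drawn above can be sketched as an error table that records how often each memory location errs, rather than merely whether it has erred. The following is a minimal sketch under stated assumptions, not a description of any particular implementation: the class name ErrorTable, its method names, and the threshold of three repeated errors are hypothetical and do not come from the text above.

```python
from collections import Counter

# Hypothetical threshold: the text above does not specify how many repeated
# errors at one location should mark that location as persistently failing.
PERSISTENT_THRESHOLD = 3


class ErrorTable:
    """Illustrative table counting corrected errors per memory location."""

    def __init__(self, threshold=PERSISTENT_THRESHOLD):
        self.threshold = threshold
        self.counts = Counter()  # memory address -> number of recorded errors

    def record(self, address):
        """Record one corrected error observed at the given address."""
        self.counts[address] += 1

    def classify(self, address):
        """Classify a location by how often it has erred."""
        n = self.counts[address]  # Counter returns 0 for unseen addresses
        if n == 0:
            return "no_error"
        return "persistent" if n >= self.threshold else "transient"

    def locations_to_avoid(self):
        """Return only locations classified as persistent, in sorted order."""
        return sorted(a for a, n in self.counts.items() if n >= self.threshold)
```

In this sketch, a location that errs once remains in use, while only a location that errs repeatedly is reported by locations_to_avoid; a table that records only whether an error has ever occurred would have no basis for treating the two differently.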