This invention relates generally to computer memory, and more particularly to providing error correction and detection in a memory system.
Contemporary high performance computing main memory systems utilize error correction codes (ECCs) to detect and correct occasional, random bit errors.
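The detect-and-correct principle can be sketched with a minimal Hamming(7,4) encoder/decoder, which corrects any single flipped bit in a 7-bit codeword. This toy code is an illustrative assumption for exposition only; contemporary main memory systems use wider SEC-DED or symbol-based codes, not Hamming(7,4).

```python
# Minimal Hamming(7,4) sketch: illustrative only. Real memory ECC uses wider
# SEC-DED or symbol-oriented codes, but the syndrome mechanism is the same.

def hamming74_encode(d):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    p1 = d[0] ^ d[1] ^ d[3]          # parity over codeword positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]          # parity over codeword positions 2, 3, 6, 7
    p3 = d[1] ^ d[2] ^ d[3]          # parity over codeword positions 4, 5, 6, 7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(c):
    """Return the 4 data bits, correcting at most one flipped codeword bit."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the erroneous bit
    if syndrome:
        c = list(c)
        c[syndrome - 1] ^= 1         # correct the single-bit error in place
    return [c[2], c[4], c[5], c[6]]

codeword = hamming74_encode([1, 0, 1, 1])
codeword[4] ^= 1                     # inject a single random bit error
assert hamming74_decode(codeword) == [1, 0, 1, 1]
```

The key property for memory use is that correction happens inline from the syndrome, with no retransmission: but the code can only repair the (single, random) error pattern it was designed for.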
Historically, for memory devices up through double data rate three (DDR3), most memory errors in a memory subsystem or system could be classified as being related to one or more of: memory cell faults (generally affecting a single bit); memory core errors (e.g., word line or bit line errors affecting multiple bits); I/O errors (affecting one I/O of a device); “chip kills” (affecting all I/Os of a device); or other faults due to interconnect or interface device failures. Bit error rates due solely to I/O transfer failure (e.g., the inability to pass data between the memory device and the device to which the memory device is connected) were almost zero in a properly designed and tuned (e.g., proper driver strengths, terminations, wiring topologies, etc.) system.
With the increased data rates expected of emerging technologies (such as DDR4 and beyond), failure rates are expected to increase dramatically due to the inability of the memory device to accurately communicate with the device(s) (e.g., a memory interface device or “MID”) to which it is attached, given the reduced timing margins present at the high data rates. These new errors will be caused by several factors, such as clock jitter, inter-symbol interference (ISI), and cross-talk (between adjacent and otherwise nearby lines). Of those listed, clock jitter is expected to be a major contributor, accounting for almost half of the total failure rate.
The referenced clock jitter simultaneously affects a multitude of devices which may be transferring data at the same time. As such, the bit error rate (BER) of devices that are simultaneously transferring data exhibits a much higher correlation than traditional memory device faults (cell, core, etc.), which are generally random in nature. For this reason, a communication interface fault at a given point in time in a high speed memory interface implies a high probability that one or more other simultaneously switching pins also have an error at the same time. Further compounding the situation is the likelihood that the number of errors (e.g., including at least three or more independent errors) will exceed the capability of contemporary/available ECC schemes, which are developed to detect random errors that have almost zero correlation between different pins or different devices.
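The impact of this correlation can be sketched with a small Monte Carlo comparison. All parameters below (the 72-bit transfer width, the 1e-3 average BER, and the jitter-event model) are illustrative assumptions, not measured values; the point is that at the same average BER, a shared jitter event produces far more multi-bit words, which a single-error-correcting code cannot repair.

```python
import random

random.seed(0)
WIDTH = 72        # assumed bits per transfer (e.g., a 64+8 ECC word)
TRIALS = 200_000

def multi_bit_fraction(word_errors):
    """Fraction of transfers with >1 bit error (uncorrectable by a SEC code)."""
    return sum(1 for e in word_errors if e > 1) / TRIALS

# (a) Independent errors: every bit fails on its own with probability 1e-3.
independent = [sum(random.random() < 1e-3 for _ in range(WIDTH))
               for _ in range(TRIALS)]

# (b) Correlated errors: a jitter event hits 1% of transfers, and during an
#     event each pin fails with probability 0.1 -- the same 1e-3 average BER.
correlated = []
for _ in range(TRIALS):
    if random.random() < 0.01:
        correlated.append(sum(random.random() < 0.1 for _ in range(WIDTH)))
    else:
        correlated.append(0)

# At equal average BER, the correlated model yields several times more
# multi-bit (i.e., uncorrectable) words than the independent model.
assert multi_bit_fraction(correlated) > multi_bit_fraction(independent)
```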
One method of addressing this concern is through the use of cyclic redundancy check (CRC) coding; however, this solution requires additional pins (adding cost overhead) and may dramatically affect overall memory performance due to the need to initiate one or more retry operations until a successful transfer is completed. Unlike ECC, CRC provides only error detection; it does not enable real-time error correction.
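The detection-only behavior and the resulting retry cost can be sketched as follows. The link model and software retry loop are illustrative assumptions; real memory interfaces implement CRC checking and retry in hardware.

```python
import random
import zlib

random.seed(1)

def send_with_crc(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 to the payload, as the sender would."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def noisy_link(frame: bytes, flip_probability: float) -> bytes:
    """Flip one bit with the given probability to model a transfer error."""
    if random.random() < flip_probability:
        data = bytearray(frame)
        data[0] ^= 0x01
        return bytes(data)
    return frame

def receive(frame: bytes):
    """Return the payload if the CRC checks out, else None. Detection only:
    the CRC cannot say WHICH bits are wrong, so the data cannot be repaired."""
    payload, crc = frame[:-4], int.from_bytes(frame[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None

payload = b"\xde\xad\xbe\xef" * 16
retries = 0
while (received := receive(noisy_link(send_with_crc(payload), 0.5))) is None:
    retries += 1            # each retry is lost bandwidth and added latency
assert received == payload
```

Every failed check forces a full retransmission, which is the performance penalty contrasted with the inline correction that ECC provides.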
Therefore, it is highly desirable to have an ECC coding structure and method that maximizes coverage for multiple bit errors, such as those that will be present with future high speed memory device interfaces, so that performance and reliability can be maximized with minimal pin-count overhead.