This invention relates generally to computer memory systems, and more particularly to detection of a failing bus lane using syndrome analysis.
Contemporary high performance computing main memory systems are generally composed of one or more dynamic random access memory (DRAM) devices, which are connected to one or more processors via one or more memory control elements. Overall computer system performance is affected by each of the key elements of the computer structure, including the performance/structure of the processor(s), any memory cache(s), the input/output (I/O) subsystem(s), the efficiency of the memory control function(s), the main memory device(s), and the type and structure of the memory interconnect interface(s).
Extensive research and development efforts are invested by the industry, on an ongoing basis, to create improved and/or innovative solutions for maximizing overall system performance and density by improving the memory system/subsystem design and/or structure. High-availability systems present further challenges with respect to overall system reliability, because customers expect new computer systems to markedly surpass existing systems in mean-time-between-failure (MTBF) while also offering more functions, increased performance, increased storage, lower operating costs, etc. Other frequent customer requirements further exacerbate the memory system design challenges, and include such items as ease of upgrade and reduced system environmental impact (such as space, power and cooling).
One approach to locating a failing lane in a bus, such as a memory system bus, is to use an error correcting code (ECC). An ECC can detect and correct a number of failing bits, but requires more redundant bits than an error detection code. An error detection code, by contrast, can typically detect an error but cannot fully resolve the physical nature of the error; for example, it may not be able to identify the failing lane for every possible error pattern on that lane. Therefore, an error detection code alone may not accurately isolate errors to specific failing lanes. Another approach to detecting a failing lane is lane shadowing, where a copy of data is sent on spare lanes. However, lane shadowing operates on only a subset of lanes at any point in time and can miss error events that occur outside of the analysis window for a given failing lane.
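The limitation of an error detection code noted above can be illustrated with a minimal sketch. It assumes the simplest possible detection code, one even-parity bit per transfer beat computed across all lanes, which is chosen here only for illustration and is not a scheme described in this document. The 4-lane, 3-beat frame and the helper names parity_bits, syndrome, and flip are likewise hypothetical.

```python
from typing import List

# Hypothetical frame layout: frame[beat][lane] holds one bit.
Frame = List[List[int]]


def parity_bits(frame: Frame) -> List[int]:
    """Return one even-parity bit per beat, computed across all lanes."""
    return [sum(beat) % 2 for beat in frame]


def syndrome(received: Frame, sent_parity: List[int]) -> List[int]:
    """XOR the recomputed parity with the transmitted parity, beat by beat.

    Any nonzero entry means an error was detected in that beat.
    """
    return [p ^ q for p, q in zip(parity_bits(received), sent_parity)]


def flip(frame: Frame, beat: int, lane: int) -> Frame:
    """Return a copy of the frame with a single bit inverted."""
    copy = [row[:] for row in frame]
    copy[beat][lane] ^= 1
    return copy


# Illustrative data: 3 beats on a 4-lane bus.
frame = [
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
]
sent_parity = parity_bits(frame)

syndrome_clean = syndrome(frame, sent_parity)
syndrome_lane0 = syndrome(flip(frame, 1, 0), sent_parity)  # fault on lane 0
syndrome_lane3 = syndrome(flip(frame, 1, 3), sent_parity)  # fault on lane 3

print(syndrome_clean, syndrome_lane0, syndrome_lane3)
```

In this sketch, a fault on lane 0 and a fault on lane 3 in the same beat produce the same nonzero syndrome. The error is detected, but the syndrome alone does not say which lane failed. Stronger detection codes narrow this ambiguity without necessarily eliminating it for every error pattern.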