The present invention relates to computer systems, and more specifically, to rolling back sub-optimal changes in a computer system.
Many advanced computer systems have reliability/availability/serviceability (RAS) features that enable the continued operation of the computer system even under adverse failure conditions. One such RAS feature of memory modules, such as a dual in-line memory module (DIMM), is known as lane sparing. If boot firmware detects an error in a communication path within a DIMM, the firmware can configure the DIMM to use a spare path instead, which ensures continued reliable operation. When the firmware makes such a repair, the firmware stores the repair information in the affected DIMM's serial presence detect (SPD), which is a small non-volatile memory residing on the DIMM. On subsequent system boots, the boot firmware consults the SPD to determine which lanes have been marked as bad, so as to program the memory controller to communicate correctly with the remaining good lanes on the DIMM. Thus, the DIMM SPD acts as a cache: the diagnostics need to be performed only once, which avoids the time penalty of repeating them every time the system is booted.
If the boot firmware applies too many repairs to a given DIMM over time, the supply of repair paths will be exhausted. Once that supply is exhausted, the DIMM becomes unusable and must be replaced.
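The repair-caching and exhaustion behavior described above can be illustrated with a minimal Python sketch. It is a toy model, not firmware: the names `Dimm`, `boot`, `SparesExhausted`, and `spd_bad_lanes`, and the pool size of two spare lanes, are hypothetical and chosen only for illustration. The sketch also assumes one reading of the exhaustion condition, namely that the DIMM is treated as unusable once a fault is reported and no spare path remains.

```python
from dataclasses import dataclass, field


class SparesExhausted(Exception):
    """Raised when a lane needs repair but no spare path remains."""


@dataclass
class Dimm:
    """Toy model of a DIMM with a fixed pool of spare lanes."""
    spare_lanes: int = 2  # hypothetical pool size, for illustration only
    # Stands in for the repair record persisted in the DIMM's SPD.
    spd_bad_lanes: set = field(default_factory=set)

    def spares_left(self) -> int:
        # Assumes each recorded bad lane consumes exactly one spare.
        return self.spare_lanes - len(self.spd_bad_lanes)


def boot(dimm: Dimm, detected_faults=()) -> set:
    """Simulate one boot and return the lanes the controller must avoid.

    detected_faults stands in for the output of a diagnostics run; pass
    nothing to model a boot that relies on the SPD record alone.
    """
    for lane in detected_faults:
        if lane in dimm.spd_bad_lanes:
            continue  # already repaired on an earlier boot
        if dimm.spares_left() == 0:
            raise SparesExhausted(f"no spare path left for lane {lane}")
        dimm.spd_bad_lanes.add(lane)  # persist the repair in the SPD
    return set(dimm.spd_bad_lanes)


dimm = Dimm(spare_lanes=2)
print(boot(dimm, detected_faults=[5]))  # first boot: lane 5 is spared
print(boot(dimm))                       # later boot: SPD record is reused
```

In this model, every fault report consumes a spare whether or not the fault is real, so a diagnostics routine that reports faults too readily would drive the DIMM toward `SparesExhausted` sooner than necessary.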
Provided that every lane sparing repair is legitimate, this RAS feature can increase the mean time to failure of the memory subsystem. However, firmware is sometimes released with a sub-optimal diagnostics routine that falsely detects DIMM communication path faults and applies lane sparing when it is not needed, which shortens the useful lifespan of the DIMM. This results in expensive DIMM triage, service interruptions, physical replacements and the like. In typical systems, there is no easily accessible way to clear the repair information outside the manufacturing line. Further, that ability is generally not given to customers because they may inadvertently remove data that is needed for analysis.