1. Technical Field
The present invention relates in general to the field of computers, and in particular to memory systems in computers. Still more particularly, the present invention relates to a method and system for capturing and counting transient voltage excursions in a Voltage Regulator Module (VRM) power supply to identify and isolate the cause of a fatal error in memory powered by the VRM.
2. Description of the Related Art
The complexity of high-speed memory board designs means that system memory problems encountered in the field or at a customer site often cannot be resolved easily or economically on-site. A typical strategy employed by the Customer Engineer (CE) is to replace a complex and expensive Field Replaceable Unit (FRU) board in an attempt to solve the problem. The FRU board is typically a board that contains many smaller FRU components. The FRU board plugs into a backplane in the computer, and the FRU components plug into sockets on the FRU board.
Replacing the entire FRU board is usually excessive, since often only a single FRU component on the FRU board is defective. Technicians nevertheless replace the entire FRU board because field diagnostic tests are often impractical.
A common problem affecting computer systems in the field is a failure in the system memory. While hardware techniques such as the use of Error Correction Code (ECC) can correct many memory-related errors, fatal (uncorrectable) memory errors can still occur. For example, a fatal memory error can occur if voltage reflections produce poor signal quality on a line, or if there is poor mating between connectors on the FRU board, such as between a memory module and a socket.
Another hardware failure that could cause a fatal memory error is a transient Voltage Regulator Module (VRM) failure. A VRM is used to regulate the power supplied to memory modules, such as dual in-line memory modules (DIMMs), single in-line memory modules (SIMMs), and other forms of dynamic random access memory (DRAM) chips. Occasionally, a VRM will momentarily supply a voltage to the memory module(s) that is outside (higher or lower than) the range that is acceptable to the memory components. During the period that such a deviation (i.e., “excursion”) from the allowable range occurs, the operation of the memory module may be disrupted, and it may be unable to receive or send back the correct data, thus resulting in a fatal (uncorrectable) memory error.
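By way of illustration only, the notion of an excursion described above can be sketched in software as a contiguous run of voltage samples lying outside the allowable range. The function name `count_excursions`, the sample trace, and the limits of 2.375 V and 2.625 V (a hypothetical 2.5 V supply with a 5% tolerance) are assumptions chosen for the example and are not taken from any particular VRM or memory module.

```python
from typing import Iterable


def count_excursions(samples: Iterable[float], v_min: float, v_max: float) -> int:
    """Count contiguous runs of samples that fall outside [v_min, v_max].

    Consecutive out-of-range samples are treated as a single excursion.
    """
    count = 0
    in_excursion = False
    for v in samples:
        out_of_range = v < v_min or v > v_max
        # A new excursion begins when the voltage leaves the allowed range.
        if out_of_range and not in_excursion:
            count += 1
        in_excursion = out_of_range
    return count


if __name__ == "__main__":
    # Hypothetical sampled VRM output, in volts (illustrative values only).
    trace = [2.50, 2.51, 2.70, 2.72, 2.49, 2.30, 2.50]
    print(count_excursions(trace, v_min=2.375, v_max=2.625))  # prints 2
```

In this trace the two adjacent high samples form one excursion and the single low sample forms another, so the count is two.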
There are many possible reasons for such an excursion of the VRM's output voltage (or current). For example, there may be a manufacturing or design defect in the VRM. Alternatively, the VRM selected may be inadequately sized for the task, since VRM selection is typically done during the design phase, when insufficient information is available about the power requirements of as yet undetermined DIMMs. The current needed by a DIMM may also be higher than anticipated due to unforeseen applications being run, bus width, signaling technology, input capacitance of the DRAMs, etc. The failure can therefore result from a defect in the VRM itself, a mis-sizing of the VRM for present conditions, or some other condition that causes the output voltage of the VRM to stray from its desired operating value.
As described above, an unexpected momentary voltage or current excursion of the output of the VRM beyond the expected design limits can result in a failure in the memory module at some later point in time. The failure may be incorrect storage of data or garbled read data, in either case caused by the voltage excursion. However, since the excursion is transient, it is nearly impossible to identify it as the cause of the memory module's failure. Furthermore, a fatal error in the memory module(s) often results only after multiple excursions.
What is needed, therefore, is a method and system for capturing and counting transient voltage excursions from the VRM which fall outside the desired operating range of the VRM, and correlating these voltage excursions in time with when a memory error occurs, in order to provide the CE with the necessary diagnostic information about which FRU component needs to be replaced (i.e., the VRM, the DIMM, etc.) when repairing a failed memory subsystem.
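As a purely illustrative sketch of the time correlation called for above, logged excursion timestamps can be compared against the time of a memory error to find the excursions that preceded it within a chosen window. The function name `excursions_before_error`, the timestamps, and the five-second window are hypothetical values introduced for the example; they do not describe the claimed method or any specific hardware.

```python
from typing import List, Sequence


def excursions_before_error(
    excursion_times: Sequence[float],
    error_time: float,
    window_s: float,
) -> List[float]:
    """Return the excursion timestamps that occurred at or before error_time
    and no more than window_s seconds earlier."""
    return [t for t in excursion_times if 0.0 <= error_time - t <= window_s]


if __name__ == "__main__":
    # Hypothetical log of excursion timestamps, in seconds since power-on.
    logged = [1.0, 5.0, 9.5, 12.0]
    # Hypothetical fatal memory error reported at t = 10.0 s.
    related = excursions_before_error(logged, error_time=10.0, window_s=5.0)
    print(related)       # prints [5.0, 9.5]
    print(len(related))  # prints 2
```

A non-empty result would suggest, without proving, that the VRM rather than the memory module is the component to examine first; an empty result would point away from the VRM.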