A typical computer system includes a processor for executing software instructions to perform desired calculations or tasks, such as word processing, data entry and management via spreadsheets or a database, Internet access, and so on. The processor communicates with a variety of other components in the computer system to control the overall operation of the system. More specifically, the processor communicates with these other components through a chipset, which is a group of integrated circuits typically including several controllers that collectively allow the processor to access and control components in the computer system. For example, the processor accesses a memory subsystem through controllers in the chipset, and the same is true of accessing and retrieving data from mass storage devices like hard disks. For ease of description, in the following discussion the chipset will be referred to as a single component although the chipset may include one or more controllers and possibly other integrated circuits as well.
In a computer system, the processor stores data, along with the programming instructions of programs that are currently being executed, in dynamic random access memory (DRAM) in the memory subsystem. The term data is used herein to include any type of information stored in memory, and thus includes data being operated on or generated by a program as well as programming instructions. Any errors in data written into DRAM or read from DRAM may adversely affect the operation of the computer system or the results generated by a particular program being executed. Such memory errors are particularly problematic in computer systems that must be available around the clock, such as a Web server that delivers Web pages to client computer systems requesting such pages. A Web server must be operational around the clock in a commercial environment where the server contains Web pages that collectively form a company's commercial Web site. The Web site may be accessed by customers at any time of day, and thus any memory errors that necessitate making the server, and thus the Web site, unavailable may be extremely costly. Such errors must be quickly diagnosed to minimize server and site downtime, and should be avoided altogether if possible.
The chipset typically includes error correcting code (ECC) circuitry that detects and corrects certain types of memory errors. When the chipset detects any such memory errors, the chipset typically stores information related to the detected error in special error registers in the chipset. For example, the Hewlett-Packard zx1 chipset stores memory error data in three registers: 1) an address register that stores a physical memory address of the data containing the error; 2) a first syndrome register that stores error syndrome data regarding the detected error; and 3) a second syndrome register that also stores error syndrome data regarding the detected error. The physical memory address is the address supplied by the processor to the chipset to access corresponding data in the memory, and the error syndrome data indicates the location of the erroneous bit or bits in the data corresponding to the physical memory address.
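To illustrate how error syndrome data can indicate the location of an erroneous bit, the following minimal sketch uses a textbook Hamming(7,4) code. This is an assumption made purely for illustration: the actual ECC scheme and syndrome encoding of the zx1 chipset are not described here, and chipset ECC typically protects much wider data words. The function names are hypothetical.

```python
# Illustrative only: a textbook Hamming(7,4) code, NOT the zx1 ECC scheme.
# Codeword positions are numbered 1..7; parity bits sit at positions 1, 2, 4.

def hamming74_encode(data_bits):
    """Encode four data bits [d1, d2, d3, d4] into a 7-bit codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def syndrome(codeword):
    """XOR together the 1-based positions of all set bits.

    The result is 0 for a valid codeword; after a single-bit error it
    equals the 1-based position of the flipped bit.
    """
    s = 0
    for position, bit in enumerate(codeword, start=1):
        if bit:
            s ^= position
    return s


def correct_single_bit(codeword):
    """Return (corrected_codeword, error_position); position 0 means no error."""
    s = syndrome(codeword)
    fixed = list(codeword)
    if s:
        fixed[s - 1] ^= 1  # flip the bit the syndrome points at
    return fixed, s


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    damaged = list(word)
    damaged[5] ^= 1  # simulate a single-bit memory error at position 6
    repaired, where = correct_single_bit(damaged)
    print("error position:", where, "repaired:", repaired == word)
```

In this simplified code the syndrome directly names the failing bit position. A real chipset's syndrome registers would need to be interpreted according to that chipset's own ECC definition.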
An error diagnostic program running on the server typically retrieves the error data stored in these registers in the chipset and reports the detected errors to an administrator responsible for maintaining the server. The administrator then typically manually decodes the detected memory errors utilizing the error data to determine the precise location of the error within the memory. The administrator must do this because the physical memory address stored in the address register does not indicate the specific defective component in the memory, but merely an address that corresponds to some unknown physical components. More specifically, the memory subsystem includes a plurality of DRAM memory modules each including a plurality of individual DRAM devices. To access data in the memory, the chipset must map or translate the physical memory address from the processor into a memory bus address understood by the DRAM devices. Configuration registers in the chipset contain information regarding the specific types of DRAM modules and devices coupled to the chipset, and the chipset utilizes this configuration data to translate the physical memory addresses into corresponding memory bus addresses. The memory bus address includes rank, bank, row, and column components generated by the chipset in response to the applied physical memory address. A rank corresponds to DRAM devices coupled to a common chip select signal, where the chip select signal is a signal that must be activated to access the device. The bank, row, and column components correspond to particular data within each DRAM device in the rank being accessed, with the data being organized in individual banks containing a plurality of rows and columns of memory cells that physically store the data.
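The translation from a physical memory address to rank, bank, row, and column components can be sketched as a sequence of bit-field extractions. The sketch below assumes a simple, non-interleaved layout in which the fields are packed from least significant to most significant as byte offset, column, bank, row, and rank. The field widths, the field ordering, and the names `MemoryConfig`, `BusAddress`, and `translate` are all hypothetical; an actual chipset derives its mapping from its configuration registers and may interleave or hash address bits differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryConfig:
    """Hypothetical stand-in for the chipset's configuration register data."""
    offset_bits: int = 3    # byte offset within one memory bus transfer
    column_bits: int = 10
    bank_bits: int = 2
    row_bits: int = 13
    rank_bits: int = 2


@dataclass(frozen=True)
class BusAddress:
    rank: int
    bank: int
    row: int
    column: int


def translate(phys_addr: int, cfg: MemoryConfig) -> BusAddress:
    """Split a physical address into rank/bank/row/column fields.

    Assumes fields are packed low-to-high as:
    offset | column | bank | row | rank (an illustrative layout only).
    """
    if phys_addr < 0:
        raise ValueError("physical address must be non-negative")

    remaining = phys_addr >> cfg.offset_bits  # discard the byte offset

    def take(width: int) -> int:
        """Pop the lowest `width` bits from the remaining address."""
        nonlocal remaining
        value = remaining & ((1 << width) - 1)
        remaining >>= width
        return value

    column = take(cfg.column_bits)
    bank = take(cfg.bank_bits)
    row = take(cfg.row_bits)
    rank = take(cfg.rank_bits)

    if remaining:
        raise ValueError("address exceeds the configured memory size")
    return BusAddress(rank=rank, bank=bank, row=row, column=column)


if __name__ == "__main__":
    print(translate(0x12345678, MemoryConfig()))
```

Because the widths live in a configuration object rather than in the decoding logic, a change in the installed memory modules would only require new configuration values under this assumed layout.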
Although the chipset must translate a physical memory address to a memory bus address to access data in the DRAM memory, in many chipsets, like the Hewlett-Packard zx1 chipset, only the physical memory address is stored in the error registers. As a result, the system administrator must manually translate the physical memory address into the memory bus address and analyze the error syndrome data to identify the precise module, device, and memory cells within the device that correspond to the erroneous data. This is a time-consuming and thus expensive process. While the administrator could write a custom program to automatically perform this translation, individuals other than the administrator who wrote the program may not know about the program, or may not know how to use it due to a lack of documentation. Moreover, if the configuration of the memory changes, the custom program must also be updated to provide accurate results. Duplication of effort may also result when multiple servers contain identical memory configurations, since the administrator of each server may independently write such a program.
There is a need for a system and method for decoding detected memory errors in computer systems that eliminates the need to manually decode such errors, that is easily kept current to account for evolving memory subsystem designs, and that is accessible to numerous administrators to eliminate duplication of effort.