The applications entitled "Self-Healing Memory System for High Availability Server", identified by HP Docket Number 10991629 and the inventors Michael B. Raynham and James G. Mathios, filed Oct. 28, 1999, and "Radial Arm Memory Bus for a High Availability Computer System", identified by HP Docket Number 10991678 and the inventors Michael B. Raynham and Hans Wiggers, filed Oct. 28, 1999, include subject matter related to the copending application.
Available memory systems are constantly expanding in size, with current server memory systems often being in the range of up to 64 Gbytes (approximately half a trillion storage bits) or larger for high end servers. Depending on requirements, a customer may choose to purchase a low end or a high end server. Current low end server systems typically include one to four dual in line memory modules (DIMMs), while high end servers typically include four or more DIMMs. Typically, high end servers also include high availability features such as memory chip redundancy, hot swapping, and the ability to do chipkill error correction.
FIG. 1A shows a side view of a memory system for a low end server system 100. The memory system includes a CPU or memory controller 102 affixed to a motherboard 106 and two dual in line memory modules 108a and 108b. The two memory modules 108a-b shown each include N memory devices 112 connected in parallel. Assuming for purposes of discussion that N is equal to eighteen, the eighteen memory devices 112a-N on each memory module 108a and 108b are connected to the memory controller 102 by a data bus 114, which includes board trace portions 116, connectors 118 and module trace portions 120.
FIG. 1B shows a block diagram of the memory structure of the memory modules of the low end server shown in FIG. 1A. In the embodiment shown in FIG. 1B, the data bus is 72 bits wide, where 64 bits are used for data and 8 bits are used for error correction. Each of the eighteen memory devices on the memory module 108a-b is 4 bits wide, and for a 256 Mbyte system each of the eighteen ×4 SDRAMs is 32 Mbyte. The eighteen ×4 memory devices are connected in parallel so that for each memory operation, the output onto the data bus 114 is 72 bits wide.
FIG. 1C shows a clock pulse for reading or writing to a memory location of the low end server shown in FIG. 1A. The memory controller reads a single word or memory location from a single memory module at a time. Assuming a single data rate (SDR) system and a read operation, the memory location in memory module 108 having the address 000000 is read at the clock edge t1. The contents of the memory location are 72 bits wide. No memory operation occurs at clock edge t2. A second memory location having the address 000001 in memory module 108 is read at the clock edge t3.
FIG. 2A shows a side view of a memory system for a conventional high end server system 200 having eight DIMM modules. Similar to the low end server configuration shown in FIG. 1A, the memory system shown in FIG. 2A includes a CPU or memory controller 202 affixed to a motherboard 206. However, the high end configuration includes eight dual in line memory modules 208a-h instead of the two DIMMs 108a-b shown in FIG. 1A. The eight memory modules 208a-h shown each include N memory devices 212. The memory controller 202 is connected to the eight memory modules 208a-h by a data bus 214, which includes board trace portions 216, connectors 218 and module trace portions 220.
FIG. 2B shows a block diagram of the memory structure of the high end server shown in FIG. 2A. In the high end server shown, the data bus is 144 bits wide, where 128 bits are used for data and 16 bits are used for error correction. Preferably each memory module includes eighteen memory devices (N=18), each memory device being 4 bits wide. For each memory module, the eighteen ×4 memory devices are connected in parallel. Data is read from two memory modules simultaneously, so that for each memory operation, the output onto the data bus 214 is 144 bits wide.
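The bus widths quoted for the two configurations follow from simple multiplication. The short sketch below reproduces that arithmetic; the function and variable names are illustrative only and do not come from the described systems.

```python
def bus_width_bits(modules_in_parallel: int, devices_per_module: int,
                   device_width_bits: int) -> int:
    """Width of the data bus when the given modules are accessed together."""
    return modules_in_parallel * devices_per_module * device_width_bits


def ecc_devices_per_module(ecc_bits_per_module: int, device_width_bits: int) -> int:
    """Number of devices on one module whose bits hold error correction data."""
    return ecc_bits_per_module // device_width_bits


# Low end server (FIG. 1B): one module of eighteen x4 devices per operation.
low_end_width = bus_width_bits(1, 18, 4)    # 72 bits = 64 data + 8 ECC

# High end server (FIG. 2B): two modules of eighteen x4 devices per operation.
high_end_width = bus_width_bits(2, 18, 4)   # 144 bits = 128 data + 16 ECC

# With 8 ECC bits per module and x4 devices, two of the eighteen devices carry ECC.
ecc_devices = ecc_devices_per_module(8, 4)  # 2
```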
FIG. 2C shows a clock pulse for reading or writing to a memory location of the high end server shown in FIG. 2A. The memory controller reads a single word or memory location from a single memory module at a time. Assuming a double data rate (DDR) system and a memory read operation, the memory location in memory module 208a having the address 000000 and the memory location in memory module 208e having the address 000000 are both read simultaneously at the clock edge t1. A second memory location in memory module 208a having the address 000001 and a second memory location having the address 000001 in memory module 208d are both read simultaneously at the clock edge t2.
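The difference between the two timing diagrams can be summarized as which clock edges carry a memory operation. The following is a minimal sketch under that reading of FIG. 1C and FIG. 2C; the function name and the numbering of edges as integers (1 for t1, 2 for t2, and so on) are assumptions made for illustration.

```python
def read_edges(num_reads: int, data_rate: str) -> list[int]:
    """Return the clock edge index used by each consecutive read.

    "SDR": one read on every other edge (t1, t3, ...), as in FIG. 1C.
    "DDR": one read on every edge (t1, t2, ...), as in FIG. 2C.
    """
    if data_rate == "SDR":
        step = 2
    elif data_rate == "DDR":
        step = 1
    else:
        raise ValueError("data_rate must be 'SDR' or 'DDR'")
    return [1 + i * step for i in range(num_reads)]
```

For two consecutive addresses, `read_edges(2, "SDR")` gives edges t1 and t3, while `read_edges(2, "DDR")` gives edges t1 and t2.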
The current trend of increasing memory size is likely to continue. Microprocessor suppliers continue to supply higher speed CPUs. With increases in CPU speed come increased speed in the CPU bus and supporting I/O systems and a corresponding increase in server memory size per CPU since more users per CPU can be supported. As the size of memory systems increases, the probability of a memory bit failing, and thus the memory system failing, increases. Customers are demanding improved error correction features to deal with these increases in memory failures even for low end systems.
One error correction feature that was traditionally not supported in low end servers is what is known in the industry as chipkill. The term chipkill traditionally refers to the ability to correct multiple bit errors in memory, where the multiple bit error is the width of the memory device. For example, for a 32 Mbit SDRAM that is 4 bits wide, a system that supports the chipkill function would be able to correct a 4 bit wide error in the memory device. Thus, the failure of an entire SDRAM chip organized in a ×4 configuration in a system that supports chipkill would not cause the system to fail.
Chipkill is provided in high end chipsets, for example, by combining two DIMMs into a 144 bit bus that includes 128 data bits and 16 ECC bits, where ECC stands for error correcting or error checking and correcting codes. The number of bits that can be corrected typically depends on the number of ECC bits supported by the system. ECC refers to a commonly used error detection and correction process that is typically based on a CRC (cyclic redundancy code) algorithm. CRC algorithms work so that when data is received, the complete data sequence (which includes CRC bits appended to the end of the data field) is read by a CRC checker. The complete data sequence should be exactly divisible by a CRC polynomial. If the complete data sequence is not divisible by the CRC polynomial, an error is deemed to have occurred.
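The divisibility check described above can be illustrated with modulo-2 polynomial division. The sketch below covers only the detection step (a nonzero remainder signals an error), not correction. The 4-bit generator polynomial x^3 + x + 1 and all function names are assumptions chosen for brevity; they are not the code used by any chipset discussed here.

```python
def mod2_remainder(value: int, poly: int) -> int:
    """Remainder of value divided by poly, using modulo-2 (XOR) arithmetic."""
    poly_len = poly.bit_length()
    while value.bit_length() >= poly_len:
        # Align the polynomial with the leading 1 of value and subtract (XOR).
        value ^= poly << (value.bit_length() - poly_len)
    return value


def append_crc(data: int, poly: int) -> int:
    """Append CRC bits to data so the result is exactly divisible by poly."""
    crc_len = poly.bit_length() - 1
    shifted = data << crc_len
    return shifted | mod2_remainder(shifted, poly)


def is_error_free(sequence: int, poly: int) -> bool:
    """The CRC checker: the complete sequence must divide evenly by poly."""
    return mod2_remainder(sequence, poly) == 0


POLY = 0b1011                       # illustrative generator: x^3 + x + 1
sent = append_crc(0b110101, POLY)   # data field followed by 3 CRC bits
received_bad = sent ^ (1 << 4)      # the same sequence with one bit flipped
```

Here `is_error_free(sent, POLY)` holds, while `is_error_free(received_bad, POLY)` does not, so the flipped bit is detected.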
Supporting the chipkill function based on an ECC process typically requires additional error correction bits, where the number of bits corrected depends on the number of ECC bits supported by the system. For example, typically the CRC algorithm used to correct for a 4 bit wide memory organization requires more than the eight error correction bits that are provided by the low end server shown in FIG. 1A. Thus, to perform the chipkill function for a ×4 organization, the low end server would require additional memory modules or devices to provide the additional ECC bits necessary to perform the required CRC algorithms. However, because the CRC algorithm typically used to correct for a 4 bit wide memory organization does not require more than the 16 error correction bits supported by the high end server shown in FIG. 2A, the high end server could support the chipkill function.
An alternative implementation available to low end server systems that wish to provide chipkill error correction is to provide a custom ASIC, such as that currently made commercially available from IBM Corporation, that provides specialized logic. The custom ASIC uses a specialized controller and a memory wider than 72 bits (i.e., 80 bits, to provide 16 ECC bits), and uses fast SDRAM. This is twice the amount of ECC memory normally required, which limits the maximum memory size, since additional memory is required for ECC. Although the custom ASIC provides chipkill functionality, it is comparatively expensive, and it would be preferable to use standard components if available. Further, currently available ASICs have limited memory sizes available and may not work at the 133 MHz speed required for next generation bus systems.
What is needed is a memory system configuration and process that provides chipkill error detection in a low end server system without requiring costly custom ASIC chips or additional memory modules or devices to provide the additional ECC bits necessary to perform the required CRC algorithms.
The present invention provides a process and memory configuration for providing chipkill error detection in a low end server without requiring non-standard components or additional memory devices. Typically for semiconductor memory devices, error detection and correction is performed using ECC error correction. To perform ECC correction, a predetermined number of bits in a specified memory device must be available for error correction. Typically, the number of error correction bits required for chipkill functionality is dependent on the width of the memory chip to be corrected and the CRC algorithm that is used. If the required number of error correction bits is not provided by the memory system, another option according to the present invention is to perform doubleword operations for consecutive memory locations in a single DIMM module. Reading the memory module in double words provides the system with double its previous number of available ECC bits per data transfer. Since low end systems would typically not have the number of error correction bits necessary to do the chipkill operation, this double word technique allows chipkill to be supported in low end systems which would normally not be able to perform it.
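The doubling of ECC bits per transfer can be seen by treating two consecutive 72-bit words as one unit. The sketch below is illustrative only: it assumes, hypothetically, that the low 8 bits of each 72-bit word hold the ECC bits and the upper 64 bits hold data. The actual bit layout is not specified here, and the names are not taken from the described system.

```python
DATA_BITS = 64   # data bits per memory location on the 72-bit bus
ECC_BITS = 8     # error correction bits per memory location


def split_word(word: int) -> tuple[int, int]:
    """Split one 72-bit word into (data, ecc) under the assumed layout."""
    return word >> ECC_BITS, word & ((1 << ECC_BITS) - 1)


def combine_doubleword(word0: int, word1: int) -> tuple[int, int]:
    """Combine two consecutive words into 128 data bits and 16 ECC bits."""
    data0, ecc0 = split_word(word0)
    data1, ecc1 = split_word(word1)
    data = (data1 << DATA_BITS) | data0
    ecc = (ecc1 << ECC_BITS) | ecc0
    return data, ecc


# Two consecutive locations from the same module (example values only).
word_a = (0x1111111111111111 << ECC_BITS) | 0xAA
word_b = (0x2222222222222222 << ECC_BITS) | 0xBB
data, ecc = combine_doubleword(word_a, word_b)
```

Each doubleword transfer thus carries 2 × 8 = 16 ECC bits alongside 2 × 64 = 128 data bits, the same split as the 144-bit bus of the high end server.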
Using doubleword reads and/or writes in combination with the described memory configuration allows the chipkill operation to be performed with minimal system performance impact and without increasing system costs for cost sensitive low end server or workstation applications. The implementation according to the present invention does not require a custom memory module and is not limited to certain memory module sizes, as are today's commercially available custom ASICs.
For server applications, the double word implementation uses the extra DDR bandwidth that would otherwise be wasted to "hide" the overhead for the chipkill operation. When used in low end server applications, the performance penalty of reading double words (16 B data, 2 B ECC) is minimized because the bandwidth of the memory system is two times the system bus bandwidth. In a PC or workstation, this bandwidth is used for non-cache coherent AGP accesses to memory, which are not applicable in a server. Therefore the overhead of the read-modify-write (R-M-W) used in byte write operations is hidden, and bandwidth that would otherwise be wasted is used. Further, the present invention can provide the same chipkill functionality in a PC chipset that it provides in a server system, with minor memory controller changes and without requiring additional or custom parts.
In determining whether to implement the doubleword read, the user implementing the system must first determine whether or not chipkill is a desired function. If the chipkill function is desired, there should be a determination of whether the number of error correction bits needed to support the target memory width of the chip is already provided by the error correction bits in a single memory address. If the number of error correction bits required is already provided, there is no need to perform a doubleword operation. Otherwise, the memory locations in a single memory module can be read or written in double words. Reading the memory module in double words provides the system with double its previous number of ECC bits per data transfer. This increased number of error correction bits may be adequate to perform a chipkill operation for the targeted chip width.
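The determination described above can be sketched as a small decision function. This is a hedged illustration, not a prescribed implementation: the function name, the returned labels, and the idea of passing the required ECC bit count as a parameter are assumptions. The required count depends on the chip width and the algorithm used, and the value 16 in the usage note is only an example consistent with the 16-bit figure discussed for a 4 bit wide organization.

```python
def choose_access_mode(chipkill_desired: bool,
                       ecc_bits_per_address: int,
                       ecc_bits_required: int) -> str:
    """Decide whether doubleword operations are needed for chipkill.

    Returns "single word" when no doubleword operation is needed,
    "doubleword" when doubling the ECC bits per transfer is sufficient,
    and "insufficient" when even a doubleword does not supply enough bits.
    """
    if not chipkill_desired:
        return "single word"
    if ecc_bits_per_address >= ecc_bits_required:
        # A single memory address already provides enough ECC bits.
        return "single word"
    if 2 * ecc_bits_per_address >= ecc_bits_required:
        # Two consecutive locations double the ECC bits per data transfer.
        return "doubleword"
    return "insufficient"
```

For a 72-bit low end system with 8 ECC bits per address and an assumed requirement of 16 bits, `choose_access_mode(True, 8, 16)` selects doubleword operation, while a system that already provides 16 ECC bits per address stays with single word access.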