The invention relates to a fault tolerant memory.
Referring to FIG. 1, to service its storage needs, a file server 2 typically requires several (e.g., 5) disk drives 4. To improve the rate at which a file server 2 stores and retrieves data from disk drives 4, quite often, the drives 4 are arranged in a data storage array so that chunks of the data are spread across all of the drives 4. Thus, when organized as an array, the drives 4 effectively become one logical disk drive. In this manner, all reads and writes of data simultaneously use all of the drives 4. As a result, data transfer time between the server 2 and the disk drives 4 is reduced.
Unfortunately, the additional performance gained from the array of drives 4 does not come without a cost. Each additional drive added to the server 2 increases the likelihood that one of the drives 4 will fail and data will be lost. To improve the data storage integrity of the drives 4, data is quite often stored on the drives 4 in a fault tolerant scheme so that data from a failed drive can be recovered by using data from the other drives of the array.
Several common fault tolerant schemes that are used with drive arrays fall into a general drive array organizational scheme known as a Redundant Array of Inexpensive Disks (RAID). A particular drive array might be organized under one of several RAID types, or RAID levels. In a level one RAID array, data is mirrored, or shadowed, in redundant drives. Thus, for example, to store data in the array, the data is stored across two of the drives 4, and, simultaneously, a copy of this data is stored across two other drives 4 that serve as mirror drives. Thus, the data on the mirror drives duplicates the data on the main drives, and if one of the drives should fail, the data on the failed drive can be recovered from the duplicate drive.
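By way of illustration only, the mirroring behavior described above may be sketched as follows. The sketch assumes a single main store and a single mirror store modeled as Python dictionaries; the class name MirroredPair and its methods are hypothetical and are not taken from any particular RAID implementation.

```python
class MirroredPair:
    """Toy model of level one RAID: every write goes to a main store and a mirror."""

    def __init__(self):
        self.main = {}    # stands in for the main drive(s)
        self.mirror = {}  # stands in for the mirror drive(s)

    def write(self, address, data):
        # The same data is stored in both places at write time.
        self.main[address] = data
        self.mirror[address] = data

    def fail_main(self):
        # Simulate loss of the main store.
        self.main = None

    def read(self, address):
        # Fall back to the duplicate when the main store has failed.
        source = self.main if self.main is not None else self.mirror
        return source[address]


pair = MirroredPair()
pair.write(0, b"payload")
pair.fail_main()
recovered = pair.read(0)  # served from the mirror copy
```

The cost of this scheme is capacity: every stored byte occupies space on both the main and the mirror store.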
A drive array may use a parity-based RAID scheme. In this scheme, the data is distributed in chunks across the array along with a parity chunk that is generated from the data. This parity chunk is used to recover data should a drive of the array fail.
For example, referring to FIG. 2, in a level four RAID array, data is distributed in a stripe 6 across the array. In this manner, each stripe 6 is subdivided into five chunks, or blocks 7, and each block 7 is stored on one or more drives 4 of the array. Thus, a level four RAID array requires at least five disk drives 4. One of the blocks 7 (e.g., block 7a) is a parity block. This parity block is essentially the additional, redundant data used as an aid in recovering data from a failed drive 4. In a level four RAID array, the parity block 7 is stored on one or more parity dedicated drives 4 of the array. In the most basic level four RAID scheme, each of the different blocks 7 is stored on one of five of the different drives 4.
The parity data stored in the parity block 7a is derived from the other blocks 7b, 7c, 7d and 7e. The parity block 7a typically is formed by performing an exclusive OR (XOR) of the blocks 7b, 7c, 7d and 7e. As a result of this scheme, if any one of the blocks 7a, 7b, 7c, 7d and 7e is lost due to a failed drive, the block may be recovered by XORing the remaining blocks together.
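The parity generation and recovery just described may be sketched as follows. This is a minimal illustration under stated assumptions: the four-byte block contents are invented sample values, and the helper name xor_blocks is hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Invented sample contents standing in for data blocks 7b, 7c, 7d and 7e.
data_blocks = [
    b"\x01\x02\x03\x04",
    b"\x10\x20\x30\x40",
    b"\xaa\xbb\xcc\xdd",
    b"\x0f\x0e\x0d\x0c",
]

# Plays the role of parity block 7a.
parity = xor_blocks(data_blocks)

# Simulate the loss of one data block and rebuild it from the survivors.
lost_index = 2
survivors = [b for i, b in enumerate(data_blocks) if i != lost_index] + [parity]
recovered = xor_blocks(survivors)
```

Recovery works because XORing a value with itself yields zero, so XORing the parity block with the surviving data blocks cancels every term except the missing one.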
The server 2 must update the parity block 7a each time the server 2 updates a block 7 of the stripe 6. However, the server 2 does not have to read all five blocks 7 (i.e., one stripe 6) from the array to recalculate the parity block 7a each time one of the blocks 7 is updated. Instead, to write one block to the array, the server 2 only needs to read the block 7 to be updated (i.e., the old data block) and read the parity block 7a (i.e., the old parity block). Due to the nature of the XOR arithmetic, the new parity block 7a is calculated by XORing the old parity block with both the old data block and the new block of data to be written to the array.
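The shortcut described above may be illustrated with a brief sketch that compares the incrementally updated parity against a parity recomputed from the whole stripe. The two-byte block contents are invented sample values, and the function names are hypothetical.

```python
def xor_pair(a, b):
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def updated_parity(old_parity, old_data, new_data):
    """New parity from only the old parity, the old data block and the new data block."""
    return xor_pair(xor_pair(old_parity, old_data), new_data)


def full_parity(blocks):
    """Parity recomputed from every data block of the stripe (the slower path)."""
    parity = bytes(len(blocks[0]))  # all-zero starting value
    for block in blocks:
        parity = xor_pair(parity, block)
    return parity


# Invented sample data blocks for one stripe.
stripe = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb", b"\x0f\x0e"]
old_parity = full_parity(stripe)

# Overwrite the second data block using only two reads: old data and old parity.
new_block = b"\x55\x66"
new_parity = updated_parity(old_parity, stripe[1], new_block)
stripe[1] = new_block
```

XORing the old data block into the old parity removes that block's contribution, and XORing in the new block adds the replacement, so the other data blocks never need to be read.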
Other RAID levels are also commonly used. These levels include level five RAID which is similar to level four RAID, except that the parity blocks are distributed across all five drives, instead of just being located on one dedicated drive.
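The difference in parity placement between the two levels may be sketched as follows. The sketch assumes a five-drive array, assumes that drive index 0 is the dedicated parity drive for level four, and assumes a simple round-robin rotation for level five; actual level five layouts vary in rotation order, and the function names are hypothetical.

```python
NUM_DRIVES = 5  # assumed array size, matching the five-drive example


def parity_drive_level4(stripe_index):
    """Level four: parity for every stripe sits on one dedicated drive (assumed index 0)."""
    return 0


def parity_drive_level5(stripe_index):
    """Level five: parity location rotates from stripe to stripe (assumed round-robin)."""
    return stripe_index % NUM_DRIVES


level4_layout = [parity_drive_level4(s) for s in range(6)]
level5_layout = [parity_drive_level5(s) for s in range(6)]
```

Under these assumptions, parity writes for successive stripes land on different drives in the level five case rather than all landing on the same drive.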
In general, in one aspect, the invention features a computer system that includes memory units, a central processing unit and a memory controller. The memory controller is configured to access the memory units in response to interaction with the central processing unit and define a fault tolerant memory array with the memory units. Each memory unit stores first data represented by second data stored by the other memory units.
Advantages of the invention may include one or more of the following. Multiple memory failures are tolerated. Memory uncorrectable by ECC code can be corrected. System crashes are prevented. Memory modules may be “hot swapped.” Memory down times are reduced. Memory may be expanded or reduced while the system remains powered up.
Implementations of the invention may include one or more of the following. The first data may include a mirror image of the second data. The fault tolerant array may include a RAID-type array. The controller may also be configured to use the second data to restore the first data. This restoration may include exclusive ORing the second data together. The first and second data may include a stripe of data that is stored in all of the memory units (e.g., memory modules such as DIMMs, SIMMs or RIMMs). The stripe may have contiguous blocks of data, and the first data may include a portion of one of the blocks.
In general, in another aspect, the invention features a computer system that includes a central processing unit, memory chips and a memory controller. The central processing unit is configured to generate data. The memory controller is configured to store the data in the memory chips, define a fault tolerant array with the memory chips, and use the fault tolerant array to restore the data.
Implementations of the invention may include one or more of the following. The memory controller may also be configured to generate correction code for correcting corruption in the data, store the correction code in the memory chips, use the correction code to correct corruption in the data when the corruption is correctable, and use the fault tolerant array to restore the data and correction code when the corruption is not correctable.
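The two-tier recovery described above may be sketched as a control flow. This is an illustration under stated assumptions only: the array is assumed to be XOR-parity based, the correction code check is replaced by a stand-in function that merely flags a known corrupted pattern, and all names (EccResult, read_with_fallback, fake_ecc_check) are hypothetical.

```python
from enum import Enum
from functools import reduce


class EccResult(Enum):
    CLEAN = 0
    CORRECTED = 1
    UNCORRECTABLE = 2


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def read_with_fallback(index, units, ecc_check):
    """Use the correction code first; rebuild from the array only if that fails."""
    status, block = ecc_check(units[index])
    if status is not EccResult.UNCORRECTABLE:
        return block
    # Assumed XOR-parity array: rebuild the unit from all of the other units.
    others = [u for i, u in enumerate(units) if i != index]
    return xor_blocks(others)


# Invented sample data units plus one parity unit.
data = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
units = data + [xor_blocks(data)]
good_copy = units[1]
units[1] = b"\xff\xff"  # simulate corruption beyond what the correction code can fix


def fake_ecc_check(block):
    # Stand-in for a real correction code decoder; flags only the simulated pattern.
    if block == b"\xff\xff":
        return EccResult.UNCORRECTABLE, block
    return EccResult.CLEAN, block


restored = read_with_fallback(1, units, fake_ecc_check)
```

The array is consulted only when the correction code cannot repair the corruption, so the cost of reading the other units is incurred only in that case.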
In general, in another aspect, the invention features a computer system that includes a central processing unit, a slot connector, memory and a memory controller. The memory includes a first memory module that is inserted in the slot connector, and the first memory module stores data. The memory controller is configured to use the memory to restore the data on a second memory module that replaces the first memory module.
Implementations of the invention may include one or more of the following. The memory may include additional memory modules, and the memory controller may also be configured to define a fault tolerant array with the memory modules. The slot connector may include a hot plug slot connector.
In general, in another aspect, the invention features a method for use with a computer system that has memory units. The method includes defining a fault tolerant memory array with the memory units. Each memory unit stores first data represented by second data stored by the other memory units.
In general, in another aspect, the invention features a method for use with a computer system that has memory chips. The method includes storing data in the memory chips, defining a fault tolerant array with the memory chips, and using the fault tolerant array to restore the data.
In general, in another aspect, the invention features a method for use with a computer system that has memory. The memory includes a first memory module that is inserted in a slot connector. The first memory module stores data. The method includes using the memory to restore the data on a second memory module that replaces the first memory module.
Other advantages and features will become apparent from the following description and from the claims.