1. Field of the Invention
The present invention generally relates to data processing techniques and, in particular, to a system and method for efficiently recovering lost data values based on checksum values associated with the lost data values.
2. Related Art
Large computer systems (e.g., servers) often employ a plurality of memory units to provide enough instruction and data memory for various applications. Each memory unit has a large number of memory locations of one or more bits where data can be stored, and each memory location is associated with and identified by a particular memory address, referred to hereafter as a "memory unit address." When an instruction that stores data is executed, a bus address defined by the instruction is used to obtain a memory unit address, which identifies the memory location where the data is actually to be stored. In this regard, a mapper is often employed that maps or translates the bus address into a memory unit address having a different value than the bus address. There are various advantages associated with utilizing bus addresses that are mapped into different memory unit addresses.
For example, many computer applications are programmed such that the bus addresses are used consecutively. In other words, one of the bus addresses is selected as the bus address to be first used to store data. When a new bus address is to be utilized for the storage of data, the new bus address is obtained by incrementing the previously used bus address.
If consecutive bus addresses are mapped to memory unit addresses in the same memory unit, then inefficiencies may occur. In this regard, a finite amount of time is required to store data in and retrieve data from a memory unit. If two consecutive data stores are directed to the same memory unit, then the second data store may have to wait until the first data store is complete. However, if the two consecutive data stores are directed to different memory units, then the second data store may commence before the first data store is complete. To minimize memory latency and maximize memory bandwidth, consecutive bus addresses should access as many memory units as possible. This can also be described as maximizing the memory interleave.
As a result, the aforementioned mapper is often designed to map the bus addresses to the memory unit addresses such that each consecutive bus address is translated into a memory unit address in a different memory unit. For example, a bus address having a first value is mapped to a memory unit address identifying a location in a first memory unit, and the bus address having the next highest value is mapped to a memory unit address identifying a location in a second memory unit. Therefore, it is likely that two consecutive data stores from a single computer application do not occur in the same memory unit. In other words, it is likely that consecutive data stores from a computer application are interleaved across the memory units.
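By way of illustration only, the interleaved mapping described above may be sketched as follows. The sketch assumes a simple modulo scheme and four memory units; the name `map_bus_address` and the constant `NUM_UNITS` are hypothetical and are not taken from any particular mapper design.

```python
from typing import Tuple

NUM_UNITS = 4  # assumption: number of memory units in the system


def map_bus_address(bus_address: int, num_units: int = NUM_UNITS) -> Tuple[int, int]:
    """Translate a bus address into (memory unit index, memory unit address).

    Assumed scheme: the low-order part of the bus address selects the
    memory unit, so consecutive bus addresses fall in different units.
    """
    unit = bus_address % num_units
    unit_address = bus_address // num_units
    return unit, unit_address


# Consecutive bus addresses 0..5 are spread across the memory units.
mapped = [map_bus_address(a) for a in range(6)]
```

Under this assumed scheme, two consecutive bus addresses never map to the same memory unit as long as more than one memory unit is present.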
Backup systems are often employed to enable the recovery of data in the event of a failure of one of the memory units. For example, U.S. Pat. No. 4,849,978, which is incorporated herein by reference, describes a checksum backup system that may be used to recover the data of a failed memory unit. To back up data stored within the memory units of a typical computer system, one of the memory units in the computer system is designated as a checksum memory unit. Each location in the checksum memory unit is correlated with locations in the other non-checksum memory units. During operation, a checksum value is maintained in each memory location of the checksum memory unit according to techniques that will be described in more detail hereinbelow. Each checksum value may be utilized to recover any of the non-checksum data values stored in any of the memory locations correlated with the checksum memory location that is storing the checksum value. The checksum value stored in a checksum memory location and each of the non-checksum values stored in a location correlated with the checksum memory location shall be collectively referred to herein as a "checksum set."
Each location in the checksum memory unit is initialized to zero. Each data value being stored in a location of one of the non-checksum memory units is exclusively ored with the data value previously stored in the location of the one non-checksum memory unit. In other words, the data value being stored via a data store operation is exclusively ored with the data value being overwritten via the same data store operation. The result of the exclusive or operation is then exclusively ored with the value, referred to as the "checksum," in the correlated address of the checksum memory unit. The result of the foregoing exclusive or operation is then stored in the foregoing address of the checksum memory unit as a new checksum value.
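The checksum maintenance just described may be illustrated by the following minimal sketch. The class name `ChecksumMemory` and its layout are hypothetical. The sketch further assumes that the non-checksum locations also start at zero, so that each checksum initially equals the exclusive or of its correlated data values.

```python
class ChecksumMemory:
    """Hypothetical model: several data units plus one checksum unit."""

    def __init__(self, num_data_units: int, locations_per_unit: int) -> None:
        # Assumption: data locations start at zero, matching the zeroed checksums.
        self.data = [[0] * locations_per_unit for _ in range(num_data_units)]
        self.checksum = [0] * locations_per_unit  # initialized to zero

    def store(self, unit: int, location: int, new_value: int) -> None:
        old_value = self.data[unit][location]
        # XOR the value being stored with the value being overwritten ...
        delta = old_value ^ new_value
        # ... then XOR that result into the correlated checksum location.
        self.checksum[location] ^= delta
        self.data[unit][location] = new_value


mem = ChecksumMemory(num_data_units=3, locations_per_unit=2)
mem.store(0, 0, 0b1010)
mem.store(1, 0, 0b0110)
mem.store(0, 0, 0b0001)  # overwrites the first value stored in unit 0
```

After these three stores, the checksum at location 0 equals the exclusive or of the values currently held at location 0 of every data unit, which is the property relied upon for recovery.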
When a memory unit fails, the data value stored in a location of the failed memory unit can be recovered by exclusively oring the checksum in the correlated location of the checksum memory unit with each of the values in the other memory units that are stored in locations also correlated with the location of the checksum. The process of maintaining a checksum and of recovering a lost data value based on the checksum is generally well known in the art.
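The recovery operation may likewise be sketched as shown below. The function name `recover` and the sample values are illustrative assumptions only.

```python
from functools import reduce
from operator import xor
from typing import Iterable


def recover(checksum_value: int, surviving_values: Iterable[int]) -> int:
    """XOR the checksum with every surviving value of the checksum set."""
    return reduce(xor, surviving_values, checksum_value)


# Illustrative checksum set: three data values and their checksum.
values = [0x3C, 0xA5, 0x0F]
checksum = reduce(xor, values, 0)

# Suppose the memory unit holding values[1] fails.
recovered = recover(checksum, [values[0], values[2]])
```

In this example, `recovered` equals the lost value `0xA5`, because every surviving value cancels its own contribution to the checksum.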
Unfortunately, during a recovery of a lost data value in a checksum set, most computer systems are prevented from writing data to the memory locations storing other data values in the same checksum set. In this regard, the writing of a data value to one of the memory units may create an update to the checksum value that is being utilized to recover the lost data value, and data inconsistent with the current checksum may, therefore, be used to rebuild the lost information. Unless additional steps can be taken to ensure that such an update does not result in inconsistent checksum information, an error in the data recovery process may occur.
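The hazard may be illustrated with a small deterministic sketch in which a data store is serviced between two reads of the recovery process. The function name `recover_during_write`, the ordering of the reads, and the values used are hypothetical and are chosen only to make the inconsistency visible.

```python
from typing import Tuple


def recover_during_write() -> Tuple[int, int]:
    """Return (result built from inconsistent reads, result from consistent reads)."""
    lost = 0x11            # value in the failed memory unit (not readable)
    unit1, unit2 = 0x22, 0x44
    checksum = lost ^ unit1 ^ unit2

    # The recovery process reads the surviving value in unit 1 first.
    read_unit1 = unit1

    # A data store to unit 1 is then serviced, which also updates the checksum.
    new_unit1 = 0x2F
    checksum ^= unit1 ^ new_unit1
    unit1 = new_unit1

    # The recovery process now reads unit 2 and the (updated) checksum.
    inconsistent = checksum ^ read_unit1 ^ unit2

    # Had all reads been taken after the store, the result would be correct.
    consistent = checksum ^ unit1 ^ unit2
    return inconsistent, consistent


inconsistent, consistent = recover_during_write()
```

Here the stale read of unit 1 is combined with the updated checksum, so the rebuilt value differs from the lost value, whereas reads taken from a single consistent state rebuild it correctly.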
Taking additional steps to prevent a data recovery error due to an update to the checksum being used in the data recovery can be complicated if the computer system is allowed to continue data stores to the checksum set of the lost data value during the data recovery. Thus, to prevent an error in the data recovery process, most computer systems prohibit data writes to any memory locations storing a non-checksum data value of the checksum set once the data recovery process is initiated. When the data recovery process is completed, data writes to the foregoing memory locations are again enabled. However, the inability of the computer system to service write requests to the checksum set during the data recovery process reduces the overall efficiency of the computer system.
Thus, a heretofore unaddressed need exists in the industry for providing a system and method for recovering, from an inoperable memory unit of a computer system, a data value of a checksum set without requiring the computer system to temporarily refrain from servicing write requests that overwrite data values of the checksum set.