1. Field of the Invention
The present invention generally relates to data processing techniques and, in particular, to a system and method for performing backward error recovery (BER) in a computer.
2. Related Art
Large computer systems (e.g. servers) often employ a plurality of memory units to provide enough instruction and data memory for various applications. Each memory unit has a large number of memory locations of one or more bits where data can be stored, and each memory location is associated with and identified by a particular memory address, referred to hereafter as a “memory unit address.” When an instruction that stores data is executed, a bus address defined by the instruction is used to obtain a memory unit address, which identifies the memory location where the data is actually to be stored. In this regard, a mapper is often employed that maps or translates the bus address into a memory unit address having a different value than the bus address. There are various advantages associated with utilizing bus addresses that are mapped into different memory unit addresses.
For example, many computer applications are programmed such that the bus addresses are used consecutively. In other words, one of the bus addresses is selected as the bus address to be first used to store data. When a new bus address is to be utilized for the storage of data, the new bus address is obtained by incrementing the previously used bus address.
If consecutive bus addresses are mapped to memory unit addresses in the same memory unit, then inefficiencies may occur. In this regard, a finite amount of time is required to store and retrieve data from a memory unit. If two consecutive data stores occur to the same memory unit, then the second data store may have to wait until the first data store is complete before the second data store may occur. However, if the two consecutive data stores occur in different memory units, then the second data store may commence before the first data store is complete. To minimize memory latency and maximize memory bandwidth, consecutive bus addresses should access as many memory units as possible. This can also be described as maximizing the memory interleave.
As a result, the aforementioned mapper is often designed to map the bus addresses to the memory unit addresses such that each consecutive bus address is translated into a memory unit address in a different memory unit. For example, a bus address having a first value is mapped to a memory unit address identifying a location in a first memory unit, and the bus address having the next highest value is mapped to a memory unit address identifying a location in a second memory unit. Therefore, it is likely that two consecutive data stores from a single computer application do not occur in the same memory unit. In other words, it is likely that consecutive data stores from a computer application are interleaved across the memory units.
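By way of non-limiting illustration, the interleaved mapping described above may be sketched as follows. The sketch assumes four memory units and a simple modulo translation; the constant NUM_UNITS and the function name map_bus_address are hypothetical and are not taken from any particular system, and an actual mapper may use a different translation.

```python
# Assumption: four memory units and a modulo-based interleave.
NUM_UNITS = 4


def map_bus_address(bus_address, num_units=NUM_UNITS):
    """Translate a bus address into (memory unit index, memory unit address).

    Consecutive bus addresses land in different memory units, so a second
    store need not wait for a first store to the same unit to complete.
    """
    unit = bus_address % num_units           # which memory unit is accessed
    unit_address = bus_address // num_units  # location within that unit
    return unit, unit_address


# Map eight consecutive bus addresses and show where each one lands.
mapped = [map_bus_address(address) for address in range(8)]
for address, (unit, unit_address) in enumerate(mapped):
    print(f"bus address {address} -> unit {unit}, unit address {unit_address}")
```

In this sketch, any two consecutive bus addresses are translated into memory unit addresses in different memory units, which is the property relied upon to interleave consecutive data stores.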
Backup systems are often employed to enable the recovery of data in the event of a failure of one of the memory units. For example, U.S. Pat. No. 4,849,978, which is incorporated herein by reference, describes a checksum backup system that may be used to recover the data of a failed memory unit. To back up data stored within the memory units of a typical computer system, one of the memory units in the computer system is designated as a checksum memory unit. Each location in the checksum memory unit is correlated with locations in the other non-checksum memory units. During operation, a checksum value is maintained in each memory location of the checksum memory unit according to techniques that will be described in more detail hereinbelow. Each checksum value may be utilized to recover any of the non-checksum data values stored in any of the memory locations correlated with the checksum memory location that is storing the checksum value. The checksum value stored in a checksum memory location and each of the non-checksum values stored in a location correlated with the checksum memory location shall be collectively referred to herein as a “checksum set.”
Each location in the checksum memory unit is initialized to zero. Each data value being stored in a location of one of the non-checksum memory units is exclusively ored with the data value previously stored in the location of the one non-checksum memory unit. In other words, the data value being stored via a data store operation is exclusively ored with the data value being overwritten via the same data store operation. The result of the exclusive or operation is then exclusively ored with the value, referred to as the “checksum,” in the correlated address of the checksum memory unit. The result of the foregoing exclusive or operation is then stored in the foregoing address of the checksum memory unit as a new checksum value.
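The checksum update described above may be sketched, for illustration only, as follows. The class name ChecksumMemory and its fields are hypothetical. The sketch further assumes that every non-checksum location also starts at zero, so that the zero-initialized checksums are consistent from the outset, and that a checksum location is correlated with the same-numbered location in each non-checksum memory unit.

```python
class ChecksumMemory:
    """Hypothetical model of several data memory units plus one checksum unit."""

    def __init__(self, num_data_units, locations_per_unit):
        # Assumption: data locations start at zero, matching the
        # zero-initialized checksum locations.
        self.data = [[0] * locations_per_unit for _ in range(num_data_units)]
        self.checksum = [0] * locations_per_unit

    def store(self, unit, location, new_value):
        """Store a value and fold the change into the correlated checksum."""
        old_value = self.data[unit][location]
        # XOR of the value being stored with the value being overwritten.
        delta = new_value ^ old_value
        # XOR that result with the current checksum to form the new checksum.
        self.checksum[location] ^= delta
        self.data[unit][location] = new_value


# Usage: three data units, four locations each.
memory = ChecksumMemory(num_data_units=3, locations_per_unit=4)
memory.store(0, 2, 0x12)
memory.store(1, 2, 0x34)
memory.store(2, 2, 0x56)
memory.store(1, 2, 0x0F)  # overwrite; the old value 0x34 is cancelled out
```

After any sequence of stores in this sketch, each checksum equals the exclusive or of the values currently held in the correlated non-checksum locations.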
When a memory unit fails, the data value stored in a location of the failed memory unit can be recovered by exclusively oring the checksum in the correlated location of the checksum memory unit with each of the values in the other memory units that are stored in locations also correlated with the location of the checksum. The process of maintaining a checksum and of recovering a lost data value based on the checksum is generally well known in the art.
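As a minimal illustrative sketch of the recovery step, the lost value may be reconstructed from the checksum and the surviving values of the same checksum set. The function name recover_value and the sample values are hypothetical.

```python
from functools import reduce
from operator import xor


def recover_value(checksum, surviving_values):
    """Rebuild the value lost with a failed memory unit.

    The checksum is the XOR of every value in the checksum set, so XORing
    it with each surviving value leaves only the missing value.
    """
    return reduce(xor, surviving_values, checksum)


# Hypothetical checksum set: one correlated location in each of three units.
values = [0x12, 0x34, 0x56]
checksum = reduce(xor, values, 0)

# Suppose the unit holding values[1] fails; recover it from the rest.
recovered = recover_value(checksum, [values[0], values[2]])
```

The sketch relies only on the property that exclusive or is its own inverse, so no separate copy of the lost value is needed.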
When a processor failure occurs, it is often desirable to return the computer system to a previously known state, which is assumed to be error free, and restart execution from this known state. The process for returning a computer system to a previously known state is often referred to as backward error recovery (BER). By performing BER after a processor failure, it can be ensured that any errors introduced by the failure are effectively eliminated.
BER is normally achieved by saving an additional copy of all memory values, including checksum values. This additional copy is stored in memory, referred to hereafter as “backup memory,” dedicated for storing the additional copy. The remainder of the computer system's memory shall be referred to hereafter as “main memory.”
Initially, the memory values stored in backup memory are identical to the memory values stored in main memory. This computer system state, which includes the same memory values in main memory and backup memory, is commonly referred to as a “checkpoint state.” As the computer system executes instructions, the data values in main memory are updated, and the data values written to main memory are also transmitted to a first in, first out (FIFO) device. If a processor fails during execution, the main memory can be returned to its checkpoint state by copying the memory values of the backup memory to main memory. Once the main memory is returned to its checkpoint state, the BER process is complete.
If the computer system executes instructions for a period of time without error and, therefore, without performing BER, then the checkpoint state can be updated such that the copy stored in backup memory represents a more recent state of the main memory. To achieve this, the data values in the FIFO device are used to update the values in the backup memory such that the values in the backup memory are identical to the values stored in main memory at a time later than the time of the original checkpoint state. The state of the backup memory then represents a later checkpoint state of the main memory. Thus, the backup memory can now be used to return the state of the computer system to the later checkpoint state in the event of a future processor failure. The backup memory can be periodically updated, as described above, such that the backup memory always represents a relatively recent state of main memory. As a result, the impact of performing BER can be minimized.
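The conventional checkpoint scheme described above may be sketched, for illustration only, as follows. The class CheckpointedMemory and its method names are hypothetical. The sketch also assumes that the FIFO is emptied when BER is performed, since its entries describe writes made after the checkpoint; the foregoing description does not state how the FIFO is handled at that point.

```python
from collections import deque


class CheckpointedMemory:
    """Hypothetical model of main memory, backup memory, and a FIFO device."""

    def __init__(self, size):
        self.main = [0] * size
        self.backup = list(self.main)  # initial checkpoint state
        self.fifo = deque()            # writes made since the last checkpoint

    def write(self, address, value):
        """Update main memory and transmit the same write to the FIFO."""
        self.main[address] = value
        self.fifo.append((address, value))

    def advance_checkpoint(self):
        """Drain the FIFO in order so backup memory matches a later state."""
        while self.fifo:
            address, value = self.fifo.popleft()
            self.backup[address] = value

    def recover(self):
        """Perform BER: copy backup memory to main memory."""
        self.main[:] = self.backup
        # Assumption: pending FIFO entries postdate the checkpoint and are dropped.
        self.fifo.clear()


# Usage: establish a later checkpoint, then roll back writes made after it.
memory = CheckpointedMemory(size=4)
memory.write(0, 7)
memory.advance_checkpoint()  # backup memory now reflects the write of 7
memory.write(0, 9)
memory.write(1, 5)
memory.recover()             # processor failure: return to the checkpoint
```

Draining the FIFO in first in, first out order matters in this sketch: if the same address is written more than once between checkpoints, the backup memory ends up holding the most recent value.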
Unfortunately, performing BER in a system with reliable memory can introduce numerous problems and data errors related to multiple protection domains and memory failures. For example, if a memory system failure occurs during a BER, then it is possible that the checksum values will be inconsistent or, in other words, will not represent the correct checksums of the values being backed up by the checksum values. Due to complexities involved in protecting against errors that occur as a result of memory system failures during BER, most prior art systems do not implement measures to protect against such errors, thereby forcing these systems to duplicate data in expensive backup memory or leaving these systems vulnerable during BER processes.
Furthermore, when a BER process is performed, normally each value within main memory is returned to its checkpoint state. Such a methodology helps to keep checksum values consistent with the non-checksum values such that memory failures can be handled independently of processor failures. However, when the BER process occurs because of an error in a single protection domain, only the data values associated with the single protection domain should be returned to their checkpoint state. Returning the other values within main memory to their checkpoint state may introduce errors in the other protection domains that are part of the computer system.
Thus, a heretofore unaddressed need exists in the industry for providing an improved system and method for performing BER, particularly in systems with checksum memory and/or multiple protection domains. It is desirable for the system and method to operate efficiently and to avoid data errors that may occur when a memory failure occurs during the BER.