1. Field of the Invention
The present invention relates generally to the field of computer systems and, more particularly, to a structure and a method of tolerating data errors in multi-processor systems.
2. Background of the Related Art
Multi-processing computer systems generally include two or more processors which may be employed to perform computing tasks. A particular computing task may be performed upon one processor while the other processor(s) performs unrelated processing tasks. Alternatively, components of a particular task may be distributed among multiple processors to decrease the time required to perform the computing task. Generally speaking, a processor is a device configured to perform an operation upon one or more operands to produce a result. The operation is performed in response to an instruction executed by the processor.
Multi-processor computer systems with a single address-base and coherent caches offer a flexible and powerful computing environment. The single address-base and coherent caches together ease the problem of data partitioning and dynamic load balancing. The single address-base and coherent caches also provide better support for parallelizing compilers, standard operating systems, and multi-programming, thus enabling more flexible and effective use of the machine.
One structure for multi-processing computer systems is a distributed memory architecture. A distributed memory architecture typically includes multiple nodes each having one or more processors and a memory. The nodes are coupled to a network which allows communication between the nodes. When considered as a whole, the combined memory of all the nodes forms a "shared memory" which can be accessed by each node. Typically, directories are used to identify which nodes have copies of data corresponding to a particular address. Coherency of the data is maintained by examination of the directories to determine the state of data.
Illustrative directory-based cache coherency architectures that have emerged include Cache-Coherent Non-Uniform Memory Access (CC-NUMA) and Cache-Only Memory Architecture (COMA), for example. Both CC-NUMA and COMA architectures have a distributed memory, a scalable interconnection network, and directory-based cache coherence. Distributed memory and scalable interconnection networks provide the required scalable memory bandwidth, while directory-based schemes provide cache coherence. In contrast to CC-NUMA architectures, COMA architectures convert a per-node main memory into a large secondary or tertiary cache, which is also called Attraction Memory (AM). The conversion occurs by adding tags to cache-line size partitions of data in the main memory. As a consequence, the location of a data item in the system is decoupled from the physical address of the data items, and the data item is automatically migrated or replicated in main memory depending on a memory reference pattern.
Unfortunately, in both COMA and CC-NUMA architectures, data may be corrupted, resulting in errors in memory. Such errors occur because memory, as an electronic storage device, may return information that is different from what was originally stored. In general, two kinds of errors can occur in a memory system: repeatable (hard) errors and transient (soft) errors. A hard error is often the result of a hardware fault and is relatively easy to diagnose and correct because it is consistent and repeatable. A soft error occurs when a bit reads back the wrong value once, but subsequently functions correctly.
The only protection from memory errors is to use a memory error detection or correction protocol. Some protocols can detect errors in only one bit of an eight-bit data byte, while others can automatically detect errors in more than one bit. Other protocols can both detect and correct single-bit and/or multi-bit memory errors.
Common error detection/correction mechanisms include parity, error correcting code (ECC), and the like. It is well known in the art to use parity and error correcting code (ECC) to validate the reliability of data transferred between a central processing unit (CPU) and a memory, programmed input/output (PIO) device, or the like. Further, ECC is used to recover from certain data errors in memory.
When parity checking is enabled, each time a byte is written to memory, a logic circuit called a parity generator/checker examines the byte and determines whether the data byte had an even or an odd number of ones. If it had an even number of ones, the ninth (parity) bit is set to a one, otherwise it is set to a zero. Thus, no matter how many bits were set to one in the original eight data bits, the nine bits together make up an odd number of ones. This mechanism is referred to as odd parity. When the data is read back from memory, the parity circuit acts as an error checker. It reads back all nine bits and determines again if there are an odd or an even number of ones. If there are an even number of ones, there is likely an error in one of the bits. When a parity error is detected, the parity circuit generates an interrupt, which instructs the processor to halt to ensure that the incorrect memory does not corrupt executing or executable processes.
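The odd parity mechanism described above can be illustrated with a minimal software sketch. The function names (`odd_parity_bit`, `store`, `check`) and the placement of the parity bit in bit 8 of a nine-bit word are assumptions made for illustration only; an actual parity generator/checker is a logic circuit, and the interrupt behavior is not modeled here.

```python
def odd_parity_bit(byte: int) -> int:
    """Return the ninth (parity) bit for an 8-bit value under odd parity."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected an 8-bit value")
    ones = bin(byte).count("1")
    # Even number of ones in the data -> parity bit is 1, otherwise 0,
    # so the nine bits together always hold an odd number of ones.
    return 1 if ones % 2 == 0 else 0


def store(byte: int) -> int:
    """Pack data and parity into a 9-bit word (parity in bit 8; assumed layout)."""
    return (odd_parity_bit(byte) << 8) | byte


def check(word: int) -> bool:
    """Return True when the 9-bit word still holds an odd number of ones."""
    return bin(word & 0x1FF).count("1") % 2 == 1


word = store(0b00000011)          # two ones in the data, so the parity bit is 1
single_flip = word ^ 0b00000100   # one bit corrupted
double_flip = word ^ 0b00000110   # two bits corrupted
print(check(word), check(single_flip), check(double_flip))
```

In this sketch a single flipped bit fails the check, while two flipped bits pass it, which illustrates why parity alone is limited to detecting an odd number of bit errors.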
Parity checking provides single-bit error detection, but merely determines the existence of an error without correcting it. ECC not only detects both single-bit and multi-bit errors, but can also correct single-bit or multi-bit errors. ECC uses a special algorithm to encode information in a block of bits that contains sufficient detail to permit the recovery of a single or multi-bit error in the protected data. The correction of a single or multi-bit error is dependent upon the ECC algorithm used. When ECC detects an uncorrectable error, it generates an interrupt that instructs the system to shut down to avoid data corruption.
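As one illustration of an ECC algorithm, the following sketch uses an extended Hamming code over four data bits, which corrects any single-bit error and detects (but cannot correct) any double-bit error. The choice of this particular code, the bit layout, and the names `encode` and `decode` are assumptions made for illustration; the text above does not prescribe a specific ECC algorithm, and practical memory systems protect wider words.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit codeword (list of bits).

    Index 0 holds an overall parity bit; indices 1..7 are Hamming(7,4)
    positions with check bits at 1, 2, 4 and data bits at 3, 5, 6, 7.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    c = [0] * 8
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) % 2  # makes the total number of ones even
    return c


def decode(codeword: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    c = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if c[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall_fails = sum(c) % 2 == 1
    if syndrome == 0 and not overall_fails:
        status = "ok"
    elif overall_fails:
        # Single-bit error: the syndrome is the failing position
        # (0 means the overall parity bit itself was hit).
        c[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with intact overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = c[3] | (c[5] << 1) | (c[6] << 2) | (c[7] << 3)
    return status, nibble


good = encode(0b1011)
one_bad = list(good)
one_bad[5] ^= 1
two_bad = list(good)
two_bad[3] ^= 1
two_bad[6] ^= 1
print(decode(good), decode(one_bad), decode(two_bad))
```

The "uncorrectable" outcome in this sketch corresponds to the case in which a conventional system would raise an interrupt.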
One problem with conventional error detection/correction mechanisms is that the frequency of system interrupts is still higher than desirable. Interrupts may cause a system or a processor reset, depending on the nature of the failure and the software capabilities of the system. Interrupts are undesirable because of the resulting system downtime, lost data and lost productivity.
Therefore, there remains a need for a structure and a technique for detecting errors while minimizing system interrupts. The system should be capable of detecting single or multiple bit errors and still avoid a system interrupt.
The present invention generally provides a method and system adapted to check data for errors and, if an error is detected, determine if a valid copy of the data is available within the system.
In one aspect of the invention, a method of transferring data in a directory-based data processing system is provided. The method comprises the steps of: accessing, by a requesting device, data contained in a local memory associated with the requesting device; determining whether an error condition exists in the data; and if an error condition exists, requesting the data from a remote memory. In one embodiment, requesting the data from a remote memory comprises first accessing a directory to determine the state of the data. If the state indicates that the data is available at the remote memory, a request is placed on an interconnect coupling the requesting device and the remote memory.
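The steps of this method can be sketched in software as follows. This is a simplified model under stated assumptions, not the claimed implementation: the names `Node`, `System`, `read` and `UncorrectableError` are hypothetical, the directory is modeled as a map from an address to the set of nodes holding a valid copy, the error check is a stand-in for parity or ECC, and the interconnect request is represented by a direct call to the remote node.

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


class UncorrectableError(Exception):
    """Raised when no valid copy of the data is available in the system."""


@dataclass
class Node:
    node_id: int
    memory: Dict[int, int] = field(default_factory=dict)  # address -> data
    bad: Set[int] = field(default_factory=set)  # addresses failing the check

    def read_local(self, addr: int) -> Tuple[int, bool]:
        """Return (data, error_detected); the check stands in for parity/ECC."""
        return self.memory[addr], addr in self.bad


@dataclass
class System:
    nodes: Dict[int, Node]
    directory: Dict[int, Set[int]]  # address -> ids of nodes with a valid copy

    def read(self, requester_id: int, addr: int) -> int:
        # Step 1: access the data in the requester's local memory.
        data, error = self.nodes[requester_id].read_local(addr)
        # Step 2: determine whether an error condition exists.
        if not error:
            return data
        # Step 3: consult the directory, then request the data remotely.
        holders = self.directory.get(addr, set()) - {requester_id}
        for holder_id in sorted(holders):
            remote_data, remote_error = self.nodes[holder_id].read_local(addr)
            if not remote_error:
                return remote_data
        raise UncorrectableError(f"no valid copy of address {addr:#x}")


node0 = Node(0, memory={0x40: 7}, bad={0x40})  # local copy is corrupted
node1 = Node(1, memory={0x40: 42})             # remote copy is intact
system = System({0: node0, 1: node1}, directory={0x40: {0, 1}})
print(system.read(0, 0x40))
```

In this sketch the exception is raised only when the directory shows no other valid copy, which is the case in which an interrupt would still be needed.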
In another aspect of the invention, a data processing system includes a first node having a first processor, a first memory and a directory; a second node having a second processor and a second memory; and an interconnect coupling the first and second nodes. The data processing system is adapted to perform a process comprising: accessing, by the first processor, data contained in the first memory; determining whether an error condition exists in the data; and if an error condition exists, requesting the data from the second memory.
In yet another aspect of the invention, a directory-based data processing system having a distributed shared memory is provided. The data processing system comprises: a first node including at least a first processor, a first memory, a first memory controller and a first directory containing state data for one or more memory blocks of the first memory; a second node including at least a second processor, a second memory, a second memory controller and a second directory containing state data for one or more memory blocks of the second memory; and an interconnect connecting the first node and the second node. The first memory controller is configured to access data contained in the first memory and determine whether an error condition exists in the data and, if an error condition exists, place a request for the data on the interconnect.
These and other features and objects of this invention will be apparent to those skilled in the art from a review of the following detailed description and the accompanying drawings.