1. Technical Field
The present invention relates generally to fault tolerant data processing systems and in particular to managing memory access operations in mirrored memory systems. More particularly, the present invention relates to a system and method for scheduling and processing read transactions in a mirrored memory system such as may be performed by memory access control logic.
2. Description of the Related Art
Computer failures can result from malfunctioning disk drives, memory or processors, conflicts between hardware components, software errors, and environmental interference among other things. Solutions for curbing the negative effects of such failures have included, for example, Predictive Failure Analysis (PFA), which provides autonomous monitoring of specified system parameters or failure conditions. PFA is commonly utilized in data storage or memory applications to predict and issue alerts warning of actual or imminent device failures. This allows a system administrator to either hot-swap the faulty component or schedule downtime at low-impact periods for the component to be fixed or replaced.
While PFA has provided substantial gains in preventing data loss and minimizing runtime interruption for disk drive systems such as RAID systems, neither PFA nor other system failure warning or recovery techniques have adequately addressed data loss and system interruption caused by an actual memory data error. A solution directed to providing backup redundancy in the face of an actual data error, whether resulting from a system failure or otherwise, is generally known as mirroring. In disk mirroring, data is written to two duplicate disks simultaneously in disk drive systems such as RAID level 1 systems. If one of the mirrored disk drives fails, the system switches to the other disk without any loss of data or service.
So-called memory mirroring is similar to disk mirroring to the extent it involves maintaining alternate copies of memory contents in two different regions of memory. Memory mirroring involves storing data to two different memory locations such that a backup copy is always available. Memory mirroring has become a key reliability feature for large scale server systems, such as the xSeries line of high performance servers from IBM Corporation. Fundamentally, memory mirroring operates such that responsive to detecting an uncorrectable data error, the second copy is accessed, thus avoiding loss of data and processing service similar to the disk mirroring scenario. A memory controller or equivalent device must be able to access the backup memory region when an error is detected in the first memory region. This type of access for retrieving a backup memory copy responsive to a detected error is commonly referred to as a mirror failover read.
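For illustration only, the mirror failover read described above can be sketched in software as follows. This is a minimal model under stated assumptions, not the memory controller logic of any actual system: the names MemoryRegion, UncorrectableError, mirrored_write, and failover_read are hypothetical, and an uncorrectable error is simulated by marking an address as bad rather than by real error detection hardware.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


class UncorrectableError(Exception):
    """Hypothetical signal that a read hit data which cannot be corrected."""


@dataclass
class MemoryRegion:
    """Toy model of one of the two mirrored memory regions."""
    cells: Dict[int, int] = field(default_factory=dict)
    bad_addresses: Set[int] = field(default_factory=set)  # simulated faults

    def write(self, address: int, value: int) -> None:
        self.cells[address] = value

    def read(self, address: int) -> int:
        if address in self.bad_addresses:
            raise UncorrectableError(f"uncorrectable error at {address:#x}")
        return self.cells[address]


def mirrored_write(primary: MemoryRegion, mirror: MemoryRegion,
                   address: int, value: int) -> None:
    """Store the same data in both regions so a backup copy always exists."""
    primary.write(address, value)
    mirror.write(address, value)


def failover_read(primary: MemoryRegion, mirror: MemoryRegion,
                  address: int) -> int:
    """Read the first copy; fall back to the second copy only on an error."""
    try:
        return primary.read(address)
    except UncorrectableError:
        # Mirror failover read: retrieve the backup copy.
        return mirror.read(address)
```

In this sketch the second copy is touched on a read only after an error is detected in the first, which is the behavior the term mirror failover read refers to.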
Conventional mirrored memory architectures employ synchronization of two memory ports to accomplish memory mirroring. Such synchronization requires that each memory access request be issued simultaneously to both ports of the respective mirrored memory regions. Writes are issued to both ports, guaranteeing coherent memory. Read requests are also issued to both ports, which return the data to a central data buffer simultaneously. If an uncorrectable error is detected for one of the reads, the corresponding port blocks the write enable to the central data buffer to prevent erroneous data from being accessible on the system bus. As a result, only the correct data or instructions (collectively referred to herein as data) are written to and accessible from the central data buffer. In this manner, conventional mirror failover read operations prevent system-wide failures that would otherwise result from uncorrectable memory errors.
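The conventional synchronized read can likewise be modeled in a few lines. The sketch below is an assumption-laden simplification: Port and synchronized_read are hypothetical names, the two port reads are performed one after the other rather than truly simultaneously, and the write enable is represented as a boolean returned alongside the data.

```python
class Port:
    """Toy model of one memory port serving one mirrored region."""

    def __init__(self, cells, bad_addresses=()):
        self.cells = dict(cells)
        self.bad_addresses = set(bad_addresses)  # simulated uncorrectable errors
        self.accesses = 0                        # port accesses consumed so far

    def read(self, address):
        """Return (data, write_enable); write_enable is False on an error."""
        self.accesses += 1
        return self.cells[address], address not in self.bad_addresses


def synchronized_read(port_a, port_b, address):
    """Issue one read to both ports and fill the central data buffer."""
    central_buffer = None
    for port in (port_a, port_b):
        data, write_enable = port.read(address)
        if write_enable:
            # Only a port that saw no uncorrectable error may write the buffer.
            central_buffer = data
    return central_buffer


# Port A holds a corrupted copy at 0x10; port B holds the good copy.
port_a = Port({0x10: 0xBAD}, bad_addresses={0x10})
port_b = Port({0x10: 42})
result = synchronized_read(port_a, port_b, 0x10)
```

Here result holds the good copy from port B, and each port has recorded one access, so a single logical read has consumed two port accesses even though only one copy was needed.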
The foregoing simultaneous dual access architecture is problematic in terms of sheer complexity. Both ports must be synchronized for each data access operation, even if the command is generated by separate memory controller entities such as scrub controllers which control scheduled testing, detecting, and reporting of memory errors.
Another problem with the foregoing conventional mirrored memory management is the dramatic reduction in available port bandwidth that results from using both ports for what is effectively a single memory access operation. Given the relative rarity of memory errors in such systems, and given that system memory access has traditionally been the greatest contributor to system latency, the impact of dual access on system bandwidth is particularly disadvantageous.
It can therefore be appreciated that a need exists for an improved system and method for managing mirrored memory access operations that maintains the system reliability aspects of backup memory while reducing the system bandwidth penalty associated with conventional mirrored memory systems. The present invention addresses this as well as other problems unaddressed by the prior art.