The present invention relates generally to integrated circuit memory devices and, more particularly, to a synchronous dynamic random access memory (SDRAM) system.
State-of-the-art memory systems suffer from intrinsic delays associated with memory read operations at high clock frequencies, and from write latency that grows along with read latency; non-zero latencies for both read and write operations are the norm. For 'read' operations, the data latency is directly tied to the time required to access the data from the sense amplifier latch or other intermediate storage location, typically 3 or 4 clocks at a 266 MHz clock speed. For 'write' operations, although it is still theoretically possible to provide data and address at the same time, data will typically be delayed several clocks after the address to improve command/address bus efficiency and reduce SDRAM power, since read and write operations will generally be intermixed in the system command stream.
Because various forms of error correction code (ECC) are now widely used on the data bus in servers and workstations, most memory failures now result from causes other than traditional data corruption (soft and hard fails of the memory cell or supporting circuits). With the increased dependency on data storage in remote systems (databases, workrooms, department and company servers, and the internet in general), memory failures in server platforms are undergoing increased scrutiny in an attempt to minimize the time in which data or the entire system is unavailable due to hardware failures. Analysis of recent memory failure reports clearly points to the key contributors to memory-induced unplanned system outages as being related primarily to address, control, clock or related signals that do not include ECC coverage, and as being due to one or more of the following failure modes: connector/contact failures, memory controller or re-drive failures, high resistance solder joints, or the like. Since these signals are quite numerous, often pass through several levels of interconnects, and generally use low-cost connectors, the interconnect systems are generally deemed a significant contributor to memory failures in a well-architected ECC-protected system (representing 50% or more of total hard memory fails).
It is an object of the present invention to improve the overall memory system reliability without incurring additional latency.
Another object is to improve the detectability and correction of failures associated with interconnects.
One solution to improve the detectability and correction of failures associated with interconnects is to include error correction across all memory command and address signals, and correct any identified errors during valid operations to a given memory assembly. This approach can be very cost-effective, in that most high-reliability applications now utilize memory assemblies (modules) which include local command and address re-drive circuitry on the same carrier as the synchronous DRAM memory devices. To implement ECC across the command and address lines requires only that a few additional pins be added to the drivers, connectors and module re-drive circuits, in conjunction with the ECC logic. Unfortunately, this method also increases the memory access time, as the ECC logic will result in one to two clocks of added command and address latency (depending on the clock period and logic circuit delays). Since memory failures of this type are generally rare, and due to the system emphasis on minimizing access delays during cache misses, simple use of ECC, as described, is not an ideal solution in many applications.
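As a rough illustration of what ECC across the command and address lines involves, the sketch below applies a textbook Hamming single-error-correcting code to a 16-bit word. The text does not name a specific ECC algorithm, so the choice of Hamming code, the 16-bit width, and the function names `hamming_encode` and `hamming_correct` are all assumptions made for illustration. The extra check bits correspond to the "few additional pins" mentioned above.

```python
def hamming_encode(data_bits):
    """Return a Hamming codeword (list of 0/1) for the given data bits.

    Data bits go in the non-power-of-two positions (1-indexed);
    check bits go in positions 1, 2, 4, 8, ...
    """
    n_data = len(data_bits)
    r = 0
    while (1 << r) < n_data + r + 1:   # smallest r with 2**r >= n_data + r + 1
        r += 1
    total = n_data + r
    code = [0] * (total + 1)           # index 0 is unused
    bits = iter(data_bits)
    for pos in range(1, total + 1):
        if pos & (pos - 1):            # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, total + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity               # makes parity over group p even
    return code[1:]


def hamming_correct(codeword):
    """Return (data_bits, syndrome); a non-zero syndrome is the flipped position."""
    code = [0] + list(codeword)
    syndrome = 0
    for pos in range(1, len(code)):
        if code[pos]:
            syndrome ^= pos            # XOR of the positions of all set bits
    if 0 < syndrome < len(code):
        code[syndrome] ^= 1            # repair the single flipped bit
    data = [code[pos] for pos in range(1, len(code)) if pos & (pos - 1)]
    return data, syndrome


# Hypothetical 16-bit command/address word
word = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
sent = hamming_encode(word)            # 16 data bits + 5 check bits
received = list(sent)
received[5] ^= 1                       # one bad contact flips position 6
recovered, syndrome = hamming_correct(received)
```

Here 16 signal bits need 5 check bits. A code of this kind corrects one flipped bit per word; detecting or correcting multi-bit errors would require a stronger code than this sketch.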
The present invention couples the addition of a new SDRAM operation, ideally suited for emerging devices with non-zero write data latency, with command and address ECC implemented in parallel with the normal memory re-drive method (to ensure minimum memory read and write latency). In summary, the key attributes of this invention are as follows:
1) ECC logic is added to the memory control and address paths, with the intention that single or even multiple bit failures, during valid command cycles, can be corrected to prevent a system outage. It is expected that most will implement this ECC function external to the synchronous memory devices, although the memory devices could include this function as well.
2) Memory commands and addresses are passed to the memory devices, with minimal insertion delay, as the ECC logic function occurs in a parallel path. Should an error be identified by this logic, the ECC correction circuitry will enable rapid recovery without permitting data corruption in the memory.
3) In systems produced using memory devices with the new 'cancel' command:
Any 'read' or 'write' operations that are initiated using command and address inputs subsequently found to be valid, will be executed normally, and without added delay or interruption.
Any 'read' operations initiated using command and address inputs subsequently found to be invalid, will generally have the 'read' data discarded, and the device returned to an idle state awaiting error recovery. For completeness, the present invention includes the concept of early termination of a 'read' operation, since longer bursts are expected in future devices, and a performance savings is possible when recovering from an error.
Any 'write' operations initiated using command and address inputs subsequently found to be invalid, would be followed by a 'Command Cancel (CC)' command, to the same memory bank(s), to prevent the data stored by the memory device(s) from being over-written erroneously. The command would generally return the device to a 'standby' state, awaiting error recovery, although other return states are possible, and would be covered by this invention.
In an ideal system implementation, the memory would be designed to permit continuous operation in a traditional ECC mode (in series with the command and address) or in parallel mode, to ensure minimum latency.
4) System response to a memory command or address ECC error could include one or more of the following actions:
Re-try of the failing operation while the ECC is operating in parallel mode, depending on the type of error identified. Some failures are intermittent in nature, and a repeat of the failing operation may be ideal to confirm the cause and/or determine the need for further action. For an intermittent fault such as a "soft error," this re-try would be completed with no change to the ECC operating in parallel mode.
Execution of the cancelled operation with ECC being invoked in a serial mode which will add one or two clocks of latency. The memory controller or system re-drive logic would execute a new operation, correct single bit (or greater) errors depending on the ECC algorithm utilized and ensure valid operation.
The system would continue to operate with serial ECC enabled until a repair action occurred, or return to parallel ECC operation pending a repeated failure identification and recovery.
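The interaction between attributes 2) through 4) can be sketched as a small cycle-level model. Everything in it is a simplifying assumption for illustration rather than the described hardware: the class names `ToyBank` and `ToyController`, the 3-clock write data latency, the 2-clock ECC check time, a single flipped address bit as the fault, and a policy of one parallel-mode re-try before switching to serial ECC. What it shows is that the cancel only protects stored data when the ECC result arrives before the write data would commit.

```python
from dataclasses import dataclass, field
from enum import Enum


class EccMode(Enum):
    PARALLEL = "parallel"  # check runs beside the re-drive path, no added latency
    SERIAL = "serial"      # check and correction run before re-drive, adds latency


@dataclass
class ToyBank:
    """Hypothetical memory bank with non-zero write data latency."""
    write_latency: int = 3                       # clocks from command to commit (assumed)
    cells: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)  # (commit_cycle, addr, data)

    def write(self, cycle, addr, data):
        self.pending.append((cycle + self.write_latency, addr, data))

    def command_cancel(self):
        """Drop writes that have not committed yet; the bank returns to standby."""
        self.pending.clear()

    def tick(self, cycle):
        for commit_cycle, addr, data in self.pending:
            if commit_cycle <= cycle:
                self.cells[addr] = data
        self.pending = [p for p in self.pending if p[0] > cycle]


@dataclass
class ToyController:
    bank: ToyBank
    ecc_latency: int = 2           # clocks for the parallel check to resolve (assumed)
    max_parallel_retries: int = 1  # assumed policy
    mode: EccMode = EccMode.PARALLEL

    def _parallel_write(self, addr, data, link_ok):
        # The command is re-driven at once; the ECC verdict arrives later.
        seen_addr = addr if link_ok else addr ^ 1   # model one flipped address bit
        self.bank.write(0, seen_addr, data)
        for cycle in range(1, self.bank.write_latency + 1):
            if cycle == self.ecc_latency and not link_ok:
                self.bank.command_cancel()          # arrives before the commit
                return False
            self.bank.tick(cycle)
        return True

    def write(self, addr, data, faults):
        """faults: one boolean per attempt, True meaning a bit error on the link."""
        faults = iter(faults)
        log = []
        if self.mode is EccMode.PARALLEL:
            for _ in range(1 + self.max_parallel_retries):
                log.append("parallel")
                if self._parallel_write(addr, data, link_ok=not next(faults, False)):
                    return log
                log.append("cancel")
            self.mode = EccMode.SERIAL              # repeated failure: use in-line ECC
        log.append("serial")
        # Serial mode corrects the error before re-drive, so the intended
        # address reaches the bank, at the cost of extra latency.
        self.bank.write(0, addr, data)
        self.bank.tick(self.bank.write_latency)
        return log


bank = ToyBank(cells={0x10: "old", 0x11: "keep"})
ctl = ToyController(bank)
intermittent = ctl.write(0x10, "new", faults=[True, False])
hard = ctl.write(0x20, "x", faults=[True, True])
```

In this model an intermittent fault is cancelled once and then succeeds on the parallel-mode re-try, leaving the neighbouring cell untouched. A fault that repeats moves the controller to serial mode, where it stays until some repair action (not modelled here) resets it.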
Various levels of reliability improvement can be implemented using this new command, depending on the amount of fault prevention desired when parallel command/address parity or ECC is invoked with the 'command cancel' function.
Some examples include:
To minimize the probability of issuing an illegal command that may result in significant recovery time, any critical signal(s) can include two separate contacts through each interconnect in the signal path, thereby adding contact redundancy to minimize failure due to discrete high resistance contact failures.
Low cost systems might invoke only parity checking on these signals, with the Command Cancel (CC) function utilized only to identify and terminate operations prior to data corruption. Normal operation to the affected memory would terminate, and a recovery mode, such as multiple re-tries, could be implemented.
To reduce recovery time and/or to simplify controller logic, the 'command cancel' operation can be included in both 'read' and 'write' operations. This is viable only if the DRAM implementation of this function includes both modes.
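For the low-cost, parity-only option among the examples above, a minimal sketch might look like the following. A single even-parity bit over the command/address word is assumed, and the names `even_parity` and `check_command` are hypothetical. Parity can only detect an error, not correct it, so the only possible responses are to let the operation run or to issue a Command Cancel and enter recovery.

```python
def even_parity(bits):
    """Return the bit that makes the total number of ones even."""
    p = 0
    for b in bits:
        p ^= b
    return p


def check_command(bits, parity_bit):
    """Decide what the parallel checker would do with a received word."""
    if even_parity(bits) == parity_bit:
        return "execute"
    return "command_cancel"


# Hypothetical 8-bit command/address word
word = [1, 0, 1, 1, 0, 0, 1, 0]
parity_bit = even_parity(word)

good = check_command(word, parity_bit)
corrupted = list(word)
corrupted[3] ^= 1                      # single-bit interconnect fault
bad = check_command(corrupted, parity_bit)
```

A single parity bit misses any even number of flipped bits, which is one reason the text treats parity as the low-cost tier and ECC as the stronger alternative.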