1. Field of the Invention
The invention relates generally to storage system cache memory subsystems and more specifically relates to methods and structure for maintaining the integrity of data in a cache memory of a storage device despite intermittent failure of the memory subsystem during reset/initialization of the storage device.
2. Discussion of Related Art
Storage devices (e.g., disk controllers or storage controllers) typically include a large cache memory for storing recently accessed user data. The content of the cache memory may then be used to quickly complete subsequent read requests for data from the storage device (or storage subsystem). Because a large-capacity cache memory is desirable, dynamic RAM (DRAM) components are typically used to provide high-capacity cache memory at lower cost. To enhance the performance of the cache memory, double data rate (DDR) memory devices and controllers are typically employed (collectively referred to herein as a “memory subsystem”). In DDR memory subsystems, the DDR memory controller is initialized as part of start-of-day or reset processing. This initialization typically includes a “training” process in accordance with DDR memory standards. The training process enables the memory controller to test and configure various timing parameters so that it can adjust to the signal timing requirements of the specific DDR memory devices it controls.
Once the memory subsystem has been initialized (trained), the storage device can commence normal operations using the cache memory subsystem to store user data. On occasion, the storage device may detect a failure of the memory subsystem while attempting to train the memory subsystem or during other operation of the storage device. Some of these failures are unrecoverable failures of the electronic circuits that make up the memory subsystem. In such cases, the storage device cannot recover, and other data recovery techniques may be required to save the data in the cache memory subsystem (e.g., redundant system configurations in which another storage device takes control in place of the failed device, commonly referred to as “failover” in redundant systems that often use RAID control techniques).
However, some errors are intermittent. Such errors may arise, for example, when the memory subsystem of the storage device is designed to operate very close to the limits of its specifications. For instance, if the storage device is reset to recover from a storage access problem, or if power to the storage controller is lost while a battery backup retains the contents of the memory devices, the memory controller may be in an unusable or unstable state after the reset or power loss and thus may require re-training. As another example, the memory subsystem may unexpectedly indicate an error condition during normal operation. In such cases it may be possible to reset the memory subsystem to eliminate the error condition. However, such a reset of the memory subsystem risks loss of the user data presently stored in the cache memory subsystem. Such loss of data may be unacceptable in high-reliability storage applications.
Thus, it remains an ongoing challenge to correct certain intermittent memory subsystem failures without risking loss of data.