1. Field of the Invention
The present invention relates to an apparatus and method for controlling access to a memory device, and in particular for controlling access to a memory device where error correction codes are added to data to be stored in the memory device in order to provide error correcting capabilities when the data is subsequently retrieved from the memory device.
2. Description of the Prior Art
It is known to use error correction codes (ECC) in order to protect a data packet from various forms of data corruption. Typically, this is achieved by treating the data packet as a series of data symbols of fixed length, and then adding a number of ECC symbols so that the data symbols and ECC symbols collectively form a code word. Using such a technique, if m ECC symbols are added when forming the code word, then up to m/2 symbol errors at arbitrary positions within the code word can be located and corrected. There are various known ECC coding techniques for generating the symbols of the code word. For example, one technique uses Reed-Solomon codes, which are based on Galois field mathematics and have properties that make them suitable for hardware implementation.
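The relationship between the number of ECC symbols and the correction capability can be illustrated with a short sketch. The function names and the example figures below are hypothetical and chosen only for illustration. The length check assumes a conventional (non-extended) Reed-Solomon code over GF(2^s), for which a code word holds at most 2^s - 1 symbols of s bits each.

```python
def correctable_symbol_errors(num_ecc_symbols: int) -> int:
    """Return the number of symbol errors at unknown positions that can be
    corrected when num_ecc_symbols ECC symbols are added (m/2, rounded down)."""
    if num_ecc_symbols < 0:
        raise ValueError("number of ECC symbols must be non-negative")
    return num_ecc_symbols // 2


def code_word_length(num_data_symbols: int, num_ecc_symbols: int) -> int:
    """Return the total number of symbols in the code word."""
    return num_data_symbols + num_ecc_symbols


def fits_reed_solomon(num_data_symbols: int, num_ecc_symbols: int,
                      symbol_bits: int) -> bool:
    """Check the code word against the length limit of a non-extended
    Reed-Solomon code over GF(2**symbol_bits) (an illustrative assumption)."""
    max_length = 2 ** symbol_bits - 1
    return code_word_length(num_data_symbols, num_ecc_symbols) <= max_length


if __name__ == "__main__":
    # Hypothetical example: 64 data symbols protected by 8 ECC symbols,
    # with 8-bit symbols.
    print(correctable_symbol_errors(8))   # symbol errors that can be corrected
    print(code_word_length(64, 8))        # total symbols in the code word
    print(fits_reed_solomon(64, 8, 8))    # whether the code word fits GF(2^8)
```

The sketch only performs the arithmetic on symbol counts; it does not encode or decode data.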
One practical application for such an ECC coding technique is in memory devices, for example memory devices using DRAM (Dynamic Random Access Memory). One known arrangement of such a memory device involves providing a number of Dual In-line Memory Modules (DIMMs), where each DIMM consists of a number of DRAM chips on a circuit board, including at least one chip reserved for storing ECC information. Often, such a memory device is accessed via burst access operations, each burst comprising a plurality of beats, and the DRAM chips of the DIMM being accessed during each beat. In such an arrangement, it is known to treat the entirety of the data to be written to the memory device via a burst write access as forming the data packet, with a plurality of ECC symbols then being generated to add to that data packet in order to form the code word. As mentioned earlier, if the code word includes m ECC symbols, then up to m/2 symbol errors at arbitrary positions can be corrected when the data is subsequently read from the memory via a burst read access.
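As a rough illustration of how a burst access maps onto a code word, the following sketch derives the symbol counts from a DIMM geometry. The class name, its fields, and the example geometry (eight data chips, one ECC chip, eight bits per chip per beat, eight beats per burst, 8-bit symbols) are all assumptions made for the example and are not taken from any particular memory device.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurstLayout:
    """Hypothetical description of one burst access to a DIMM."""
    data_chips: int              # DRAM chips holding data
    ecc_chips: int               # DRAM chips reserved for ECC information
    bits_per_chip_per_beat: int  # bits each chip transfers in one beat
    beats_per_burst: int         # beats making up one burst
    symbol_bits: int             # width of one code word symbol

    def _symbols(self, chips: int) -> int:
        # Total bits contributed by the given chips over the whole burst,
        # expressed as a whole number of symbols.
        total_bits = chips * self.bits_per_chip_per_beat * self.beats_per_burst
        if total_bits % self.symbol_bits != 0:
            raise ValueError("burst does not divide into whole symbols")
        return total_bits // self.symbol_bits

    @property
    def data_symbols(self) -> int:
        """Data symbols in the data packet formed by one burst."""
        return self._symbols(self.data_chips)

    @property
    def ecc_symbols(self) -> int:
        """ECC symbols (m) that fit in the reserved chips over one burst."""
        return self._symbols(self.ecc_chips)

    @property
    def correctable_symbol_errors(self) -> int:
        """Symbol errors at unknown positions that can be corrected (m/2)."""
        return self.ecc_symbols // 2


if __name__ == "__main__":
    layout = BurstLayout(data_chips=8, ecc_chips=1,
                         bits_per_chip_per_beat=8,
                         beats_per_burst=8, symbol_bits=8)
    print(layout.data_symbols, layout.ecc_symbols,
          layout.correctable_symbol_errors)
```

Treating the whole burst as one data packet, as described above, means the ECC symbols from every beat contribute to a single code word rather than each beat being protected separately.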
There are various applications where data stored in the memory device may be subjected to such ECC coding techniques. One particular example is in Reliability, Availability, Serviceability (RAS) critical applications such as data server applications, where the use of such techniques provides greater reliability and availability of service.
When the memory device is arranged as discussed earlier by employing a number of DIMMs, it is easy to replace any one of the DIMMs in the event of a failure. In particular, if one or more individual DRAM chips within a DIMM fail, then that failure can be reported to an entity responsible for managing the memory device, and the relevant DIMM can be replaced. Accordingly, current ECC coding techniques are targeted at maintaining service until a failed module can be replaced.
However, such memory devices cannot always be deployed in convenient locations, and accordingly there can be some delay in replacing a failed DIMM. During such time, it would be preferable for the memory device to continue to be operational. Whilst one known way to achieve this is to provide one or more redundant blocks of memory, which can be brought into service in the event of a failure, this increases the size and cost of the memory device, and is not appropriate in many applications.
Furthermore, it is increasingly the case that low-cost, low-power servers are being built with solder-down memory parts. Unlike the above arrangement that uses replaceable DIMMs, once such a memory device is assembled, it cannot be maintained in a similar fashion, and accordingly once sufficient memory failures have accumulated past the capability of the ECC protection scheme, the memory device is rendered unusable. It would accordingly be desirable to prolong the usability of such memory devices.
The paper “Virtualised ECC: Flexible Reliability in Main Memory”, by Doe Hyun Yoon et al., Micro, IEEE, Volume 31, Issue 1, pages 11-19 (Digital Object Identifier 10.1109/MM.2010.103) describes a system in which an operating system may decide, when allocating a portion of main memory to a particular application, how to apportion that allocated memory portion between the storage of data and the storage of related ECC information, with the goal of maintaining a constant error protection rate without requiring a dedicated memory area for ECC storage.