A primary concern of server computer memory performance is the throughput of the memory devices employed, because a large number of instructions received from networked terminals must often be processed virtually simultaneously. The throughput of a memory device or “bank” is limited by the time necessary for a processor to access the data stored in the device. The average time necessary to access information from a memory device is determined by the physical limitations of the storage medium, operating system, processor speed and other factors. This time is known as the memory device cycle time. Multiple accesses are typically in progress to different memory devices at the same time. However, after a cycle has begun for a particular memory device, other functions on that device, such as responding to an access request or retrieving requested data, may be blocked until the cycle is completed.
Typically, server computer memories are “interleaved” to prevent delays caused by multiple accesses to a particular memory device. Interleaving can be defined as selecting memory devices with address bits chosen such that typical address streams are spread across multiple memory devices rather than concentrated on a single one. An interleaved memory with n memory devices is said to be “n-way interleaved.” Each of the contiguous memory devices is mapped to a virtual address that may be interpreted by an operating system, address decoder or other system-specific software.
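As an illustration of the address-bit selection described above, the following is a minimal sketch of one common interleaving scheme, in which the low-order cache-line bits of an address select the memory device. The 64-byte line size and the function name device_for_address are assumptions made for illustration and are not taken from any particular memory controller.

```python
from typing import Tuple

LINE_SIZE = 64  # bytes per cache line (assumed value, for illustration only)


def device_for_address(addr: int, n_devices: int,
                       line_size: int = LINE_SIZE) -> Tuple[int, int]:
    """Map a byte address to (device index, byte offset within that device).

    Consecutive cache lines rotate across the n devices, so an address
    stream that walks memory linearly touches every device in turn.
    """
    line = addr // line_size          # which cache line the address falls in
    device = line % n_devices         # low-order line bits pick the device
    local_line = line // n_devices    # line index inside the chosen device
    offset = local_line * line_size + addr % line_size
    return device, offset


if __name__ == "__main__":
    # Walk eight consecutive cache lines of a 4-way interleaved memory.
    for addr in range(0, 8 * LINE_SIZE, LINE_SIZE):
        print(addr, device_for_address(addr, n_devices=4))
```

In this sketch, addresses 0, 64, 128 and 192 land on devices 0 through 3, and address 256 wraps back to device 0 at the next local line.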
Incoming memory instructions tend to be “local” in that they oftentimes call for accessing contiguous addresses in a memory array. For example, a first request may be for memory location N, a second at location N+1, a third at N+2, and so on, wherein N, N+1 and N+2 may be mapped to consecutive virtual address locations. For desktop computer memory controllers, it is typically acceptable to map N, N+1, N+2, etc. to consecutive locations on the same memory device as there may only be one or two request streams at a time. However, it is impractical to map consecutive memory locations to the same memory device in a server computer because the volume of requests being received from multiple terminals would likely result in access delays. Instead, interleaving provides for increasing the bandwidth capability of a server computer memory by allocating contiguous memory requests among multiple memory devices. As such, consecutive requests may be addressed to noncontiguous physical memory locations. Therefore, when one memory device receives a request and is opened to process the request, another memory device may receive the next request so that the first and second requests may be processed without undue delay. With enough memory devices, by the time each memory device receives an instruction, the first device should have completed its cycle time and should therefore be ready to process a new instruction.
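The bandwidth benefit described above can be sketched with a simplified timing model. The model assumes that one request arrives per clock tick, that request i targets consecutive address i, and that a device is busy for a fixed cycle time once it starts a request. The function name finish_time and the tick-based model are illustrative assumptions and do not describe a real controller.

```python
def finish_time(n_requests: int, n_devices: int, cycle_time: int) -> int:
    """Tick at which the last of n_requests consecutive-address requests completes."""
    free_at = [0] * n_devices  # tick at which each device next becomes idle
    done = 0
    for i in range(n_requests):
        dev = i % n_devices            # consecutive addresses rotate across devices
        start = max(i, free_at[dev])   # request i arrives at tick i; waits if device is busy
        free_at[dev] = start + cycle_time
        done = max(done, free_at[dev])
    return done


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, "device(s):", finish_time(n_requests=8, n_devices=n, cycle_time=4))
```

Under these assumptions, once the number of devices is at least the cycle time in ticks, each device has finished its previous cycle by the time the request stream returns to it, so no request waits.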
Various advanced interleaving strategies exist for server computer applications. For example, an interleaved memory may be partitioned into separate regions so that, in the case of a memory device failure, damage may be isolated to a particular region rather than jeopardizing the operation of the entire memory. Typically, a first region may contain “pinned memory” addresses for fixed, “non-freeable” data, such as an operating system structure that must remain resident in main memory for a program to perform adequately. A second region may contain “freeable” memory addresses for data that can be written to an external location, such as to a disk.
As with all physical components, memory devices occasionally fail. There are many methods for predicting a memory device failure. For example, a memory device failure may be predicted by evidence of overheating or an abnormally high error rate. When a memory device is determined to be failing, software may request the removal of the failing memory device from the system memory map. When the memory device is removed from the memory map, the data from the failing memory device is copied to one or more replacement memory devices in a technique called “memory sparing.” After the memory data is copied to one or more operational memory devices, the address of the failing memory device is remapped to the new memory device(s) containing the data. Memory sparing can be performed without interrupting the normal operation of the memory devices, but typically requires additional unused (or “spare”) memory devices to be available for when a memory device failure is detected.
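The copy-and-remap sequence of memory sparing can be sketched as follows. The class MemoryMap, its methods, and the device labels are hypothetical names introduced for illustration. The sketch models only the bookkeeping and omits the coherency handling that a real system would need to spare a device while accesses remain in flight.

```python
class MemoryMap:
    """Toy memory map: logical slots point at physical devices, with a spare pool."""

    def __init__(self, active, spares, size):
        # Contents of every physical device, keyed by device label.
        self.data = {dev: [0] * size for dev in list(active) + list(spares)}
        # Logical slot -> physical device currently backing that slot.
        self.map = {slot: dev for slot, dev in enumerate(active)}
        self.spares = list(spares)

    def write(self, slot, offset, value):
        self.data[self.map[slot]][offset] = value

    def read(self, slot, offset):
        return self.data[self.map[slot]][offset]

    def spare_out(self, failing):
        """Copy a failing device to a spare, then remap its slot to the spare."""
        if not self.spares:
            raise RuntimeError("no spare memory device available")
        slots = [s for s, dev in self.map.items() if dev == failing]
        if not slots:
            raise ValueError("device is not in the memory map")
        replacement = self.spares.pop(0)
        self.data[replacement] = list(self.data[failing])  # copy the data first
        self.map[slots[0]] = replacement                   # then remap the address
        return replacement


if __name__ == "__main__":
    m = MemoryMap(active=["A", "B"], spares=["S"], size=4)
    m.write(0, 1, 42)
    m.spare_out("A")
    print(m.map[0], m.read(0, 1))
```

The sketch also makes the stated limitation visible: spare_out can only succeed while an unused device remains in the spare pool.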
While it is generally possible to predict the failure of, and to spare, a failing memory device, there are unique problems associated with sparing a device containing memory that may not be de-allocated, or “pinned memory.” First, as explained above, interleaved memory address locations are not physically contiguous. For example, address location N may be physically allocated to a memory device A, N+1 to a memory device B, N+2 to a memory device C, and so on. Therefore, each memory device may contain data that is essential to the operation of the interleaved memory as a whole. A failure of a given memory device can take out very large portions of a memory address range, or possibly the entire range. In another example, a particular physical memory device in an interleaved memory may contain every jth cache memory line, wherein j may be the number of memory devices in the interleaved memory. Therefore, if a particular memory device fails, the data it contains cannot be removed for a period of time and then replaced later on because the data contained in the remaining memory devices would be incomplete.
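The “every jth cache memory line” example can be made concrete with a short sketch. It assumes a simple rotation of cache lines across devices and a pinned region occupying a contiguous range of line numbers. The function names and the sizes used are illustrative assumptions only.

```python
def lines_on_device(device: int, n_devices: int, total_lines: int) -> range:
    """Cache-line numbers held by one device: every n_devices-th line."""
    return range(device, total_lines, n_devices)


def pinned_lines_lost(device: int, n_devices: int, total_lines: int,
                      pinned: range) -> list:
    """Pinned cache lines that become unavailable if the given device fails."""
    return [line for line in lines_on_device(device, n_devices, total_lines)
            if line in pinned]


if __name__ == "__main__":
    pinned = range(0, 8)  # assume the first 8 of 16 lines are pinned
    for dev in range(4):
        print("device", dev, "holds pinned lines", pinned_lines_lost(dev, 4, 16, pinned))
```

In this small example, each of the four devices holds part of the pinned region, so the failure of any one of them affects pinned memory.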
Interleaving multiplies the range taken out by a single device failure to the point that it is very likely that a given device failure will impact pinned memory. Despite advances in failure prediction, memory removal/addition and sparing, there is currently no reliable method available for replacing a memory device that contains pinned memory. As such, without some method of dealing with pinned memory, techniques for sparing memory devices, especially techniques for sparing interleaved memory devices, have little value.