Servers are used in a wide variety of computing applications. A scalable server is one that can grow to include a potentially large number of computing, input/output (I/O), and memory elements. The most extreme examples are supercomputer clusters, which are growing toward 100,000 (100K) processors and millions of dynamic random access memory (DRAM) devices.
For large-scale systems such as supercomputing clusters, soft and hard error rates can have a significant impact on efficiency and usability. As is known, a soft error is an error occurrence in a computer's memory system that changes a data value or an instruction in a program. A soft error typically does not damage the system's hardware; the only damage is to the data being processed. A hard error, by contrast, is an error occurrence in a computer system that is caused by the failure of a memory chip. A hard error can initially appear like a chip-level soft error, but a difference is that the hard error is not rectified when the computer is rebooted. The typical solution to a hard error is to replace the failed memory chip or module entirely.
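The distinction above can be sketched with a toy model (purely illustrative, not modeling any particular hardware): a soft error is a one-time bit flip that a rewrite or reboot clears, while a hard error behaves like a stuck-at fault in the failed circuitry that reasserts itself on every read.

```python
# Toy model of the soft/hard error distinction (illustrative only).
class MemoryCell:
    def __init__(self):
        self.value = 0
        self.stuck_bit = None  # position of a hard (stuck-at-1) fault, if any

    def write(self, value):
        self.value = value

    def read(self):
        if self.stuck_bit is not None:
            # A hard error reasserts itself on every read.
            return self.value | (1 << self.stuck_bit)
        return self.value

    def soft_error(self, bit):
        self.value ^= 1 << bit  # one-time bit flip; the hardware is undamaged

    def hard_error(self, bit):
        self.stuck_bit = bit  # failed circuitry: survives rewrites and reboots

cell = MemoryCell()
cell.write(0b1010)
cell.soft_error(0)
assert cell.read() == 0b1011  # data corrupted...
cell.write(0b1010)            # ...but rewriting (as on reboot) restores it
assert cell.read() == 0b1010

cell.hard_error(2)
cell.write(0b1010)            # rewriting does not clear a hard error
assert cell.read() == 0b1110
```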
Failures can occur in many hardware and software components, and careful consideration must be given to all parts of the system to ensure that the mean time between system failures is acceptable. The main (volatile) store in such systems is one of the most critical areas, simply because there are more main store devices than any other type of system component.
Typically the memory devices are DRAM, and the main focus is tolerating soft DRAM data bit failures, because DRAM cells' small feature size makes them sensitive to soft error mechanisms. However, as the number of DRAM devices in a system grows, other soft failure mechanisms can become a significant system reliability issue.
It is common today, even in small computing platforms, to protect against soft data bit failures (both DRAM cell and data interface failures). In some high-end servers, error protection mechanisms are spread across a number of memory devices (or even dual in-line memory modules, or DIMMs), such that the loss of an entire memory device can be tolerated (not unlike the tolerance of Redundant Array of Independent Disks, Level 5 (RAID-5) to the loss of an entire hard drive). Such schemes typically protect address, control, and data signals with error correction codes (ECCs), which has the desirable effect of detecting and recovering from soft failures in address and control interfaces, as well as in data interfaces and memory cells.
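The RAID-5 analogy can be made concrete with a minimal sketch. Here data is striped across four hypothetical memory devices plus one parity device holding the byte-wise XOR of the others, so the contents of any single failed device can be reconstructed from the survivors. The device count and widths are illustrative assumptions, not taken from any particular server.

```python
from functools import reduce

def xor_bytes(chunks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

# Four hypothetical data devices, each contributing 4 bytes of a line.
data = [bytes([i, i + 1, i + 2, i + 3]) for i in (0, 10, 20, 30)]
parity = xor_bytes(data)  # a fifth device stores the XOR of the other four

# Simulate the loss of an entire device (device 2): XOR-ing the surviving
# data devices with the parity device reconstructs the lost contents.
lost = 2
survivors = [d for i, d in enumerate(data) if i != lost] + [parity]
recovered = xor_bytes(survivors)
assert recovered == data[lost]
```

The same property holds regardless of which single device is lost, since XOR-ing the parity with all but one data device cancels every term except the missing one.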
One downside to this approach is that the smallest unit of transfer between the memory controller and the collection of memory devices can be quite large (e.g., 512 bytes). For some applications, such large block sizes can significantly degrade run-time efficiency. One class of applications for which this is true is large-scale scientific/technical workloads that operate on large, sparse data sets; these workloads are among the most important for ultra-scale clusters. Hence, the most challenging main store reliability requirement is also the one that would most benefit from fine-grain main memory access.
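A back-of-the-envelope calculation (with assumed numbers) shows why the large minimum transfer hurts sparse workloads: if an access pattern touches a single 8-byte word at a random location, a 512-byte minimum block means only 1/64 of the bytes moved are actually used, whereas a finer-grain 32-byte access would use 1/4 of them.

```python
# Illustrative arithmetic only: word and block sizes are assumptions,
# not figures from any specific memory system.
def useful_fraction(word_bytes, block_bytes):
    """Fraction of transferred bytes actually consumed when one word
    is touched per minimum-sized block transfer."""
    return word_bytes / block_bytes

assert useful_fraction(8, 512) == 1 / 64  # ~1.6% of bandwidth is useful
assert useful_fraction(8, 32) == 1 / 4    # 25% with finer-grain access
```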