Reliable system architecture is critical to many high-performance computing (HPC) workloads. For example, modern graphics processing units (GPUs) offer high compute density and would otherwise be well suited to HPC workloads, but the graphics memory subsystem (e.g., GDDR) does not support the features needed to meet the reliability requirements of those workloads. Common failures in the memory subsystem include transient faults (TFs) and silent data corruption (SDC).
Augmenting the memory subsystem to make it reliable would require a change to an industry standard and cooperation from memory vendors, and it would result in a more complex and more expensive system.