The present invention relates to computer architectures and in particular to circuits for the mitigation of soft errors in computer architectures such as graphics processing units.
The increasing complexity and decreasing scale of integrated circuits used for electronic computers make such electronic computers increasingly susceptible to “soft errors”. Soft errors are generally those which do not reflect a fundamental failure in the circuit but rather an episodic error, for example, caused by a particle strike or random electrical noise which switches the state of a logical gate or memory cell. In this regard, soft errors can affect both the execution circuit of the computer (e.g. the ALU) by changing the state of logical gates and the memory circuit of the computer (e.g. the registers or other memory structures) by changing the state of a memory cell.
Known techniques for preventing soft errors include selecting packaging materials with low radioactivity and increasing the size of the circuit structures (so they are less susceptible to the small energy contributions of particle strikes). Known techniques for detecting and correcting soft errors include the addition of error detection and correction bits to memory and the use of redundant execution circuits (e.g. triple redundancy) to detect errors in the execution circuits and correct those errors through a majority vote or subsequent execution.
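By way of illustration only, the majority-vote correction used with triple redundancy can be sketched in software as follows. This is a minimal model under stated assumptions, not a description of any particular circuit: the three redundant execution results are represented as plain integers, and the function names `flip_bit` and `majority_vote` are hypothetical names chosen for this example.

```python
def flip_bit(value: int, bit: int) -> int:
    """Model a soft error by inverting a single bit of `value`."""
    return value ^ (1 << bit)


def majority_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three redundant results.

    Each output bit takes the value held by at least two of the
    three inputs, so an error confined to one copy is outvoted.
    """
    return (a & b) | (a & c) | (b & c)


def copies_disagree(a: int, b: int, c: int) -> bool:
    """Detect that at least one redundant copy differs from the others."""
    return not (a == b == c)


# Hypothetical example values: one of three copies suffers a bit flip.
correct = 0b10110010
faulty = flip_bit(correct, 3)

detected = copies_disagree(correct, faulty, correct)
voted = majority_vote(correct, faulty, correct)
```

In this sketch the vote recovers the correct value whenever the error is confined to a single copy; two copies corrupted in the same bit position would outvote the correct one, which is a known limitation of the scheme.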
Graphics processing units (GPUs) are specialized electronic computers typically used for high-speed processing of graphical data. Such GPUs employ a large number of execution units and distributed memory registers. Historically, soft errors have not been a significant concern in GPUs because occasional errors in graphic images are localized and easily ignored by the viewer.
GPUs are increasingly being applied to tasks previously assigned to general-purpose computers, tasks in which soft errors can significantly affect the validity of the results. The large number of execution units and registers in a GPU, however, can make it impractical to use conventional hardening techniques that increase the area of the devices or add redundant circuits for error detection.
It has been recognized that not all soft errors affecting a gate or memory cell will necessarily produce an error in the results of the computation. For example, errors in NOP instructions, logically masked bits, and dynamically dead code will not affect the computational output. Accordingly, efforts have been made to identify generally how susceptible a given architecture is to soft errors. Such information can generally guide the designer, for example, in where and how much hardening circuitry to employ.
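The effect of logical masking mentioned above can be illustrated with a small fault-injection sketch. It assumes a hypothetical computation that ANDs an 8-bit operand with a constant mask, flips each operand bit in turn, and counts how many of those single-bit errors leave the output unchanged. The names `flip_bit`, `masked_and`, and `count_masked_faults` are illustrative assumptions and do not correspond to any particular tool or architecture.

```python
def flip_bit(value: int, bit: int) -> int:
    """Model a soft error by inverting a single bit of `value`."""
    return value ^ (1 << bit)


def masked_and(operand: int, mask: int) -> int:
    """Hypothetical computation: only bits selected by `mask` reach the output."""
    return operand & mask


def count_masked_faults(operand: int, mask: int, width: int = 8) -> int:
    """Count single-bit flips in `operand` that do not change the output."""
    golden = masked_and(operand, mask)  # error-free reference result
    masked = 0
    for bit in range(width):
        if masked_and(flip_bit(operand, bit), mask) == golden:
            masked += 1
    return masked


# With a mask of 0x0F, flips in the upper four bits are logically masked.
harmless = count_masked_faults(0b10110010, 0x0F)  # evaluates to 4
```

Here half of the possible single-bit errors in the operand never reach the output, which is the kind of observation that can inform where hardening circuitry is most useful.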