Customers of cloud computing services run a variety of high performance computing (HPC) applications on service provider resources, including Computer Aided Engineering (CAE) and Computer Aided Design (CAD) tools; molecular modeling, genome analysis, and other types of scientific modeling; numerical modeling for financial and manufacturing applications; and other applications used to perform research in physics, chemistry, biology, computer science, or materials science. In some cases, these customers use graphics processing units (GPUs) for high performance computing tasks in order to take advantage of the parallelism inherent in such processors. In fact, there exist several large ecosystems for using GPUs for high performance computing (e.g., from GPU vendors and computing platform vendors) that essentially turn GPUs into single-chip supercomputers. For example, each GPU can be equivalent to multiple current-technology 16-core workstations.
Some high performance computing applications (e.g., financial and scientific modeling and simulation applications) are especially sensitive to bit errors, which can throw off their models significantly. In other words, correctness of computation is a critical requirement for some customer applications. However, many GPUs do not include hardware support for error checking and correction, or do not implement sufficient error detection and recovery for these types of applications.
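By way of illustration only, the sensitivity to bit errors noted above can be sketched in a few lines of Python. The sketch assumes values stored as IEEE 754 double-precision numbers; the helper name `flip_bit` and the chosen bit positions are hypothetical and are not taken from any described embodiment. It shows that a single inverted bit may be negligible or catastrophic depending on where it lands.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding inverted.

    Hypothetical helper for illustration: bit 0 is the least significant
    mantissa bit, bits 52-62 hold the exponent, and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the selected bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


original = 1.0
low = flip_bit(original, 0)    # error in the lowest mantissa bit
high = flip_bit(original, 62)  # error in the highest exponent bit

print(f"original:       {original!r}")
print(f"bit 0 flipped:  {low!r}")   # differs from 1.0 by 2**-52
print(f"bit 62 flipped: {high!r}")  # the exponent becomes all ones: infinity
```

An error in a low-order mantissa bit may be lost in ordinary rounding noise, whereas the same single-bit error in the exponent turns a valid input into infinity, which would then propagate through any model that consumes it.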
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that the embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.