Computing devices have become increasingly complex. Because of this complexity, the hardware and software implementations within these devices are susceptible to software errors and hardware soft errors. A hardware soft error may be best understood as a transient hardware error that is caused not by a defect in the hardware but by some condition that arose during operation, and that may not be repeatable. Such errors may also be referred to as errors during processing.
In single-processor systems, protections against these kinds of errors can be centrally located, which makes error recovery more controllable and simpler to implement. In multiple-processor systems, error recovery is much more complex, as there is no central place in which protections are located. This complexity is further exacerbated by the need to maintain state information for multiple execution threads on each of one or more processors. Currently, when an error occurs, all of the execution threads for all of the processors must be reloaded together with their entire configuration, which may include large amounts of data. It should be noted that the term ‘reloaded’ may also include being reset, re-initialized, and/or restarted.
In networks where other devices rely upon these complex devices being available, the downtime incurred in reloading all of the threads for all of the processors can become visible to the other devices. This, in turn, may slow or bring down the network while the device reloads.