The present invention relates generally to error recovery in a multi-threaded processor, and more specifically to a multi-threaded, multi-core processor configured to spare threads between cores.
Processors may be configured to execute one thread of instructions at a time, or multiple threads at the same time. Processors that are configured to execute multiple threads simultaneously are said to be in simultaneous multithreading (SMT) mode. In SMT mode, hardware resources are shared among multiple software threads executing on a machine. A thread typically exists within a process, and a process may have multiple threads that share computer resources, such as memory. A thread is considered the smallest unit of processing that can be scheduled by an operating system. In a multi-core processor that is executing multiple threads, the threads may be distributed across multiple cores in the processor, and each core may be configured to execute multiple threads simultaneously. Each thread appears to have its own complete set of architecture hardware.
Hardware errors may occur during execution of threads on an SMT processor, and when not detected, hardware errors may threaten data integrity. Software-based error recovery techniques may be used to address such errors; however, software-based error recovery may be relatively slow, requiring involvement from, for example, hypervisor code in the computing system. To avoid delays that may occur in software-based error recovery, hardware recovery may be implemented to restore a processor to a known good or safe state when the processor detects an error. During a hardware recovery process, which may last for thousands of processor cycles, the processor stops executing an instruction stream in which the error occurred, clears out an internal corrupted state, restores itself to a known error-free state, and restarts processing of the instruction stream from a point where the instruction stream last halted, which may be a known good state or a hardware checkpoint state. During the hardware recovery process, program flow is interrupted as the corrupted state is cleared and the known good state is restored; however, any software applications that are executing on the processor are not involved in the hardware recovery process. On a multi-core processor that is running in SMT mode, hardware recovery from a detected error may require that the recovery process be applied to all threads that are running on the core on which the error occurred, although the error may be isolated to a single thread. When repeated recovery actions on the same core cannot overcome the error, the hardware recovery process may include, for example, sparing the whole state of the faulty core to another core, since the failing core is assumed to have one or more non-correctable errors. Such whole-core sparing may disrupt the progress of one or more threads that are successfully executing on the faulty core.
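By way of illustration only, the conventional recovery flow described above may be modeled in software as in the following minimal sketch. The sketch is not a description of any particular processor. The names `Core`, `handle_error`, and `error_persists`, the retry limit `max_retries`, and the reduction of each thread's architected state to a single integer are all hypothetical and are introduced here solely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Core:
    """Hypothetical model of one SMT core (illustration only)."""
    core_id: int
    # thread id -> architected state, reduced here to a single integer
    threads: Dict[int, int] = field(default_factory=dict)
    # last known good state for every thread on the core
    checkpoint: Dict[int, int] = field(default_factory=dict)

    def take_checkpoint(self) -> None:
        self.checkpoint = dict(self.threads)

    def recover(self) -> None:
        # Recovery is applied to ALL threads on the core, even when the
        # error is isolated to a single thread.
        self.threads = dict(self.checkpoint)


def handle_error(faulty: Core, spare: Core,
                 error_persists: Callable[[int], bool],
                 max_retries: int = 3) -> List[str]:
    """Retry recovery on the faulty core, then fall back to whole-core sparing.

    error_persists(attempt) is an assumed stand-in for the hardware error
    checkers; max_retries is an assumed limit on repeated recovery actions.
    """
    log: List[str] = []
    for attempt in range(1, max_retries + 1):
        faulty.recover()
        log.append(f"recover core {faulty.core_id} attempt {attempt}")
        if not error_persists(attempt):
            return log  # error cleared; all threads resume from the checkpoint
    # Repeated recovery failed: move the whole core state to the spare core.
    spare.threads = dict(faulty.checkpoint)
    spare.checkpoint = dict(faulty.checkpoint)
    faulty.threads.clear()
    log.append(f"spare core {faulty.core_id} -> core {spare.core_id}")
    return log


if __name__ == "__main__":
    core0 = Core(0, threads={0: 100, 1: 200})
    core0.take_checkpoint()
    core0.threads = {0: 105, 1: 207}  # both threads make progress
    core1 = Core(1)
    # A persistent error forces whole-core sparing of both threads.
    for entry in handle_error(core0, core1, error_persists=lambda n: True):
        print(entry)
```

In this sketch, both threads are rolled back and moved to the spare core even though only one of them may have encountered the error, which models the disruption to successfully executing threads noted above.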