Some multiprocessor systems exist today that have been designed to offer increased reliability using paired microprocessor cores. An exemplary system, which has been used to achieve industry-leading reliability, is described by Timothy J. Slegel et al., IBM'S S/390 G5 MICROPROCESSOR DESIGN, IEEE MICRO, March 1999. However, this prior art design is based on an approach that completely duplicates an I (Instruction) unit and E (Execution) unit of the core. That is, on every clock cycle, signals coming from these units, including instruction results, are cross-compared in an R (Reliability) unit and the L1 cache. If the signals do not match, hardware error recovery is invoked. This checking scheme solves the problems associated with traditional checking, although at an additional cost in die area.
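The cross-comparison described above can be illustrated with a minimal software sketch. It is not the G5 hardware mechanism: the names run_lockstep, make_faulty and recover are hypothetical, the "cores" are modeled as plain functions, and the recovery routine simply retries on a path assumed to be fault-free.

```python
def run_lockstep(core_a, core_b, inputs, recover):
    """Feed every input to both cores and cross-compare the results.

    core_a, core_b: callables standing in for the duplicated units.
    recover: callable invoked on a mismatch (hypothetical recovery hook).
    """
    results = []
    for cycle, value in enumerate(inputs):
        out_a = core_a(value)
        out_b = core_b(value)
        if out_a != out_b:
            # Mismatch detected: do not trust either result.
            results.append(recover(cycle, value))
        else:
            results.append(out_a)
    return results


def make_faulty(core, bad_input):
    """Wrap a core so that one specific input yields a single-bit error."""
    def faulty(x):
        r = core(x)
        return r ^ 1 if x == bad_input else r
    return faulty


def square(x):
    return x * x


mismatch_cycles = []


def recover(cycle, value):
    # Assumption for illustration: a retry on a known-good path succeeds.
    mismatch_cycles.append(cycle)
    return square(value)


outputs = run_lockstep(square, make_faulty(square, 3), [1, 2, 3, 4], recover)
```

In this sketch the injected fault is caught on the third input, and the duplicated core contributes no additional throughput, which is the cost noted above.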
The R unit and L1 cache use traditional error-checking approaches. All arrays in an L1 cache unit are protected with parity except for the store buffers, which are protected with ECC. Since the L1 is a store-through design, another valid copy of the data will always be in L2 or in memory. As an aside, since the L2 is a store-in design, it is protected by ECC, because it often holds the only valid copy of data. If the R unit or L1 cache detects an error, the processor automatically enters an error recovery mode of operation. This process is done purely in hardware without any millicode intervention, since the processor may be in some indeterminate state that may not be able to run millicode. This error recovery mode also lets the processor recover while it is executing in millicode.
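The reason parity (detection only) suffices for a store-through L1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the hardware recovery mode described above: the class StoreThroughL1, the dictionary standing in for L2, and the refetch-on-error policy are all hypothetical.

```python
def parity(word):
    """Return the even-parity bit of an integer word."""
    return bin(word).count("1") & 1


class StoreThroughL1:
    """Toy store-through cache: every store is also written to the backing level."""

    def __init__(self, l2):
        self.l2 = l2          # assumed to always hold a valid copy (ECC-protected)
        self.lines = {}       # address -> (data, parity bit)

    def store(self, addr, data):
        self.l2[addr] = data                      # store-through to L2
        self.lines[addr] = (data, parity(data))   # L1 copy guarded by parity only

    def load(self, addr):
        entry = self.lines.get(addr)
        if entry is None or parity(entry[0]) != entry[1]:
            # Miss or parity error: discard the L1 copy and refetch from L2.
            data = self.l2[addr]
            self.lines[addr] = (data, parity(data))
            return data
        return entry[0]


l2 = {}
cache = StoreThroughL1(l2)
cache.store(0x10, 0b1011)

# Inject a single-bit error into the L1 copy, leaving its parity bit stale.
data, p = cache.lines[0x10]
cache.lines[0x10] = (data ^ 0b100, p)

recovered = cache.load(0x10)
```

A store-in level cannot use this approach, since a corrupted line may be the only copy; that is why ECC, which corrects as well as detects, is used there.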
While this design approach has offered high reliability, the duplicated resources are not available for independent use even when high reliability is not required. However, some classes of applications offer natural resilience, and it is advantageous to enable systems with higher performance when executing such algorithms. Examples of such algorithms are digital content creation and graphics processing, where deviations from the numerically correct results are not noticed by viewers; and convergence-based algorithms, wherein a corrupted numeric value may increase the runtime, but not impact final result correctness.
Thus, for example, a soft error occurring at a low-order mantissa bit may cause one or two additional iterations to be performed, but making twice the number of cores available to the application will result in an overall speedup.
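The resilience of a convergence-based algorithm to a low-order mantissa error can be illustrated with a small sketch. Newton's iteration for a square root is chosen here only as a familiar example; the helper names, the tolerance, and the iteration at which the bit is flipped are assumptions made for illustration.

```python
import struct


def flip_low_mantissa_bit(x):
    """Flip the least significant mantissa bit of an IEEE 754 double."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return struct.unpack("<d", struct.pack("<Q", bits ^ 1))[0]


def newton_sqrt(a, tol=1e-12, flip_at=None, max_iter=100):
    """Approximate sqrt(a) by Newton's method.

    flip_at: iteration number at which a simulated soft error corrupts
    the intermediate value (None means a fault-free run).
    Returns (result, iterations used).
    """
    x = a
    for i in range(1, max_iter + 1):
        x_new = 0.5 * (x + a / x)
        if i == flip_at:
            x_new = flip_low_mantissa_bit(x_new)  # injected soft error
        if abs(x_new - x) <= tol * abs(x_new):
            return x_new, i
        x = x_new
    raise RuntimeError("did not converge")


clean, clean_iters = newton_sqrt(2.0)
faulty, faulty_iters = newton_sqrt(2.0, flip_at=3)
```

Both runs converge to the same answer within the requested tolerance, because later iterations correct the perturbation; at worst the corrupted run needs a few more iterations.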
A single system may be used to execute resilient programs (e.g., financial forecasting and simulation) and programs requiring high accuracy (e.g., financial transactions), either simultaneously or at different times. A single application may also consist of components that require high reliability and components that are naturally resilient.
FIG. 1 shows a prior art multiprocessor system 10 including multiple processor cores 12a, . . . , 12n (such as embedded on a single chip or System on Chip (SoC)) interfaced with system components 14 comprising, for example, memory nest, interrupt controller, etc. Each core 12a, . . . , 12n communicates with system components, e.g., by receiving respective input signals 20a, . . . , 20n, and sending output signals 25a, . . . , 25n.
U.S. Pat. No. 7,065,672 entitled “Apparatus and methods for fault-tolerant computing using a Switching Fabric” describes a prior art computer system having a switching fabric that communicates transactions asynchronously between data processing elements and a target processor. While this reference describes a method for determining correct execution, voting is performed among a plurality of processors; the processors are not intended to be used independently, and are not shown to be independently usable because they lack switching fabric access. Furthermore, this prior art configuration is dependent upon the features of asynchronous switching networks and the operation of peripheral devices.
Current fault-tolerant systems do not enable both processors to provide independent operation when computational processes are naturally resilient, nor do they enable pairwise execution and checking when they are not.
Moreover, in the art, lockstep execution has been used to detect an execution failure. However, there exists no adequate recovery solution, nor a solution that implements a method to retry the execution.
It would be highly desirable to provide a system and method that provides a pairing facility that enables selective pairing of microprocessors for highly reliable (fault-tolerant) implementations under software control.
It would be further highly desirable to provide a system and method that provides a pairing facility that enables dynamic configuration of selectively paired microprocessors, and provides in the system the ability to recover on failed lockstep execution, and to restart execution.