Contemporary high-performance computing relies on many processor cores to perform intensive and complicated computations and processes. Some computing devices are specifically designed for such tasks and may include multiple processor sockets, with each processor having multiple processor cores. As such, a high-performance computing system may utilize fifty or more processor cores or threads to perform various workloads. Such systems may include “small” cores, a combination of “small” and “big” cores, or all “big” cores. Small cores may be defined as lower-feature processor cores designed for highly parallel computing, whereas big cores may be defined as general-purpose computer cores such as those typically found in standard server computing devices.
In a multi-core or many-core system, the failure of a single processor core may result in an unrecoverable error of the entire system, including any remaining good cores. The potential for critical failure of the entire system grows with the number of processor cores. For example, in a system with fifty processor cores, the failure of any one of those cores can cause the failure of the entire system. Additionally, the failure of a processor core in one location of the processor die may place undue stress on adjoining cores and tiles. Further, the loss of processor cores can increase the workload of the remaining cores, which may exacerbate any existing problems in a processor core or tile. Some systems include software solutions to manage processor core errors. However, such software solutions typically increase the workload overhead of the system and fail to consider the core or tile layout and its effect on the health and throughput of continued computing on the system.