The present invention relates generally to the electrical, electronic and computer arts, and, more particularly, to methods, apparatus and systems for selective duplication of subtasks.
In high-performance computing (HPC), two or more servers or computers are typically connected by high-speed interconnects to form an HPC cluster. A cluster consists of several servers networked together to act like a single system, with each server in the cluster performing one or more specific tasks. Each individual computer or server in the cluster may be considered a node. The nodes work together to accomplish an overall objective; accordingly, subtasks are executed on the nodes in parallel. However, a failure of any one subtask results in a failure of the entire parallel task.
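By way of illustration only, the failure behavior described above can be sketched in a few lines of Python. The sketch is not taken from any particular HPC system: threads stand in for cluster nodes, and the names `run_subtask`, `run_parallel_task` and `fail_on` are hypothetical and introduced solely for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def run_subtask(node_id: int, fail_on: Optional[int]) -> int:
    """Hypothetical subtask: return a partial result, or raise to simulate a failure."""
    if node_id == fail_on:
        raise RuntimeError(f"subtask on node {node_id} failed")
    return node_id * node_id


def run_parallel_task(num_nodes: int, fail_on: Optional[int] = None) -> int:
    """Run one subtask per simulated node and combine the partial results.

    An exception raised by any single subtask propagates out of
    future.result(), so the overall parallel task fails as a whole.
    """
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        futures = [pool.submit(run_subtask, n, fail_on) for n in range(num_nodes)]
        return sum(f.result() for f in futures)
```

Under these assumptions, `run_parallel_task(4)` combines the four partial results and returns 14, whereas `run_parallel_task(4, fail_on=2)` raises `RuntimeError` even though the other three subtasks completed successfully.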