High Performance Computing (HPC) systems may include a large number of nodes connected by a fabric for distributed computing. Moreover, an application is divided into tasks that run concurrently across the nodes in the HPC system. These tasks are broken down into sequential milestones, and tasks are expected to reach each of these milestones at the same time.
Unfortunately, if any node completes the work toward the next milestone more slowly than the other nodes, the progress of the entire application halts until the slowest task completes its work. When this happens, the application loses potential performance and power is wasted in the nodes that must wait.