Computer architectures have grown in complexity, from single-processor designs to architectures employing many parallel processors. In addition, High Performance Computing (HPC) systems may utilize processor groups to handle tasks according to various computational topologies and architectures. For example, an HPC application or job may be divided into tasks that are further subdivided into groups of related subtasks, commonly referred to as threads, which may run in parallel on a computational resource. In some architectures, related threads are processed in parallel, and completion of a task may require the completion of all related parallel threads that make up the task.
Computational efficiency may be enhanced by requiring parallel threads to complete, or to reach a milestone (e.g., a synchronization point, a global synchronization barrier, or more simply, a barrier), before progressing to further processing. Generally, individual threads perform independent computations before reaching a synchronization point. The threads may complete their work at different times, however, due to variability in the computational work assigned to different tasks, differences that may arise in computational conditions, and so on. Thus, there may be a load imbalance among the computational resources employed, with some threads waiting for other threads to complete. This load imbalance may lead to inefficiencies in performance and power utilization, since computational resources may sit idle while waiting for the remaining threads to complete.
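The barrier pattern and the resulting load imbalance described above can be sketched in a short example. This is a hypothetical illustration, not an implementation from the text: four worker threads perform independent "computations" of differing durations (modeled with `time.sleep`), then wait at a shared `threading.Barrier`; the faster threads sit idle at the barrier until the slowest thread arrives, after which all threads proceed together.

```python
import threading
import time

NUM_THREADS = 4
barrier = threading.Barrier(NUM_THREADS)
arrival_order = []   # which threads reached the barrier, in order
release_times = []   # when each thread passed the barrier
lock = threading.Lock()

def worker(thread_id, workload_seconds):
    # Independent computation phase; duration varies per thread,
    # modeling the load imbalance discussed in the text.
    time.sleep(workload_seconds)
    with lock:
        arrival_order.append(thread_id)
    # Fast threads idle here until the slowest thread arrives.
    barrier.wait()
    with lock:
        release_times.append(time.monotonic())

threads = [threading.Thread(target=worker, args=(i, 0.05 * i))
           for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Although the threads arrived at different times, they all pass
# the barrier at essentially the same moment.
spread = max(release_times) - min(release_times)
print(spread < 0.05)
```

The idle time each fast thread spends blocked in `barrier.wait()` is exactly the wasted capacity the passage refers to: those cores do no useful work while the slowest thread finishes.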