Computer architectures have grown in complexity, from single-core, single-processor designs to aggregations of multi-core processors. In addition, High Performance Computing (HPC) systems may utilize processor groups to handle work according to various computational topologies and architectures. For example, an HPC application may be divided into tasks that are in turn subdivided into groups of related subtasks (e.g., threads), which may be run in parallel on a computational resource. Related threads may be processed in parallel with one another as “parallel threads,” and the completion of a given task may entail the completion of all of the related threads that form the task.
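The decomposition described above — a task split into related subtasks run as parallel threads, with the task complete only when every thread has finished — can be sketched as follows. This is a minimal illustration in Python (an assumption; the source names no language or API), with `subtask` as a hypothetical unit of work:

```python
from concurrent.futures import ThreadPoolExecutor

def subtask(chunk):
    # Hypothetical unit of work operating on one slice of the task's data.
    return sum(chunk)

# A task's data, subdivided into four related subtasks.
data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, 100, 25)]

# The related subtasks run in parallel as threads; map() returns only
# after all of them complete, mirroring the task-completion condition.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(subtask, chunks))

task_result = sum(partials)
```

Here the task is considered complete only when all four threads have returned, which is exactly the condition under which load imbalance among the threads becomes visible.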
Computational efficiency may be enhanced when parallel threads complete at the same time, reaching a defined synchronization point nearly simultaneously. In practice, however, individual threads may arrive at the defined point at different times. This variation may stem from differences in the amount of computational work among various kinds of tasks, from variable computational conditions (e.g., timing delays in accessing data from memory and/or from remote resources), and so on. The result may be a load imbalance among the computational resources employed, with some threads waiting for other threads to complete. Such an imbalance may degrade both performance and power utilization, since computational resources may sit idle while the remaining threads finish.
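The waiting behavior described above can be made concrete with a barrier, a common realization of a “defined point” at which parallel threads rendezvous. The sketch below uses Python's `threading.Barrier` (an assumption — the source specifies no particular synchronization primitive), with artificial sleeps standing in for unequal computational work:

```python
import threading
import time

barrier = threading.Barrier(3)   # defined point for three parallel threads
idle_time = {}

def worker(name, work_seconds):
    time.sleep(work_seconds)     # unequal work: each thread computes for a different duration
    t0 = time.monotonic()
    barrier.wait()               # threads block here until all three arrive
    idle_time[name] = time.monotonic() - t0  # time spent idle at the barrier

threads = [threading.Thread(target=worker, args=(name, secs))
           for name, secs in [("fast", 0.0), ("medium", 0.05), ("slow", 0.15)]]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the run, the fastest thread has accumulated the most idle time at the barrier and the slowest the least — the load imbalance the passage describes, with idle resources wasting both performance and power until the last thread arrives.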