High-performance computing (HPC) solutions may apply a bulk-synchronous computational model to a large number of computing elements (e.g., processor cores), in which each computing element is assigned an approximately equal amount of work associated with one or more applications. At periodic and frequent milestones during the computation, each computing element may globally synchronize with the other computing elements in order to ensure correctness and to exchange data used in the next stage of the computation. A number of factors, however, may lead to load imbalances between the computing elements, and these load imbalances may in turn present challenges with regard to global synchronization. For example, manufacturing variations, increases in system scale, the complexity of dividing application work into equally sized pieces, jitter induced by operating system (OS) daemons or services, non-uniform memory access (NUMA) latencies, and unfairness between on-die interconnect routing protocols may all cause load imbalances that result in computing elements arriving at a particular global synchronization point at different moments in time. Moreover, overall application performance may be determined (and limited) by the last computing element to arrive at a synchronization point. Indeed, the computing elements that arrive early may waste a considerable amount of time and energy waiting at the synchronization point.
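The cost of arriving at a synchronization point at different moments may be illustrated with a minimal sketch. The function name `barrier_step_cost` and the per-element times below are hypothetical and chosen only for illustration. The sketch assumes a simplified model in which one bulk-synchronous step ends when the slowest computing element reaches the synchronization point, and every other element idles until then.

```python
from typing import List, Tuple


def barrier_step_cost(work_times: List[float]) -> Tuple[float, float]:
    """Return (step_time, total_wait) for one bulk-synchronous step.

    Simplified model (an assumption, not a measured system): the step
    finishes when the last element arrives, and each earlier element
    waits idle for the difference.
    """
    if not work_times:
        raise ValueError("work_times must not be empty")
    # The last element to arrive determines the duration of the step.
    step_time = max(work_times)
    # Summed idle time of all elements that arrived early.
    total_wait = sum(step_time - t for t in work_times)
    return step_time, total_wait


# Hypothetical compute times (arbitrary units) for four elements in one step.
times = [10.0, 10.0, 10.0, 14.0]
step_time, total_wait = barrier_step_cost(times)
print(step_time, total_wait)  # prints "14.0 12.0"
```

In this illustrative case, a single element that is slower by four units extends the step for all four elements, and the three early elements together spend twelve units waiting at the synchronization point.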