Large-scale clusters and supercomputers are commonly used to run parallel scientific applications with a large number of threads. Typically, each thread is spawned on a physical central processing unit (CPU). These applications are structured so that, after each block of computation, the threads synchronize through barrier calls. This pattern forms the compute-barrier kernel of most parallel applications (referred to as “Collectives”). A thread executing on a processor can be preempted when system activities, such as operating system (OS) daemons or interrupts, need to be scheduled. Preemption slows the affected thread, and because no thread can pass a barrier until every thread has reached it, the threads on all other processors must wait at the synchronization call.
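The effect of a single preempted thread on a compute-barrier step can be illustrated with a small simulation. The following is only a sketch under simplifying assumptions, not a model of any particular system: every thread is assumed to need the same compute time per step, a preemption is assumed to add a fixed delay, and the names `iteration_time`, `simulate`, `preempt_prob`, and `preempt_cost` are hypothetical and chosen for illustration.

```python
import random


def iteration_time(compute_time, delays):
    """Duration of one compute-barrier step.

    Every thread must reach the barrier before any may continue, so the
    step lasts as long as the slowest thread (compute time plus its delay).
    """
    return max(compute_time + d for d in delays)


def simulate(num_threads, num_iterations, compute_time,
             preempt_prob, preempt_cost, seed=0):
    """Total run time of a compute-barrier loop under random preemption.

    Assumptions (illustrative only): in each step, each thread is
    preempted independently with probability preempt_prob, and a
    preemption adds a fixed preempt_cost to that thread's step.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    total = 0.0
    for _ in range(num_iterations):
        delays = [
            preempt_cost if rng.random() < preempt_prob else 0.0
            for _ in range(num_threads)
        ]
        total += iteration_time(compute_time, delays)
    return total


if __name__ == "__main__":
    # Compare a few thread counts with the same per-thread preemption rate.
    for n in (1, 64, 4096):
        t = simulate(num_threads=n, num_iterations=1000, compute_time=1.0,
                     preempt_prob=0.001, preempt_cost=0.5)
        print(f"{n:5d} threads: total time {t:.1f}")
```

Because the step time is the maximum over all threads, a delay on any one thread is paid by every thread. In this sketch the probability that at least one thread is preempted in a given step grows with the number of threads, which is why the slowdown becomes more pronounced at scale even when the per-thread preemption rate is held constant.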