Conventional computer architectures include processing devices with multiple processors configured to process sequences of programmed instructions, such as threads of a program. The processors can be used to process tasks of a program in parallel with other tasks of the program. During processing of a program, the amount of parallel work (e.g., number of parallel tasks, amount of time to process parallel tasks, number of cycles to process parallel tasks) can vary over different portions or phases of the program. Processing delays (e.g., delays in execution) of one or more of these tasks can delay the execution of the program, negatively impacting performance.