Building applications that are responsive, high-performance, and power-efficient is crucial to delivering a satisfactory user experience. To increase performance and power efficiency, parallel sections of a program can be executed by one or more threads running on one or more computing cores of a central processing unit (CPU), graphics processing unit (GPU), or other parallel hardware. Typically, one thread, called the “main thread”, enters the parallel section, creates helper tasks, and notifies the other threads so that they help execute the parallel section.
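The pattern described above can be illustrated with a minimal sketch. It assumes a shared task queue guarded by a condition variable. The names `run_parallel_section`, `drain`, and `helper` are hypothetical and chosen only for illustration; they are not taken from any particular runtime.

```python
import threading
from collections import deque


def run_parallel_section(items, work, num_helpers=2):
    """Hypothetical sketch: the main thread creates tasks, notifies helpers, and helps."""
    tasks = deque()
    results = []
    cond = threading.Condition()
    state = {"open": False}  # becomes True once the tasks have been created

    def drain():
        # Any thread (main or helper) takes tasks until the queue is empty.
        while True:
            with cond:
                if not tasks:
                    return
                item = tasks.popleft()
            value = work(item)  # run the task outside the lock
            with cond:
                results.append(value)

    def helper():
        # Helper threads sleep on the condition variable until notified.
        with cond:
            cond.wait_for(lambda: state["open"])
        drain()

    helpers = [threading.Thread(target=helper) for _ in range(num_helpers)]
    for t in helpers:
        t.start()

    # The main thread enters the parallel section and creates the tasks ...
    with cond:
        tasks.extend(items)
        state["open"] = True
        # ... then notifies the other threads so that they can help.
        cond.notify_all()

    drain()  # the main thread also executes tasks
    for t in helpers:
        t.join()
    return results
```

For example, `sorted(run_parallel_section(range(8), lambda x: x * x))` yields the squares of 0 through 7. The results are sorted here because the order in which threads finish tasks is not deterministic.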
While task creation is typically inexpensive, notifying other threads can be comparatively expensive because it often involves operating system calls. For example, on a top-tier quad-core smartphone, the latency to signal threads waiting on a condition variable can be as high as 40 microseconds (approximately 90,000 CPU cycles). Each of several parallel sections of code may itself take less than 40 microseconds to execute, making such high signaling costs unacceptable for parallel execution. When signaling occurs on the critical path of execution, the parallel section does not begin executing, either on the critical path or on another thread woken by the signaling, until the signaling completes. Thus, rather than speeding up the original section of code, the parallelization slows down execution on the critical path by nearly a factor of two. Some of this latency can be recovered when the other thread executes a task in parallel with the task on the critical path of execution.
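The arithmetic behind these figures can be checked with a simple cost model. The sketch below is an illustration under stated assumptions, not a measurement: the 2.25 GHz clock rate is inferred from the stated 90,000 cycles per 40 microseconds, and the function names and the `helper_share` parameter are hypothetical.

```python
def latency_in_cycles(latency_us, clock_ghz):
    """Convert a latency in microseconds to CPU cycles (1 GHz = 1000 cycles per microsecond)."""
    return latency_us * clock_ghz * 1000


def completion_time_us(work_us, signal_us, helper_share):
    """Time to finish a parallel section when signaling sits on the critical path.

    Assumed model: no work starts until signaling completes; afterwards the
    helper executes `helper_share` of the work (0.0 to 1.0) while the main
    thread executes the remainder, and the section ends when both finish.
    """
    main_part = work_us * (1.0 - helper_share)
    helper_part = work_us * helper_share
    return signal_us + max(main_part, helper_part)


def slowdown(work_us, signal_us, helper_share):
    """Ratio of parallelized completion time to the original serial time."""
    return completion_time_us(work_us, signal_us, helper_share) / work_us
```

With a 40-microsecond section and 40 microseconds of signaling, `slowdown(40, 40, 0.0)` is 2.0, matching the near factor-of-two slowdown when the helper contributes nothing. If the helper takes half of the work, `slowdown(40, 40, 0.5)` drops to 1.5, which shows part of the latency being recovered while the section still runs slower than the serial version.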