In a task-based multi-threaded application, the work required by the application can be represented as a task with a computational requirement. Executing that work on multiple threads can shorten the execution time. Task-based multi-threaded applications rely on dynamic work stealing to maximize resource utilization and minimize execution time: unfinished subtasks that are waiting to be executed by one thread can be stolen by other available threads for execution.
The parallel_for, parallel_reduce and parallel_scan algorithms in Intel® Threading Building Blocks, for example, represent work of a multi-threaded application as a task with a computational requirement, and use recursive binary task division to dynamically create sub-tasks. When a thread picks up a task for execution, it examines the computational requirement of the task. If the thread determines that splitting the task is profitable or worthwhile, it divides the task into two subtasks.
FIG. 1 illustrates a prior art example of the recursive binary task division of a task. The binary task tree 100 shows the final result of the recursive binary task division of an original task 110. In the example of FIG. 1, five threads are assumed to be available for processing the original task 110 and the original task 110 is assumed to have a computational requirement of [0,100). For example, the original task 110 may be a task to iterate over a function one hundred times and thus the computational requirement of the original task 110 is set to a range from 0 (inclusive) to 100 (exclusive).
One thread out of the five available threads picks up the original task 110 for execution. Since there are four other available threads, it is worthwhile to split the original task 110 into subtasks that the four other available threads can steal for execution. The thread working on the original task 110 splits the original task 110 into a left subtask 120 and a right subtask 122 of equal computational requirement. The left subtask 120 and the right subtask 122 have computational requirements of [0,50) and [50,100), respectively.
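The single split described above can be sketched as follows. This is a minimal illustration only: the helper name split_range and the use of an integer midpoint are assumptions made for the example and are not taken from Intel® Threading Building Blocks.

```python
def split_range(lo, hi):
    """Split the half-open range [lo, hi) into two halves.

    Hypothetical helper for illustration; the midpoint is rounded down,
    so an odd-sized range yields a right half that is one unit larger.
    """
    if hi - lo < 2:
        raise ValueError("range is too small to split")
    mid = lo + (hi - lo) // 2
    return (lo, mid), (mid, hi)


# Original task [0,100) becomes a left subtask and a right subtask.
left, right = split_range(0, 100)
# left == (0, 50), right == (50, 100)
```

Each half keeps the inclusive lower bound and exclusive upper bound of the original range, so the two subtasks together cover every iteration exactly once.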
The recursive binary task division continues to split each subtask into two subtasks having equal computational requirements until the number of final subtasks is equal to the number of available threads. Each available thread is able to execute a respective one of the five subtasks 132, 134, 136, 140, and 142. However, even though the computational requirement of the original task 110 is divisible into five equal portions, the binary task tree 100 does not yield subtasks of equal computational requirements.
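The imbalance can be reproduced with a short simulation. The sketch below is not the scheduler's actual algorithm: the function name divide is hypothetical, and the choice of which subtask to split next (the largest pending one, leftmost on ties) and the integer midpoint are assumptions made so that the example stops at exactly one subtask per thread.

```python
def divide(lo, hi, num_threads):
    """Halve subtasks of [lo, hi) until there is one subtask per thread."""
    leaves = [(lo, hi)]
    while len(leaves) < num_threads:
        # Assumption: split the largest pending subtask; max() returns the
        # first maximal index, so ties go to the leftmost subtask.
        i = max(range(len(leaves)), key=lambda k: leaves[k][1] - leaves[k][0])
        a, b = leaves[i]
        if b - a < 2:
            break  # no subtask can be split further
        mid = a + (b - a) // 2
        leaves[i:i + 1] = [(a, mid), (mid, b)]
    return leaves


subtasks = divide(0, 100, 5)
sizes = [b - a for a, b in subtasks]
# subtasks == [(0, 12), (12, 25), (25, 50), (50, 75), (75, 100)]
# sizes    == [12, 13, 25, 25, 25]
```

Under these assumptions, three of the five subtasks cover 25 iterations each and the other two cover 12 and 13, whereas an even five-way division would give every thread 20 iterations.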
To mitigate the load imbalance of the final subtasks, additional splitting and work stealing occur to spread the initially unbalanced load more evenly. Each work-stealing event, however, incurs overhead.