Current processing systems have multiple processing cores to provide parallel processing of computational tasks, which increases the speed of completing such tasks. For example, specialized processing chips such as graphics processing units (GPUs) have been employed to perform complex operations such as rendering graphics. A GPU is understood as a specialized processing circuit designed to rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display. GPUs may include hundreds if not thousands of processing cores, since graphics processing may be massively parallelized to speed rendering of graphics in real time. GPUs perform various graphics processing functions by performing calculations related to 3D graphics. These include accelerating memory-intensive work such as texture mapping and rendering polygons, and performing geometric calculations such as the rotation and translation of vertices into different coordinate systems. GPUs may also support programmable shaders, which can manipulate vertices and textures; oversampling and interpolation techniques to reduce aliasing; and very high-precision color spaces.
In multi-core systems, it is desirable to perform multi-threading in order to accomplish parallel processing of programs. Multi-threading is a widespread programming and execution model that allows multiple software threads to exist within the context of a single process. These software threads share the resources of the multi-core system, but are able to execute independently. Multi-threading can also be applied to a single process to enable parallel execution on a multi-core system. Because the threads of a multi-threaded program naturally lend themselves to concurrent execution, such a program can operate faster on computer systems that have multiple CPUs, CPUs with multiple cores, or a cluster of machines.
A task scheduler is a program or a module of a program that is responsible for accepting, ordering, and scheduling portions of the program to be executed on one or more threads that are executed by the cores in a multi-core system. These portions of a program are typically referred to as tasks. In any multi-thread capable system, scheduling and executing tasks requires synchronization. This synchronization introduces a serial point that effectively forces a multi-thread system to execute serially at that point, and the resulting effect on performance is described by Amdahl's law. Amdahl's law states that if P is the proportion of a program that can be made parallel (i.e., can benefit from parallelization), and (1−P) is the proportion that cannot be parallelized (remains serial), then the maximum speedup that can be achieved by using N processors is S(N)=1/((1−P)+P/N).
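The effect of the serial fraction can be made concrete with a short calculation. The following is a minimal sketch of Amdahl's law as stated above; the function name amdahl_speedup and the sample value P=0.95 are illustrative assumptions, not part of the described system.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Maximum speedup S(N) = 1 / ((1 - P) + P / N).

    p: proportion of the program that can be parallelized (0 <= p <= 1)
    n: number of processors (n >= 1)
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    serial_part = 1.0 - p      # work that stays on one processor
    parallel_part = p / n      # work divided across n processors
    return 1.0 / (serial_part + parallel_part)


if __name__ == "__main__":
    # Illustrative value only: 95% of the program is assumed parallelizable.
    for cores in (1, 2, 4, 8, 64, 1024):
        print(cores, round(amdahl_speedup(0.95, cores), 2))
```

Under this assumption the speedup can never exceed 1/(1−P)=20, no matter how many processors are added, which is why the serial portion introduced by synchronization dominates as the core count grows.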
Presently, there are three synchronization mechanisms employed by computer programs to address ordering or serialization issues that arise when using multiple threads to parallelize program execution. The least expensive mechanism is atomic instructions or operations, which require the fewest CPU cycles to synchronize an operation. The second, and next most expensive, mechanism is typically referred to as “lockless,” in which one or more atomic instructions are used to synchronize data and program operation. The third comprises mutual exclusion objects (mutexes), critical sections, and locks. These mechanisms are typically used to guard a region of a program against simultaneous access by multiple threads. Not only are these mechanisms the most expensive, they also tend to suffer from an additional issue: if a user or thread is pre-empted in its execution while it owns the lock, it can serialize a program for a significant amount of time.
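The third mechanism can be sketched as follows. This is only an illustration of a lock guarding a region of a program, written with Python's standard threading.Lock; the function name run_counter and its parameters are hypothetical. Python does not expose atomic CPU instructions directly, so the first two mechanisms are not shown.

```python
import threading


def run_counter(num_threads: int, increments: int) -> int:
    """Increment a shared counter from several threads under one lock."""
    counter = 0
    lock = threading.Lock()  # the mutex guarding the critical section

    def worker() -> None:
        nonlocal counter
        for _ in range(increments):
            # Only one thread at a time may execute this block; every
            # other thread that reaches it must wait (contention).
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_counter(num_threads=4, increments=10_000))
```

Every increment passes through the same lock, so the guarded region executes serially regardless of the number of threads. If the thread holding the lock is pre-empted, all other threads wait until it resumes and releases the lock.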
In addition to the cost of the serialization mechanism, another factor must also be considered, namely, simultaneous accesses to that specific mechanism. This is typically referred to as “contention” and is directly related to the number of users, tasks, or threads attempting to synchronize the same portion of a program. Contention reduces the speed of execution because cores must wait for other cores to complete their tasks.
Therefore, to maximize the potential of a multi-thread system to run a program in parallel, the serial tasks managed by a task scheduler must be minimized. In smaller scale multi-thread systems, concurrent execution is relatively simple. For example, a program with 500 to 1,000 tasks on four worker threads (e.g., one thread for graphics, one thread for artificial intelligence, etc.) will not encounter serious contention issues. However, as programs become more complex and both the number of tasks and the number of cores increase (e.g., 20,000 tasks on eight or more cores with hyper-threading), contention becomes a major obstacle to maximizing the parallel execution of the program.
The number of CPU cycles that must be executed during the synchronization is also a consideration. In the case of atomic operations, the CPU can only serialize a small amount of data (typically 4 to 8 bytes), so the cost may be only the number of CPU cycles required to execute the instruction plus the number of cycles required to propagate the data change. However, in the case of mutexes and critical sections, not only is the atomic penalty incurred (since they are implemented using atomics), but they are also commonly used to perform much more complex work that cannot be expressed with a single instruction. This additional complexity of work will incur many more CPU cycles, which in turn will increase the cost of the synchronization.
In this way, the overall cost of synchronization, or the amount of serial execution, can be described or computed as “Total Cost = Synchronization Mechanism Cost × CPU Cycles × Amount of Contention.” To reduce serialization to a minimum, it is therefore necessary to consider and reduce the total cost of synchronization.
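The cost relationship above can be written as a short calculation. The following is a sketch only: the function name total_sync_cost is hypothetical, and every input value is an invented relative weight chosen for illustration, not a measurement of any real processor or program.

```python
def total_sync_cost(mechanism_cost: float, cpu_cycles: float,
                    contention: float) -> float:
    """Total Cost = Synchronization Mechanism Cost x CPU Cycles x Contention."""
    if min(mechanism_cost, cpu_cycles, contention) < 0:
        raise ValueError("all factors must be non-negative")
    return mechanism_cost * cpu_cycles * contention


if __name__ == "__main__":
    # Hypothetical relative weights, assumed purely for illustration.
    # Contention is modeled here as the number of threads competing
    # for the same synchronization point.
    atomic_cost = total_sync_cost(mechanism_cost=1, cpu_cycles=20, contention=8)
    mutex_cost = total_sync_cost(mechanism_cost=3, cpu_cycles=400, contention=8)
    print("atomic:", atomic_cost)
    print("mutex: ", mutex_cost)
```

Because the three factors multiply, reducing any one of them reduces the total proportionally. For example, halving the number of threads that compete for the same synchronization point halves the modeled cost.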
Thus, there is a need for a task scheduler that minimizes the amount of serial execution of program tasks in assigning threads to cores for parallel execution in a multi-core system. There is also a need for a task scheduler that organizes tasks in task groups and task group queues, which are in turn organized in a hierarchy for assignment to worker threads. There is a further need for a task scheduler that efficiently uses workers to perform tasks in parallel while minimizing locks. There is also a need for a task scheduler that minimizes the amount of contention a multi-core system incurs when multiple worker threads are attempting to acquire the same lock.