Multithreaded programming permits concurrent execution of computational tasks to improve application performance. Thread synchronization methods such as semaphores, mutual exclusion locks, and readers/writer locks are generally used to guarantee the atomicity of operations on shared data and to provide a consistent view of memory across concurrently executing threads. Multithreaded programming generally employs a model for assigning computational tasks to threads. Conventional models include creating a thread for each task and maintaining a thread pool (a special case of which is the Boss/Worker model). Creating a thread per task may cause performance issues when the frequency of task creation is high and the mean task duration is low. A thread pool typically incorporates some form of queue data structure to manage the work/resources assignable to each thread in the pool. This in turn requires some form of synchronization to prevent threads from interfering with one another while accessing the queue data structure.
In a thread pool model, a number of threads are created to perform a number of tasks which are usually organized in a queue referred to as a task queue. Typically there are many more tasks than threads. A thread requests the next task from the task queue upon completion of its current task. When all tasks have been completed (i.e., the task queue is empty), the threads can terminate or sleep until new tasks become available.
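By way of illustration only, the thread pool model described above can be sketched as follows. This is a minimal sketch in Python, not the code of Appendix I; the names `run_pool`, `tasks`, and `num_threads` are hypothetical and chosen for this example. Each thread requests the next task from the task queue upon completing its current task and terminates when the queue is empty. Note that the queue serializes access internally, which is the synchronization cost discussed in this section.

```python
import queue
import threading


def run_pool(tasks, num_threads):
    """Run every callable in `tasks` on a fixed pool of worker threads.

    Hypothetical helper for illustration; returns results in task order.
    """
    task_queue = queue.Queue()
    for index, task in enumerate(tasks):
        task_queue.put((index, task))

    results = [None] * len(tasks)

    def worker():
        while True:
            try:
                # get_nowait() acquires the queue's internal lock, so
                # concurrent workers are serialized at this point.
                index, task = task_queue.get_nowait()
            except queue.Empty:
                return  # task queue is empty: this thread terminates
            # Each task writes to its own slot, so no extra lock is needed.
            results[index] = task()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

For example, `run_pool([lambda i=i: i * i for i in range(20)], 4)` distributes twenty small tasks over four threads, reflecting the typical case in which there are many more tasks than threads.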
Thread synchronization mechanisms may cause execution bottlenecks when multiple threads are blocked while attempting to access a protected data structure or code segment. In addition to the overhead (and reduced concurrency) introduced by the use of a synchronization mechanism to parallelize an application, there may be overhead associated with the thread pool assignment model employed. The management of work/resources through the use of a queue generally requires synchronization as well. For example, in the Boss/Worker model, a main (Boss) thread performs the task of finding the work (i.e., filling the queue), while the worker threads select and complete the work from the queue. Since all worker threads require access to a single queue, synchronized access is generally required to provide a consistent view of the queue data structure among all executing threads. This may result in a performance bottleneck when multiple worker threads are blocked while attempting to access the queue.
Appendix I, which is incorporated herein by reference, shows the source code for matrix multiplication employing a conventional thread pool model of multithreaded programming that incorporates the use of work/resource queues. Matrix A has dimensions (N, M), matrix B has dimensions (M, K), and the result matrix C has dimensions (N, K). Each task processed by a worker thread multiplies one row of A by one column of B to produce one element of C. The total number of tasks that can be performed in parallel is therefore N*K. It should be noted that matrix multiplication is intrinsically parallel in that the calculation of any one of the tasks is independent of all the others. However, use of a queue to manage thread assignment typically reduces concurrency.
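The task decomposition described above can be made concrete with a short single-threaded sketch. It is written in Python for illustration and is not the code of Appendix I; the function names `dot_task` and `matmul` are assumptions of this example. Each call to `dot_task` is one of the N*K tasks and reads only one row of A and one column of B, which is why the tasks are independent of one another.

```python
def dot_task(a, b, row, col):
    """One task: the product of row `row` of A and column `col` of B."""
    return sum(a[row][m] * b[m][col] for m in range(len(b)))


def matmul(a, b):
    """Single-threaded reference: A is (N, M), B is (M, K), C is (N, K)."""
    n, k = len(a), len(b[0])
    c = [[0] * k for _ in range(n)]
    # N*K independent tasks; no task reads a value written by another.
    for row in range(n):
        for col in range(k):
            c[row][col] = dot_task(a, b, row, col)
    return c
```

With N = 2, M = 3, and K = 2, the two nested loops enumerate four tasks, each of which could in principle be computed by a different thread.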
When the matrix multiplication is performed, a mutual exclusion (mutex) lock is acquired to ensure that only one matrix multiplication is in progress. A mutex lock typically is used to synchronize threads, usually by ensuring that only one thread at a time executes a critical section of code. The mutex locks are statically initialized to zero before use. The main thread (the boss thread) checks whether its worker threads have been created. If not, it creates one for each CPU.
Once the worker threads have been created, the boss thread sets up a counter of work to do and signals the workers with a condition variable. Each worker thread selects a row and column for the input matrices, then updates the row and column variables so that the next worker thread will get the next row or column or both.
The mutex lock is then released so that computing the vector product can proceed in parallel. When the result is ready, the worker thread reacquires the mutex lock and updates the counter of work completed. The worker thread that completes the last unit of work signals the boss thread that the matrix multiplication is complete.
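The Boss/Worker sequence described in the preceding paragraphs can be sketched as follows. This is a simplified Python illustration under stated assumptions, not a reproduction of Appendix I: the function name `boss_worker_matmul`, the `state` dictionary, and the `num_workers` parameter are hypothetical, and the workers here are created on each call rather than once per CPU. The sketch shows where the single lock is taken: once to select a row and column, and once more to update the counter of completed work.

```python
import threading


def boss_worker_matmul(a, b, num_workers=4):
    """Multiply A (N, M) by B (M, K) using a boss thread and worker threads."""
    n, k = len(a), len(b[0])
    c = [[0] * k for _ in range(n)]
    total = n * k
    lock = threading.Lock()
    work_ready = threading.Condition(lock)  # boss -> workers
    work_done = threading.Condition(lock)   # last worker -> boss
    state = {"posted": False, "next": 0, "completed": 0}

    def worker():
        while True:
            with lock:
                while not state["posted"]:
                    work_ready.wait()
                if state["next"] >= total:
                    return  # nothing left to select
                # Select a task and advance the shared index so the next
                # worker gets the next row or column or both.
                task = state["next"]
                state["next"] += 1
            # Lock released: the vector product proceeds in parallel.
            row, col = divmod(task, k)
            c[row][col] = sum(a[row][m] * b[m][col] for m in range(len(b)))
            with lock:
                state["completed"] += 1
                if state["completed"] == total:
                    work_done.notify()  # last unit of work: wake the boss

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    with lock:
        state["posted"] = True
        work_ready.notify_all()
        while state["completed"] < total:
            work_done.wait()
    for thread in threads:
        thread.join()
    return c
```

Every task passes through the same lock twice, so as the number of workers grows, more of them wait at the two locked regions instead of computing.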
Porting legacy code to utilize multithreading typically requires significant changes to the legacy code. As the code in Appendix I illustrates, the multithreaded code for matrix multiplication using a queue and locking mechanisms is very different from its single-threaded counterpart. Producing the multithreaded version typically involves reworking the single-threaded counterpart to insert queue structures and locks that synchronize access to the queue structures. It should be noted that as the number of threads increases, contention for the queue increases due to the increased locking activity. This generally results in less concurrency, a consequence of using a non-parallel construct (e.g., a queue) to parallelize an application.
Additionally, the number of threads created for the thread pool with a queue structure is a parameter that typically has to be tuned for best performance. A larger thread pool increases resource usage, and too many threads may hurt performance due to increased context switching overhead, while too few threads may not fully utilize all the available resources.
What is needed is a method of assigning work to threads that does not require synchronization. What is further needed is a method that eliminates work queues to provide improved concurrency and increased application performance in a multithreaded programming environment.