Symmetric multiprocessing (SMP) is a computer architecture that provides fast performance by making multiple CPUs available to complete individual processes simultaneously (multiprocessing). Any idle processor can be assigned any task, and additional CPUs can be added to improve performance and handle increased loads. A chip multiprocessor (CMP) includes multiple processor cores on a single chip, which allows more than one thread to be active at a time on the chip. A CMP is SMP implemented on a single integrated circuit. Thread-level parallelism (TLP) is the parallelism inherent in an application that runs multiple threads at once. A goal of CMP is to allow greater utilization of TLP.
Parallel programming languages and libraries (e.g., OpenMP, TBB, Cilk) are used for writing multithreaded applications. A multithreaded application may consist of well-defined tasks that can be performed in parallel (regular parallelism), so that different cores can be assigned different tasks. However, an application may instead exhibit irregular parallelism (e.g., when it operates on tree-based dynamic data structures), where the set of tasks is not known until the structure is traversed. Workqueuing is the identification of tasks that can be performed in parallel and the queuing of those tasks. The queued tasks may then be dequeued and processed by processor cores that have available processing capacity. The workqueuing execution model enables a user to exploit irregular parallelism among dynamic data structures, and it is an effective technique for achieving scalable performance on a large number of processors.
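The workqueuing idea described above can be illustrated with a minimal sketch in standard C++. It is not taken from any particular runtime: the names WorkQueue, Node and sum_tree are hypothetical, and the sketch assumes a single producer thread that walks a binary tree and enqueues one task per node while a fixed pool of worker threads dequeues and executes the tasks.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical minimal work queue: one producer enqueues tasks,
// a fixed set of worker threads dequeue and run them.
class WorkQueue {
public:
    explicit WorkQueue(std::size_t workers) {
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { run(); });
    }
    ~WorkQueue() { finish(); }

    void enqueue(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        ready_.notify_one();
    }

    // Signal that no more tasks will arrive, then wait until the
    // workers have drained the queue. Safe to call more than once.
    void finish() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
        for (auto& t : threads_)
            if (t.joinable()) t.join();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                ready_.wait(lock, [this] { return closed_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // closed and fully drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // execute outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<std::function<void()>> tasks_;
    bool closed_ = false;
    std::vector<std::thread> threads_;
};

// Irregular workload: the number of tasks is only known by walking the tree.
struct Node {
    int value;
    Node* left;
    Node* right;
};

long sum_tree(const Node* root, std::size_t workers) {
    assert(workers > 0);  // with no workers, queued tasks would never run
    std::atomic<long> total{0};
    WorkQueue queue(workers);
    std::vector<const Node*> stack;
    if (root) stack.push_back(root);
    while (!stack.empty()) {
        const Node* n = stack.back();
        stack.pop_back();
        queue.enqueue([n, &total] { total += n->value; });  // one task per node
        if (n->left) stack.push_back(n->left);
        if (n->right) stack.push_back(n->right);
    }
    queue.finish();
    return total.load();
}
```

In this sketch the producer copies the node pointer into each task at enqueue time, so the traversal can move on without affecting tasks that are already queued. A production work queue would typically add per-thread queues or work stealing to reduce contention on the single mutex.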
In OpenMP, the workqueuing model is supported by the taskq and task pragmas. The taskq pragma specifies the environment within which the enclosed units of work (tasks) are to be executed, and the task pragma specifies a unit of work (task). When a taskq pragma is encountered, a master thread initializes a queue based on the taskq pragma and executes the code within the taskq block serially. When a task pragma is encountered, the task is conceptually added to the queue created by the master thread. A captureprivate clause may be used to ensure that a private copy of the link pointer (e.g., the pointer used to traverse a linked list) is captured at the time each task is enqueued. Slave threads dequeue the tasks from the queue and execute them.
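The taskq and task pragmas described above are a vendor extension, so the following sketch instead uses the standard OpenMP tasking constructs as a rough analog. It is an illustration under stated assumptions rather than the model itself: the single construct stands in for the master thread executing the enclosing block serially, the task construct stands in for the task pragma, and firstprivate(p) plays the role that captureprivate plays for the link pointer. The names Item, process_list and sum_of_squares are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical list element: each node is one unit of work.
struct Item {
    int input;
    int result;
    Item* next;
};

// Squares every input on the list, generating one task per node.
// If OpenMP is not enabled, the pragmas are ignored and the loop runs serially.
void process_list(Item* head) {
    #pragma omp parallel
    {
        #pragma omp single
        {
            // One thread walks the list serially and generates the tasks.
            for (Item* p = head; p != nullptr; p = p->next) {
                // firstprivate(p) captures the current value of the link
                // pointer when the task is created, so advancing p in the
                // loop does not change which node the task works on.
                #pragma omp task firstprivate(p)
                {
                    p->result = p->input * p->input;
                }
            }
        }
        // All generated tasks are complete by the end of the parallel region.
    }
}

// Usage helper: builds the list 1..n, processes it, and sums the results.
long sum_of_squares(std::size_t n) {
    std::vector<Item> items(n);
    for (std::size_t i = 0; i < n; ++i) {
        items[i].input = static_cast<int>(i + 1);
        items[i].result = 0;
        items[i].next = (i + 1 < n) ? &items[i + 1] : nullptr;
    }
    process_list(n > 0 ? &items[0] : nullptr);
    long sum = 0;
    for (const Item& it : items) sum += it.result;
    assert(sum >= 0);  // squares of small inputs cannot sum to a negative value
    return sum;
}
```

With GCC or Clang the tasks run in parallel when the file is compiled with -fopenmp; without that flag the same code still produces the same results serially.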
When a data dependence exists between the master and slave threads, a value (e.g., a heap variable) may need to be passed from the master thread (value producer) to a slave thread (value consumer). To prevent the master thread from overwriting the value before a slave thread has actually used it, a memory copy operation may be used to pass the value from the master to the slave. The memory copy operation copies the data from the master thread to the slave thread so that the master and slave threads operate on different memory locations. However, if excessive memory copy operations are performed, the bus bandwidth to the shared CMP/SMP memory hierarchy can become saturated, and the memory copying may then incur a high performance penalty.
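The copy-based approach described above might look like the following sketch. It assumes, for brevity, one thread per task instead of a work queue, and the name run_with_copies is hypothetical. The master reuses a single buffer and each task receives its own copy, so a later overwrite by the master cannot change what a slave thread reads. The cost is one copy of the whole buffer per task.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// The master refills one buffer per task; each task gets a private copy.
std::vector<long> run_with_copies(std::size_t tasks, std::size_t len) {
    assert(len > 0);
    std::vector<long> results(tasks, 0);
    std::vector<std::thread> workers;
    std::vector<int> buffer(len);  // the master's single working buffer

    for (std::size_t t = 0; t < tasks; ++t) {
        // The master overwrites the buffer with the data for task t.
        std::fill(buffer.begin(), buffer.end(), static_cast<int>(t + 1));

        // Capturing `buffer` by value performs the memory copy:
        // len * sizeof(int) bytes are copied for every task.
        workers.emplace_back([buffer, &results, t] {
            results[t] = std::accumulate(buffer.begin(), buffer.end(), 0L);
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

Capturing the buffer by reference instead would remove the copy but reintroduce the hazard described above, since the master could refill the buffer while a slave thread is still reading it.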
Another possible approach for passing the value between the master and slave threads is to have the master thread allocate memory space for each task. The master thread saves the data to these memory locations for later use by the slave threads. Each slave thread reads the data from its associated memory location, performs computations on the data, and then deallocates the memory space after the computation completes. This approach requires frequent memory allocation and deallocation operations, which can degrade memory system performance in the CMP/SMP memory hierarchy.
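A minimal sketch of the per-task allocation approach follows, again assuming one thread per task for brevity. The names consume and run_with_allocations are hypothetical. The master allocates and fills a separate heap block for every task, and ownership of the block is transferred to the slave thread, which releases it once the computation is done. Every task therefore pays for one allocation and one deallocation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <utility>
#include <vector>

// Slave side: read the block, compute, and release the block when the
// owning pointer `data` is destroyed as this function returns.
void consume(std::unique_ptr<int[]> data, std::size_t len, long* out) {
    long sum = 0;
    for (std::size_t i = 0; i < len; ++i) sum += data[i];
    *out = sum;
}

// Master side: one heap allocation per task, handed over to the slave thread.
std::vector<long> run_with_allocations(std::size_t tasks, std::size_t len) {
    assert(len > 0);
    std::vector<long> results(tasks, 0);
    std::vector<std::thread> workers;

    for (std::size_t t = 0; t < tasks; ++t) {
        std::unique_ptr<int[]> block(new int[len]);  // allocation for task t
        for (std::size_t i = 0; i < len; ++i)
            block[i] = static_cast<int>(t + 1);      // master saves the data

        // Ownership moves to the slave thread; the master never touches
        // this block again, so no copy of the data is needed.
        workers.emplace_back(consume, std::move(block), len, &results[t]);
    }
    for (auto& w : workers) w.join();
    return results;
}
```

Because a block is allocated on the master thread and freed on a different thread, this pattern also tends to stress the allocator, which is consistent with the performance concern noted above.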