The present disclosure relates generally to parallel computing. In particular, the present disclosure relates to thread handling in multithreaded parallel computing of nested threads.
Parallel computing is becoming more widely used as the number of CPU cores provided on a single chip increases. Massively parallel processors (MPPs) provide powerful parallel processing capabilities, but their coarse-grained parallelism limits them for applications having irregular parallelism. Explicit Multi-Threading (XMT) was developed to provide high performance general-purpose parallel computing using fine-grained parallelism, backwards compatibility with existing serial programs, down-scaling of parallelism, superior performance with respect to serial emulations even when the code provides a very limited amount of parallelism, and general scaling of parallelism (see “Explicit Multi-Threading (XMT): A PRAM-On-Chip Vision”, described in http://www.umiacs.umd.edu/users/vishkin/xmt/, which is herein incorporated by reference in its entirety). XMT uses a Single Program Multiple Data (SPMD) computer programming language which is capable of executing in serial or parallel modes, providing the computational power of parallel programming and the flexibility to handle varying levels of parallelism. Using SPMD, explicitly defined virtual threads may be executed, or may be derived from parallel or serial programs.
Parallel Random Access Model (PRAM) is a popular abstract shared memory algorithmic model suitable for parallel programming, as described in JaJa, J., “An Introduction to Parallel Algorithms”, Addison-Wesley (1992), which is herein incorporated by reference in its entirety. The XMT model is a hybrid of several known models, combining features from arbitrary concurrent-read, concurrent-write (CRCW) PRAM (for supporting an arbitrary number of virtual threads), queue-read, queue-write (QRQW) PRAM, as described in Gibbons, P. B., “Efficient Low-Contention Parallel Algorithms”, ACM Symposium on Parallel Algorithms and Architectures, 236-247 (1994), and a constant-time, limited-parallel variant of fetch-and-add, as described in Gottlieb, A., “The NYU Ultracomputer—Designing an MIMD Shared Memory Parallel Computer”, IEEE Trans. Comp., 175-189 (February 1983), both of which are herein incorporated by reference in their entirety. Ramachandran, V., “Emulations Between QSM, BSP and LogP: A Framework for General-Purpose Parallel Algorithm Design”, In Proc. of 1999 ACM-SIAM Symp. on Discrete Algorithms (1999), which is herein incorporated by reference in its entirety, describes a QSM model used to design and analyze general-purpose parallel algorithms, using algorithms which are adaptations of PRAM algorithms, along with a suitable cost metric.
The SPMD language uses Spawn and Join commands. The Spawn command facilitates the transition from serial mode to parallel mode, in which a plurality of parallel threads can operate concurrently. Each thread terminates with a Join command. Once all parallel threads have terminated, transition from parallel mode to serial mode occurs. The XMT architecture is described in the following references, all of which are herein incorporated by reference in their entirety: Naishlos, D., “Towards a First Vertical Prototyping of an Extremely Fine-Grained Parallel Programming Approach”, TOCS 36, 521-552, (Special Issue of SPAA2001) (2003); Vishkin, U., “Explicit Multi-Threading (XMT) Bridging Models for Instruction Parallelism (extended abstract)”, Proceedings of the 10th ACM Symposium on Parallel Algorithms and Architectures, 140-151, (1998); U.S. Pat. No. 6,542,918, by Vishkin, U., entitled “Prefix Sums and An Application Thereof”; U.S. Pat. No. 6,463,527, by Vishkin, U., entitled “Spawn-Join Instruction Set Architecture For Providing Explicit Multithreading” and its CIP 10/236,934; U.S. Pat. No. 6,768,336, by Vishkin, U., entitled “Circuit Architecture For Reduced-Synchrony On-Chip Interconnect”; and U.S. patent application Ser. No. 11/606,860 “Computer Memory Architecture for Hybrid Serial and Parallel Computing Systems” filed Nov. 29, 2006, claiming priority to U.S. Provisional Patent Application 60/740,255, filed Nov. 29, 2005.
In an XMT machine, a thread control unit (TCU) executes an individual virtual thread. A plurality of TCUs may be executing respective threads simultaneously. Upon termination of a virtual thread, e.g., via a JOIN command, the TCU performs a prefix-sum operation in order to receive a new thread ID. The TCU then executes a next virtual thread with the new ID. The plurality of TCUs repeat the process until all of the virtual threads have been completed.
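The TCU loop described above can be illustrated with a minimal software sketch. It is not the XMT hardware mechanism: the prefix-sum operation is emulated here with a lock-protected counter, and the names `PrefixSumCounter`, `run_spawn`, and `body` are hypothetical, introduced only for illustration. Each worker stands in for a TCU and obtains every thread ID, including its first, from the shared counter.

```python
import threading


class PrefixSumCounter:
    """Software stand-in (assumption) for a prefix-sum / fetch-and-add register."""

    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def prefix_sum(self, increment=1):
        # Atomically return the current value, then add the increment.
        with self._lock:
            old = self._value
            self._value += increment
            return old


def run_spawn(num_virtual_threads, num_tcus, body):
    """Run body(thread_id) for IDs 0..num_virtual_threads-1 on num_tcus workers."""
    counter = PrefixSumCounter()

    def tcu_loop():
        while True:
            # Finishing one virtual thread leads to a prefix-sum for a new ID.
            thread_id = counter.prefix_sum(1)
            if thread_id >= num_virtual_threads:
                return  # no virtual threads remain; this TCU goes idle
            body(thread_id)

    tcus = [threading.Thread(target=tcu_loop) for _ in range(num_tcus)]
    for tcu in tcus:
        tcu.start()
    for tcu in tcus:
        tcu.join()  # return to "serial mode" once every TCU is idle


# Usage: 10 virtual threads executed by 4 emulated TCUs.
results = [0] * 10


def square(i):
    results[i] = i * i


run_spawn(10, 4, square)
```

Because each ID is handed out exactly once, every virtual thread runs exactly once regardless of how the workers interleave.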
One SPMD model, referred to as the programming model, implements a PRAM-like algorithm and incorporates a prefix-sum statement. The parallel prefix-sum command may be used for implementing efficient and scalable inter-thread synchronization by arbitrating an ordering between the threads.
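As an illustration of how a prefix-sum can arbitrate an ordering between threads, the following sketch performs array compaction: each thread holding a nonzero element obtains a unique output slot from a shared base value. The lock-based `ps` helper and the name `compact_nonzero` are assumptions made for this example and are not taken from the XMT references.

```python
import threading


def compact_nonzero(values):
    """Copy the nonzero entries of values into a dense list, one thread per entry."""
    out = [None] * len(values)
    base = [0]  # shared prefix-sum base
    lock = threading.Lock()

    def ps(increment):
        # Emulated prefix-sum: return the old base and add the increment atomically.
        with lock:
            old = base[0]
            base[0] += increment
            return old

    def thread_body(i):
        if values[i] != 0:
            slot = ps(1)  # each caller receives a distinct slot
            out[slot] = values[i]

    threads = [threading.Thread(target=thread_body, args=(i,))
               for i in range(len(values))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out[:base[0]]


compacted = compact_nonzero([0, 3, 0, 5, 7, 0])
```

The order of the compacted elements depends on which thread reaches the prefix-sum first, so it is arbitrary, but no two threads ever receive the same slot.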
The SPMD programming model may be extended to support single SPAWN operations in which a thread performs a single SPAWN operation to introduce one new virtual thread as the need arises. Single SPAWN commands from multiple threads may be performed in parallel. The single SPAWN capability allows for programming that is more asynchronous and dynamic than the above programming model.
However, the capability of single SPAWN operations is limited to one level of nesting, so that with each single SPAWN command each TCU can generate one virtual thread in addition to the thread it executes that is associated with the original SPAWN command.
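The single SPAWN behavior can be sketched in software as follows. This is an assumption-laden illustration, not the XMT implementation: the class `SpawnBlock` and its method `sspawn` are hypothetical names, and a condition variable replaces the hardware prefix-sum. A running thread extends the range of virtual thread IDs by one, and idle workers (standing in for TCUs) pick up the new ID. The usage example keeps to one level of nesting, in which only the original threads issue a single SPAWN.

```python
import threading


class SpawnBlock:
    """Emulates a SPAWN block whose threads may each add one virtual thread."""

    def __init__(self, initial_threads, num_tcus, body):
        self._total = initial_threads  # virtual threads created so far
        self._next = 0                 # next unclaimed thread ID
        self._running = 0              # virtual threads currently executing
        self._cv = threading.Condition()
        self._num_tcus = num_tcus
        self._body = body

    def sspawn(self):
        """Create one new virtual thread and return its ID."""
        with self._cv:
            new_id = self._total
            self._total += 1
            self._cv.notify()  # wake one idle worker, if any
            return new_id

    def _claim(self):
        with self._cv:
            while True:
                if self._next < self._total:
                    tid = self._next
                    self._next += 1
                    self._running += 1
                    return tid
                if self._running == 0:
                    return None  # nothing left and nobody can spawn more
                self._cv.wait()

    def _finish(self):
        with self._cv:
            self._running -= 1
            if self._running == 0:
                self._cv.notify_all()

    def _tcu_loop(self):
        while True:
            tid = self._claim()
            if tid is None:
                return
            try:
                self._body(self, tid)
            finally:
                self._finish()

    def run(self):
        workers = [threading.Thread(target=self._tcu_loop)
                   for _ in range(self._num_tcus)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()


# Usage: 4 original threads, each issuing one single SPAWN (one nesting level).
INITIAL = 4
seen = []


def body(block, tid):
    seen.append(tid)
    if tid < INITIAL:
        block.sspawn()


SpawnBlock(INITIAL, num_tcus=3, body=body).run()
```

The sketch does not pass any initialization data from the spawning thread to the new thread, and the workers block on a shared condition variable while waiting for new work.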
A need exists for providing an XMT system in which SPAWN commands may be nested within nested SPAWN commands for providing multiple levels of nesting in association with an original SPAWN command for generating multiple virtual threads in association with the original SPAWN command.
A need exists for providing an XMT system which allocates the multiple virtual threads associated with the single SPAWN commands to TCUs.
A need exists to provide an XMT system in which initialization and other data associated with a virtual thread associated with a single SPAWN command is transferred to a TCU executing the virtual thread.
A need exists to provide an XMT system in which the aforementioned allocation of virtual threads and transfer of data associated with the virtual threads are implemented without undue synchronization and repeated wait periods, which reduce efficiency.