Massively parallel processing (“MPP”) systems may have tens of thousands of nodes connected via a network interconnect. Each node may include one or more processors (e.g., an AMD Opteron processor or an Intel Xeon with multiple processors), local memory (e.g., between 1 and 16 gigabytes), and a communications interface (e.g., HyperTransport technology) connected via a network interface controller (“NIC”) to routers of the network interconnect. Some of the local memory of each node may be accessible to the processors of the other nodes as shared memory. Thus, a processor of a node can access its own local memory and the shared memory of other nodes, referred to as remote memory for that processor. To access remote memory, a processor needs to send a remote memory access request through the network interconnect to the node where the remote memory is located. In contrast, to access local memory, the processor directly accesses the memory at that node. Because accessing remote memory requires sending a request through the network interconnect, the access time for remote memory is typically much longer than the access time for local memory. In addition, the access time for remote memory may vary considerably as the traffic on the network interconnect varies. For example, if many nodes frequently send memory requests to a single node, then the routes to that node and the shared memory of that node may become congested with a backlog of requests, resulting in increased access times.
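The effect of the longer remote access time on a program can be illustrated with a simple weighted average. The following is a minimal sketch only; the function name `mean_access_time` and the timing values are hypothetical and are not taken from any particular MPP system.

```python
def mean_access_time(remote_fraction, local_ns, remote_ns):
    """Average time per memory access when a given fraction of accesses is remote.

    All parameters are illustrative assumptions, not measured values.
    """
    return (1 - remote_fraction) * local_ns + remote_fraction * remote_ns


# Hypothetical timings: 100 ns for local memory, 2000 ns for remote memory.
all_local = mean_access_time(0.0, 100, 2000)   # every access is local
half_remote = mean_access_time(0.5, 100, 2000)  # half of the accesses are remote
```

Under these assumed timings, a program in which half of the accesses are remote spends most of its memory access time waiting on the network interconnect, and congestion would raise the remote term further.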
Such MPP systems with shared memory are generally well suited for executing programs with significant amounts of parallelism. Certain classes of parallel algorithms, however, may present difficulties for conventional MPP systems because of very poor locality of reference and very little concurrency per thread. For example, such programs may process large amounts of data represented by a graph of vertices connected by edges. Such graphs may include tens of millions of vertices, each of which represents, for example, a web page with the edges representing hyperlinks between the web pages. Such programs distribute the storage of the vertices of the graph across multiple nodes. When such a program executes, the program may specify tasks that can be executed in parallel. The tasks are executed in parallel by the nodes of the system. When a task needs to access a vertex that is stored in remote memory, the task issues a remote memory access request. While this remote memory access request is outstanding, the task waits for the memory request to complete, with the processor performing no useful work.
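The poor locality of reference described above can be sketched as follows. This is an illustration under stated assumptions: the placement rule in `owner` (vertex identifier modulo the number of nodes) and the small edge list are hypothetical, chosen only to show how following edges of a distributed graph leads to remote memory accesses.

```python
def owner(vertex_id, num_nodes):
    """Hypothetical placement rule: vertex v is stored on node v mod num_nodes."""
    return vertex_id % num_nodes


def count_accesses(edges, num_nodes):
    """Count local and remote accesses when each edge (u, v) is followed
    by a task running on the node that stores vertex u."""
    local = remote = 0
    for u, v in edges:
        if owner(u, num_nodes) == owner(v, num_nodes):
            local += 1   # v is in the local memory of u's node
        else:
            remote += 1  # v must be fetched through the network interconnect
    return local, remote


# A small illustrative graph distributed over 4 nodes.
edges = [(0, 1), (0, 4), (1, 2), (2, 6), (3, 7), (5, 6)]
local, remote = count_accesses(edges, 4)  # 3 local, 3 remote
```

Even in this small example, half of the edge traversals cross node boundaries. For graphs such as a web graph, whose edges follow no particular storage order, the remote fraction would be expected to grow as the number of nodes grows.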
Some operating systems support multiple threads of execution within a single process in which a program executes. Such an operating system may assign each task of a program to a separate thread. The operating system then controls the scheduling of the execution of the threads. For example, the operating system may select the threads in a round-robin manner and allow the selected thread to execute for no more than a certain time quantum. At the end of the time quantum, or when the executing thread transfers control to the operating system (e.g., by issuing an I/O request), a context switch to the operating system occurs, and the operating system suspends the execution of the currently executing thread and selects another thread for execution. When the suspended thread is again selected for execution, the thread resumes its execution where it left off. To track where a thread left off, the operating system saves and restores the context of the threads (e.g., program counter and registers). Because context switching between one thread and another is performed by the operating system, it may take a considerable amount of time owing to the overhead of saving the state of the thread being suspended, entering and exiting the operating system, and restoring or initializing the state of the thread that is being switched to. In addition, context switching to the operating system requires changing from a relatively low privilege mode (e.g., user privilege mode) to a high privilege mode (e.g., kernel privilege mode or operating system privilege mode) so that the operating system can access the critical resources (e.g., paging tables), and then changing back to the low privilege mode to prevent the user program from accessing those critical resources.
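The round-robin scheduling described above, together with its context switching overhead, can be sketched as a small simulation. This is a simplified model under stated assumptions: the function `round_robin`, the fixed `switch_cost`, and the workload are hypothetical, and saving and restoring of registers is represented only by the time it consumes.

```python
from collections import deque


def round_robin(work, quantum, switch_cost):
    """Simulate round-robin scheduling with a fixed time quantum.

    work maps a thread name to the time units it still needs.
    Returns (finish order, total elapsed time, number of context switches).
    """
    ready = deque(work.items())
    clock = 0
    switches = 0
    finished = []
    while ready:
        name, remaining = ready.popleft()
        ran = min(quantum, remaining)  # run until the quantum expires or the thread ends
        clock += ran
        remaining -= ran
        if remaining > 0:
            ready.append((name, remaining))  # suspended; resumes later where it left off
        else:
            finished.append(name)
        if ready:
            # Charge the operating system overhead whenever another dispatch follows.
            clock += switch_cost
            switches += 1
    return finished, clock, switches


finished, total_time, switches = round_robin(
    {"A": 5, "B": 2, "C": 4}, quantum=2, switch_cost=1
)
# finished == ["B", "C", "A"], total_time == 16, switches == 5
```

In this example the threads need 11 time units of useful work, but 16 units elapse because 5 context switches each cost one unit. A shorter quantum or a larger switch cost increases the share of time lost to the operating system.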
Some parallel computer architectures have processors, referred to as multi-threaded processors (e.g., the Cray XMT processor), that facilitate the switching of execution from one thread to another thread. Each multi-threaded processor has multiple hardware threads and can execute multiple threads simultaneously. (A conventional or non-multi-threaded processor is considered to have only one hardware thread.) Every clock period, the processor selects a hardware thread that is ready to execute and allows it to issue its next instruction. Instruction interpretation may be pipelined by the processor so that the processor can issue a new instruction from a different hardware thread in each clock period without interfering with other instructions that are in the pipeline.
The state of a thread of a multi-threaded processor may comprise the data of a thread status register, some number of general purpose registers (e.g., 128), and other special purpose registers. To reduce the processing overhead of switching between threads at each clock period, a multi-threaded processor may include a complete set of these registers for the maximum number of threads that can be executing simultaneously. As a result, the state of each thread is immediately accessible by the processor without the need to save and restore the registers when an instruction of that thread is to be executed.
Because an MPP system that includes such multi-threaded processors can switch to execute different hardware threads at each clock period, when a hardware thread issues a remote memory access request, the multi-threaded processor can continue executing the next instruction of another hardware thread at the next clock period. Such a processor would not select for execution any thread waiting on a remote memory access. As a result, such multi-threaded processors will continue to perform productive work as long as there is at least one hardware thread that is not waiting on a remote memory access or on some other event (e.g., waiting for exclusive access to a data structure).
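The latency hiding described above can be illustrated with a simplified simulation. The sketch below makes several assumptions that are not part of the description: every instruction is treated as a remote memory access, each access completes after a fixed `latency` in clock periods, and ready hardware threads are selected in round-robin order. The function name `utilization` is hypothetical.

```python
def utilization(num_threads, latency, cycles):
    """Fraction of clock periods in which some hardware thread issues an instruction.

    Assumption: each issued instruction is a remote memory access, and the
    issuing thread is not ready again until `latency` clock periods later.
    """
    ready_at = [0] * num_threads  # clock period at which each hardware thread is next ready
    issued = 0
    next_thread = 0
    for clock in range(cycles):
        # Scan the hardware threads in round-robin order for one that is ready.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= clock:
                issued += 1
                ready_at[t] = clock + latency  # thread now waits on its remote access
                next_thread = (t + 1) % num_threads
                break
        # If no thread is ready, the processor idles for this clock period.
    return issued / cycles


one = utilization(1, latency=4, cycles=100)   # 0.25
two = utilization(2, latency=4, cycles=100)   # 0.5
four = utilization(4, latency=4, cycles=100)  # 1.0
```

Under these assumptions, the processor stays fully busy once the number of hardware threads reaches the access latency measured in clock periods. With only one or two hardware threads, most clock periods are spent waiting on outstanding remote memory accesses.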
Most MPP systems, however, either do not have multi-threaded processors (i.e., they have only one hardware thread per processor) or do not have enough hardware threads (e.g., only two hardware threads per processor) to hide the latency of remote memory accesses and keep each processor busy when threads make frequent remote memory accesses. The operating systems of such MPP systems are typically responsible for scheduling the threads that are to run on each hardware thread, with the resulting overhead in switching to and from the context of the operating system. Moreover, these operating systems do not typically switch threads on remote memory accesses. Thus, programs that have a high degree of parallelism, but frequently access remote memory with very little concurrency per thread, do not perform well on such MPP systems.