As semiconductor technology approaches practical limits on increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the integrated circuit device, or chip level, multiple processing cores are often disposed on the same chip, functioning in much the same manner as separate processor chips or, to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle particular types of operations. Pipelining is also employed in many instances so that operations that take multiple clock cycles to perform are broken up into stages, enabling later operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
The net result of applying the aforementioned techniques is an ability to provide a multithreaded processing environment with a pool of hardware threads distributed among one or more processing cores in one or more processor chips and in one or more computers, capable of processing a plurality of instruction streams in parallel. It is expected that as technology advances, processor architectures will be able to support hundreds or thousands of hardware threads, and when multiple processors are combined into high performance computing systems such as supercomputers and massively parallel computers, a potential exists to support millions of hardware threads.
However, effective parallel processing requires that software applications running in a multithreaded processing environment take suitable advantage of multithreading capabilities. Software developers are typically more comfortable developing single threaded applications, since such applications simply follow the sequence of steps needed to perform a desired task. Supporting multithreading is often less intuitive, and typically requires minimizing conflicts and dependencies so that threads spend as little time as possible waiting for other threads to complete work upon which they depend. For example, if one thread needs to calculate an average of a set of values that are being calculated by other threads, that thread cannot perform its operation until all of the other threads have calculated their respective values. Threads that perform completely independent tasks, on the other hand, typically do not suffer from dependency concerns, so much of the effort associated with developing multithreaded applications is devoted to breaking tasks up into relatively independent threads so that inter-thread dependencies are minimized.
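The averaging example above can be sketched in Python's threading module; this is an illustrative sketch only, using a barrier to model the dependency (the worker values and thread counts here are arbitrary assumptions, not drawn from any particular application):

```python
import threading

NUM_WORKERS = 4
values = [0] * NUM_WORKERS
# Barrier for the workers plus the averaging thread: the averager cannot
# proceed until every worker has produced its value.
done = threading.Barrier(NUM_WORKERS + 1)

def worker(i):
    # Each worker computes its value independently of the others.
    values[i] = (i + 1) * 10
    done.wait()  # signal that this worker's value is ready

def averager(result):
    done.wait()  # stall until all workers have calculated their values
    result.append(sum(values) / len(values))

result = []
threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
threads.append(threading.Thread(target=averager, args=(result,)))
for t in threads:
    t.start()
for t in threads:
    t.join()
print(result[0])  # 25.0
```

The barrier makes the dependency explicit: the averaging thread is idle for as long as any worker has yet to finish, which is precisely the lost productivity the passage describes.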
Given the difficulties associated with developing multithreaded applications, a significant need has existed in the art for techniques that simplify the development of multithreaded applications. For example, significant efforts have been made to programmatically convert single threaded application code into multithreaded application code during compilation, e.g., using an optimizing compiler. With one methodology, for example, fine grained parallelism is employed to convert in-order code in an instruction stream into multiple, small out-of-order code segments, and instructions are inserted into the instruction streams to pass data between the code segments in the form of variables. One type of instruction is a “put” instruction, which sends a variable to another thread, and another type of instruction is a “get” instruction, which retrieves a variable from another thread. Through the use of these instructions, synchronization between code segments executing on multiple threads can be maintained by stalling a code segment that has issued a get instruction for a particular variable until another code segment has issued a corresponding put instruction for that variable.
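The put/get synchronization described above can be sketched in Python, using blocking queues as a stand-in for the hardware instructions; the `put_var`/`get_var` helpers and the per-variable channel map are illustrative assumptions, not the actual instruction set:

```python
import queue
import threading

# One blocking channel per shared variable; a get on a channel stalls
# until a corresponding put has delivered the value.
channels = {"x": queue.Queue(maxsize=1)}

def put_var(name, value):
    # Analogous to the "put" instruction: send a variable to another thread.
    channels[name].put(value)

def get_var(name):
    # Analogous to the "get" instruction: block until the value arrives.
    return channels[name].get()

def producer():
    x = 42            # code segment that computes the variable
    put_var("x", x)   # pass it to the dependent code segment

def consumer(out):
    x = get_var("x")  # stalls here until the producer's put completes
    out.append(x + 1)

out = []
t_consumer = threading.Thread(target=consumer, args=(out,))
t_producer = threading.Thread(target=producer)
t_consumer.start()
t_producer.start()
t_consumer.join()
t_producer.join()
print(out[0])  # 43
```

Note that the consumer can safely issue its get before the producer runs; the blocking queue enforces exactly the stall-until-put semantics the passage describes, and the time spent blocked in `get_var` corresponds to the communication latency discussed next.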
While the use of put and get instructions can effectively maintain synchronization between dependent code segments executing on different hardware threads, any time that a thread is stalled waiting for a variable from another thread represents lost productivity, so it is desirable to minimize the latency associated with communicating variables between threads.
Therefore, a significant need exists in the art for a manner of efficiently communicating data between multiple threads in a multithreaded processing environment to minimize latencies for inter-thread dependencies.