1. Field of the Invention
Embodiments of the present invention relate, in general, to communication and synchronization between threads in a multiprocessor system and particularly to mechanisms for fine-grained messaging between components of a multiprocessor system using a single instruction.
2. Relevant Background
Parallel computing on clusters of commodity multiprocessors has gained increasing attention in recent years. High-speed general-purpose networks and very powerful commodity multiprocessors are narrowing the performance gap between commodity multiprocessor clusters and supercomputers. Processors in workstation clusters do not generally share physical memory, so all interprocessor communication must be performed by sending messages over the network. Currently, the prevailing programming model for parallel computing on networks of workstations is message passing.
Parallel computing is the simultaneous execution of some combination of multiple instances (threads) of programmed instructions and data on multiple processors in order to obtain results faster. To support parallel (also called multithreaded) applications, multiprocessor systems provide a mechanism for communication and synchronization between the various processes (threads). Fundamentally, two mechanisms meet this need for communication and synchronization: message passing and shared memory. The shared memory approach to parallel (multithreaded) processing utilizes multiple processors accessing a shared or common memory system. Providing a fully cache-coherent shared memory is, however, highly complex, and this complexity has spawned several different approaches to address the need.
The other mechanism is generally known as message passing. Direct messaging is a form of message passing that features asynchronous, one-way messages that are handled by the recipient as soon as possible on receipt in order to minimize system complexity and message transport latency. Direct messaging is efficient, using hardware-supported messages that can be sent and received in user mode with very few assembler instructions. Efficient, in this context, means that messages as small as a cache line can be sent with high sustained bandwidth on the system interconnect. Direct messaging can be used to communicate a function/command to another processor, or it can be used to communicate data. In either case, the utility of direct messaging lies in its low-overhead mechanism for sending and receiving messages.
This asynchronous form of communication enables pipelining of messages. Since the introduction of direct messaging, numerous implementations of hardware-accelerated direct messaging have been proposed. Generally, direct message communication is formulated as logically matching request and reply operations. Upon receipt of a request, a request handler is invoked; likewise, when a reply is received, a reply handler is invoked.
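The request and reply pairing described above can be illustrated with a minimal software sketch. It is not drawn from any particular hardware implementation: the `Node` class, the `Message` record, the in-process `network` dictionary, and the 64-byte payload limit are all assumptions introduced purely for illustration, and the "work" performed by the request handler (uppercasing the payload) is a placeholder.

```python
from collections import deque
from dataclasses import dataclass

CACHE_LINE_BYTES = 64  # assumption: a payload must fit in one 64-byte cache line


@dataclass(frozen=True)
class Message:
    kind: str       # "request" or "reply"
    src: int        # id of the sending node
    payload: bytes


class Node:
    """Hypothetical endpoint with an inbox and one handler per message kind."""

    def __init__(self, node_id, network):
        self.node_id = node_id
        self.network = network
        self.inbox = deque()
        self.replies = []
        network[node_id] = self

    def send(self, dst, kind, payload):
        # One-way and asynchronous: enqueue at the destination and return at once.
        if len(payload) > CACHE_LINE_BYTES:
            raise ValueError("payload exceeds one cache line")
        self.network[dst].inbox.append(Message(kind, self.node_id, payload))

    def poll(self):
        # Handle every pending message, dispatching on its kind.
        while self.inbox:
            msg = self.inbox.popleft()
            if msg.kind == "request":
                self.on_request(msg)
            else:
                self.on_reply(msg)

    def on_request(self, msg):
        # Request handler: do placeholder work, then send the matching reply.
        self.send(msg.src, "reply", msg.payload.upper())

    def on_reply(self, msg):
        # Reply handler: record the result at the original sender.
        self.replies.append(msg.payload)


network = {}
a, b = Node(0, network), Node(1, network)
# Pipelining: issue several requests before any reply has been handled.
for word in (b"ping", b"pong", b"done"):
    a.send(1, "request", word)
b.poll()  # node 1 runs its request handler three times
a.poll()  # node 0 runs its reply handler three times
print(a.replies)  # [b'PING', b'PONG', b'DONE']
```

Because `send` returns without waiting for a reply, the sender can queue all three requests back to back, which is the pipelining behavior that asynchronous messaging permits. A hardware implementation would replace the software queues with interconnect transfers; that part is outside the scope of this sketch.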
Current implementations of direct messaging are, however, not without their problems. Many current systems are inefficient, performing the communications discussed above with high latency and poor sustained bandwidth for small (<64B) messages. Furthermore, current implementations of direct messaging send content or instructions through multiple messages. For example, an instruction sending data to another destination may involve multiple instances of transferring data from memory to a scalability interface before the message is sent. Such multiple-message sequences are subject to system interrupts, resulting in inefficiencies and increased bandwidth demands because an interrupted message series must be resent. In addition, other proposed implementations of direct messaging are not compatible with commodity processor designs and instruction sets.