As the core count on processors used for high-performance computing continues to increase, the performance of the underlying memory subsystem becomes more important. In order to make effective use of the available compute power, applications will likely have to become more sensitive to the way in which they access memory. Applications that are memory bandwidth bound should avoid extraneous memory-to-memory copies. For many applications, the memory bandwidth limitation is compounded by the fact that the most popular and effective parallel programming model, Message Passing Interface (“MPI”), mandates copying of data between processes. MPI implementers have worked to make use of shared memory for communication between processes on the same node. Unfortunately, the current schemes for using shared memory for MPI can require either excessive memory-to-memory copies or potentially large overheads inflicted by the operating system (“OS”).
In order to avoid the memory copy overhead of MPI altogether, more and more applications are exploring mixed-mode programming models where threads and/or compiler directives are used on-node and MPI is used off-node. The complexity of shared memory programming using threads has hindered both the development of applications and the development of thread-safe and thread-aware MPI implementations. The initial attractiveness of mixed-mode programming was tempered by the additional complexity induced by finding multi-level parallelism and by initially disappointing performance results.
Portable Operating System Interface (“POSIX”) based operating systems generally support shared memory capability through two fundamental mechanisms: threads and memory mapping. Unlike processes, which allow for a single execution context inside an address space, threads allow for multiple execution contexts inside a single address space (note, a “process” is defined here as the combination of an execution context plus an address space). When one thread updates a memory location, all of the threads sharing the same address space also see the update. A major drawback of threads is that great care must be taken to ensure that common library routines are reentrant, meaning that multiple threads can safely execute the same library routine simultaneously. For non-reentrant functions, some form of locking must be used to ensure atomic execution. The same is true for data accessed by multiple threads: updates must be atomic with respect to one another, or else difficult-to-debug race conditions may occur. Race conditions and fundamentally non-deterministic behavior make threads difficult to use correctly.
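The locking requirement can be illustrated with a minimal Python sketch. The function name `run_counter` and its parameters are hypothetical, chosen only for this illustration. Every thread sees the same counter because all threads share one address space, and the lock makes each read-modify-write update atomic with respect to the others.

```python
import threading


def run_counter(num_threads=4, increments=10_000):
    """Increment one shared counter from several threads under a lock."""
    counter = {"value": 0}      # lives in the single shared address space
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            # The lock makes the read-modify-write sequence atomic
            # with respect to the other threads.
            with lock:
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

Without the lock, the increment is a separate read, add, and write, so concurrent threads may lose updates in a way that differs from run to run.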
In memory mapping, cooperating processes request a shared region of memory from the OS and then map it into their private address spaces, possibly at a different virtual address in each process. Once initialized, a process may access the shared memory region in exactly the same way as any other memory in its private address space. As with threads, updates to shared data structures in this region must be atomic.
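A minimal sketch of memory mapping follows, using Python's standard `mmap` module. As a simplifying assumption, two mappings of the same file inside one process stand in for two cooperating processes; each mapping is a distinct view of the same underlying region. The function name `demo_shared_mapping` is hypothetical.

```python
import mmap
import os
import tempfile


def demo_shared_mapping(size=4096):
    """Map one backing file twice and show that a write is visible in both views."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)          # give the backing file its final size
        view_a = mmap.mmap(fd, size)    # first mapping of the shared region
        view_b = mmap.mmap(fd, size)    # second, independent mapping
        try:
            view_a[0:5] = b"hello"      # ordinary memory write through view A
            return bytes(view_b[0:5])   # the update is read through view B
        finally:
            view_a.close()
            view_b.close()
    finally:
        os.close(fd)
        os.remove(path)
```

Once mapped, the region is read and written with ordinary slice operations, just like any other buffer owned by the process.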
Explicit message passing is an alternative to shared memory for intra-node data sharing. In message passing, processes pass messages carrying data between one another. No data is shared directly; instead, it is copied between processes on an as-needed basis. This eliminates the need for reentrant coding practices and careful updates of shared data, since no data is shared. The main downside to this approach is the extra overhead involved in copying data between processes.
In order to accelerate message passing, memory mapping is often used as a high-performance mechanism for moving messages between processes. Unfortunately, such approaches to using page remapping are not sufficient to support MPI semantics, and general-purpose operating systems lack the appropriate mechanisms. The sender must copy the message into a shared memory region and the receiver must copy it out—a minimum of two copies must occur.
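The two-copy minimum can be made concrete with a short sketch. An anonymous `mmap` region stands in for the shared memory region, and the 4-byte length header, the function names `send` and `recv`, and the single-message layout are all assumptions made for illustration, not the layout of any particular MPI implementation.

```python
import mmap

HEADER_BYTES = 4  # hypothetical layout: 4-byte little-endian length, then payload


def send(shared, payload):
    """Copy 1: the sender copies its private buffer into the shared region."""
    if HEADER_BYTES + len(payload) > len(shared):
        raise ValueError("message does not fit in the shared region")
    shared[0:HEADER_BYTES] = len(payload).to_bytes(HEADER_BYTES, "little")
    shared[HEADER_BYTES:HEADER_BYTES + len(payload)] = payload


def recv(shared):
    """Copy 2: the receiver copies the message out into its own private buffer."""
    length = int.from_bytes(shared[0:HEADER_BYTES], "little")
    return bytes(shared[HEADER_BYTES:HEADER_BYTES + length])


def demo():
    region = mmap.mmap(-1, 4096)   # stand-in for the shared memory region
    try:
        send(region, b"intra-node message")
        return recv(region)
    finally:
        region.close()
```

The payload crosses memory twice, once in `send` and once in `recv`, which is the overhead that single-copy schemes try to remove.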
As of MPI 2.0, MPI applications may make use of both threads and memory mapping, although few MPI implementations provide full support for threads. More commonly, MPI implementations utilize memory mapping internally to provide efficient intra-node communication. During MPI initialization, the processes on a node elect one process to create the shared memory region and then the elected process broadcasts the information about the region to the other processes on the node (e.g., via a file or the sockets API). The other processes on the node then “attach” to the shared memory region, by requesting that the OS map it into their respective address spaces.
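The initialization sequence described above can be sketched with Python's `multiprocessing.shared_memory` module. Several details here are assumptions made for illustration: the election rule (lowest local rank wins), the use of a small rendezvous file to broadcast the region name, and the names `create_and_announce`, `attach`, and `init_node`. For simplicity, all handles are opened inside one process rather than in separate processes.

```python
import os
import tempfile
from multiprocessing import shared_memory


def create_and_announce(rendezvous_path, size):
    """Elected process: create the region and publish its name via a file."""
    region = shared_memory.SharedMemory(create=True, size=size)
    with open(rendezvous_path, "w") as f:
        f.write(region.name)
    return region


def attach(rendezvous_path):
    """Other processes: read the published name and ask the OS to map the region."""
    with open(rendezvous_path) as f:
        name = f.read().strip()
    return shared_memory.SharedMemory(name=name)


def init_node(local_ranks, rendezvous_path, size=4096):
    leader = min(local_ranks)   # assumed election rule: lowest rank wins
    regions = {leader: create_and_announce(rendezvous_path, size)}
    for rank in local_ranks:
        if rank != leader:
            regions[rank] = attach(rendezvous_path)
    return leader, regions


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        leader, regions = init_node([3, 1, 2], os.path.join(tmp, "region_name"))
        try:
            regions[leader].buf[0] = 42   # write through the leader's mapping
            seen = {rank: regions[rank].buf[0] for rank in regions}
        finally:
            for region in regions.values():
                region.close()
            regions[leader].unlink()      # the creator removes the region
    return leader, seen
```

In a real launch each rank would run in its own process and call either `create_and_announce` or `attach`, depending on the outcome of the election.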
Note that the approach of using shared memory for intra-node MPI messages only works for the point-to-point operations, collective communication operations, and a subset of the MPI-2 remote memory access operations. Copying mandates active participation of the two processes involved in the transfer. Single-sided put/get operations, such as those in the Cray Shared Memory (“SHMEM”) programming interface, cannot be implemented using POSIX shared memory.
There are several limitations in using regions of shared memory to support intra-node MPI. First, the MPI model does not allow applications to allocate memory out of this special shared region, so messages must first be copied into shared memory by the sender and then copied out of the shared region by the receiver. This copy overhead can be a significant performance issue. Second, there is typically a limit on the amount of shared memory that a process can allocate, so the MPI implementation must decide how to use this memory most effectively, trading off the number of per-process messages it supports against the size of the contents of each message. The overhead of copying messages using shared memory has led researchers to explore alternative single-copy strategies for intra-node MPI message passing.
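The trade-off between message count and message size under a fixed shared memory budget reduces to simple arithmetic. The sketch below assumes a hypothetical layout in which the budget is split evenly among peer processes and each message slot carries a fixed-size header; the function name, the 64-byte header, and the example sizes are illustrative only.

```python
def slots_per_peer(budget_bytes, num_peers, payload_bytes, header_bytes=64):
    """Number of message slots each peer gets from a fixed shared memory budget."""
    per_peer_bytes = budget_bytes // num_peers   # even split among peers
    slot_bytes = header_bytes + payload_bytes    # one slot = header + payload
    return per_peer_bytes // slot_bytes


# Illustrative numbers: a 1 MiB budget shared with 7 peers.
budget = 1024 * 1024
large_slots = slots_per_peer(budget, 7, payload_bytes=8192)
small_slots = slots_per_peer(budget, 7, payload_bytes=1024)
```

Under these assumptions, shrinking the payload size buys more outstanding messages per peer, while larger slots leave room for fewer of them.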
One such strategy is to use the OS to perform the copy between separate address spaces. In this method, the kernel maps the user buffer into kernel space and does a single memory copy between user space and kernel space. The drawback of this approach is that the overhead of trapping to the kernel and manipulating memory maps can be expensive. Another limitation is that all transfers are serialized through the operating system. As the number of cores on a node increases, serialization and management of shared kernel data structures for mapping is likely to be a significant performance limitation. Another important drawback of this approach is that there are two MPI receive queues: one in the MPI library and one in the kernel. When the application posts a non-specific receive using MPI_ANY_SOURCE, great care must be taken to ensure that the atomicity and ordering semantics of MPI are preserved. There is a potential race in which a non-specific receive request is satisfied by both the MPI library and the operating system. Managing atomicity between events in kernel space and user space is non-trivial.
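The invariant at stake for a non-specific receive can be sketched as follows: whichever matcher reaches the posted receive first must claim it atomically, and the other must see that it is already taken. This is only an illustration of the invariant. Two Python threads stand in for the MPI library and the kernel, and the class `PostedReceive` and its lock are hypothetical; a real implementation cannot share an ordinary user-level lock across the user/kernel boundary, which is part of why the problem is hard.

```python
import threading


class PostedReceive:
    """A wildcard receive that may be matched by exactly one matcher."""

    def __init__(self):
        self._lock = threading.Lock()
        self.matched_by = None

    def try_claim(self, matcher):
        # Check-and-set under one lock so that two matchers can never both win.
        with self._lock:
            if self.matched_by is None:
                self.matched_by = matcher
                return True
            return False


def race_once():
    recv = PostedReceive()
    results = {}
    start = threading.Barrier(2)   # release both matchers at the same time

    def matcher(name):
        start.wait()
        results[name] = recv.try_claim(name)

    threads = [threading.Thread(target=matcher, args=(n,))
               for n in ("library", "kernel")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return recv.matched_by, results
```

Which matcher wins is not deterministic, but exactly one claim succeeds on every run.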
Another strategy for optimizing intra-node transfers is to use hardware assistance beyond the host processors. The most common approach is to use an intelligent, programmable network interface to perform the transfer. Rather than sending a local message out to the network and back, the network interface can simply use its direct memory access (“DMA”) engines to do a single copy between the communicating processes. The major drawback of this approach is serialization through the network interface, which is typically much slower than the host processor(s). Also, large coherent shared memory machines typically have hardware support for creating a global shared memory environment. This hardware can also be used when running distributed memory programs to map arbitrary regions of memory to provide direct shared memory access between processes. The obvious drawback of this approach is the additional cost of this hardware.
More recently, a two-level protocol approach for intra-node communication has been proposed that uses shared memory regions for small messages and OS support for remapping the pages of individual buffers for large messages. There has also been some recent work on optimizing MPI collective operations using shared memory for multi-core systems.
All communication between processes on the Cray XT uses the Portals data movement layer. Two implementations of Portals are available. The default implementation is interrupt driven, and all Portals data structures are contained inside the operating system. When a message arrives at the network interface of the Cray XT, the network interface interrupts the host processor, which then inspects the message header, traverses the Portals data structures, and programs the DMA engines on the network interface to deliver the message to the appropriate location in the application process's memory. This implementation is referred to as Generic Portals (“GP”) because it works both in Catamount on compute nodes and in Linux on service and I/O nodes. The other implementation supports a complete offload of Portals processing and uses no interrupts. When a message arrives at the network interface, all of the Portals processing occurs on the network interface itself. This implementation is known as Accelerated Portals (“AP”) and is available only on Catamount, largely due to the simplified address translation that Catamount offers.
For intra-node transfers, the Generic Portals implementation takes advantage of the fact that Portals structures for both the source and destination are in kernel space. The kernel is able to traverse the structures and perform a single memory copy to move data between processes, since all of user space is also mapped into kernel space. At large message sizes, it becomes more efficient for the kernel to use the DMA engines on the network interface to perform the copy, so there is a crossover point at which the kernel switches to this approach. For the Accelerated Portals implementation, all Portals data structures are in network interface memory, so the network interface must traverse these structures in the same way it does for incoming network messages, and there is little advantage to intra-node transfers. In fact, intra-node transfers are slower going through the network interface than through the operating system, due to the higher speed of the host processor relative to the network processor.
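The crossover point for Generic Portals can be reasoned about with a simple cost model. The sketch below assumes, purely for illustration, that each copy method costs a fixed startup time plus a per-byte time, and that the DMA path has the higher startup cost but the higher sustained bandwidth. The function names and all numeric values are hypothetical and are not measurements of any Cray XT system.

```python
def copy_time(nbytes, startup_s, bandwidth_bytes_per_s):
    """Linear cost model: fixed startup plus time proportional to size."""
    return startup_s + nbytes / bandwidth_bytes_per_s


def crossover_bytes(cpu_startup_s, cpu_bw, dma_startup_s, dma_bw):
    """Message size at which the DMA copy becomes as fast as the CPU copy."""
    if not (dma_startup_s > cpu_startup_s and dma_bw > cpu_bw):
        raise ValueError("model assumes DMA has higher startup and higher bandwidth")
    # Solve cpu_startup + n/cpu_bw == dma_startup + n/dma_bw for n.
    return (dma_startup_s - cpu_startup_s) / (1.0 / cpu_bw - 1.0 / dma_bw)


def choose_engine(nbytes, crossover):
    """Pick the copy method for one message given a precomputed crossover size."""
    return "cpu_copy" if nbytes < crossover else "dma_copy"
```

For example, with placeholder values of 1 microsecond and 1 GB/s for the processor copy and 5 microseconds and 2 GB/s for the DMA copy, `crossover_bytes(1e-6, 1e9, 5e-6, 2e9)` evaluates to roughly 8000 bytes under this model.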