Multiprocessor, high performance computers (e.g. supercomputers) are often used to solve large complex problems. FIG. 1 shows schematically a multiprocessor computer 10 having compute nodes 12 connected by an inter-node communication network 14. Each node 12 has a network interface 16, which provides a data connection to inter-node communication network 14, at least one processor 18, and a memory 20. In FIG. 1, the network interface 16, processor 18 and memory 20 are shown explicitly for only two of the illustrated nodes. Processors 18 may conveniently comprise microprocessors. One example microprocessor which is currently available is the AMD Opteron™ microprocessor.
Software applications running on such computers split large problems up into smaller sub-problems. Each sub-problem is assigned to one of compute nodes 12. A program is executed on one or more processors of each compute node 12 to solve the sub-problem assigned to that compute node 12. The program run on each compute node 12 has one or more processes. Executing each process involves executing a sequence of software instructions. All of the processes execute concurrently and may communicate with each other.
Some problems cannot be split up into sub-problems which are independent of other sub-problems. In such cases, to solve at least some of the sub-problems, an application process must communicate with other application processes that are solving related sub-problems to exchange intermediate results. The application processes cooperate with each other to obtain a solution to the problem.
Communication between processes solving related sub-problems often requires the repeated exchange of data. Such data exchanges occur frequently in high performance computers. Communication performance, in terms of bandwidth and especially latency, is a concern. Overall application performance is, in many cases, strongly dependent on communication latency.
Communication latency has three major components:
    the latency to transfer a data packet from a CPU or other device in a sending compute node to a communication network;
    the latency to transfer a data packet across the communication network; and,
    the latency to transfer a data packet from the communication network to a device such as a CPU in a receiving compute node.
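The three components listed above add up to the one-way latency seen by an application. The following is a minimal sketch of that bookkeeping only. The class name, field names and the microsecond figures are hypothetical and are chosen purely to show the arithmetic; they are not measurements of any system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """One-way latency split into the three components named in the text.

    All values are in microseconds. The names are illustrative assumptions.
    """
    send_side_us: float     # CPU or other device -> communication network
    network_us: float       # across the communication network
    receive_side_us: float  # communication network -> CPU or other device

    def total_us(self) -> float:
        """Sum of the three components."""
        return self.send_side_us + self.network_us + self.receive_side_us

    def shares(self) -> dict:
        """Fraction of the total contributed by each component."""
        total = self.total_us()
        return {
            "send": self.send_side_us / total,
            "network": self.network_us / total,
            "receive": self.receive_side_us / total,
        }


# Hypothetical figures, used only to exercise the arithmetic.
budget = LatencyBudget(send_side_us=0.5, network_us=1.0, receive_side_us=0.5)
print(budget.total_us())
print(budget.shares())
```

A breakdown of this kind makes it apparent which component dominates, and therefore whether effort is better spent on the node-side transfers or on the network itself.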
In attempts to reduce latency, various topologies (e.g. hypercube, mesh, toroid, fat tree) have been proposed and/or used for interconnecting compute nodes in multi-node computer systems. These topologies may be selected to take advantage of communication patterns expected for certain types of high performance applications. These topologies often require that individual compute nodes be directly connected to multiple other compute nodes.
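To illustrate why such topologies require each compute node to be directly connected to several others, the following sketch lists the direct neighbours of a node in a hypercube and in a two-dimensional torus. The function names and the numbering of nodes are assumptions made for illustration, and the two-dimensional torus is only one possible form of the toroid topology mentioned above.

```python
from typing import List, Tuple


def hypercube_neighbors(node: int, dimensions: int) -> List[int]:
    """Return the nodes directly connected to `node` in a hypercube.

    Nodes are assumed to be numbered 0 .. 2**dimensions - 1; two nodes are
    neighbours when their numbers differ in exactly one bit.
    """
    if not 0 <= node < (1 << dimensions):
        raise ValueError("node number out of range for this hypercube")
    return [node ^ (1 << bit) for bit in range(dimensions)]


def torus_neighbors_2d(x: int, y: int, width: int, height: int) -> List[Tuple[int, int]]:
    """Return the four neighbours of node (x, y) in a 2D torus.

    Links wrap around at the edges. The four neighbours are distinct
    only when width and height are both at least 3.
    """
    return [
        ((x - 1) % width, y),
        ((x + 1) % width, y),
        (x, (y - 1) % height),
        (x, (y + 1) % height),
    ]


print(hypercube_neighbors(5, 3))
print(torus_neighbors_2d(0, 0, 4, 4))
```

In a hypercube the number of direct links per node grows with the number of dimensions, whereas in the torus it stays fixed at four regardless of the size of the system.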
Low latency communication between processors in multiprocessor computers can be implemented using one of two paradigms: messaging and shared memory. Messages are used to communicate between nodes in distributed memory systems where each node has its own separate memory and a communication network connects the nodes together. For example, multiprocessor computer 10 in FIG. 1 is a distributed memory system.
If the nodes of a multiprocessor computer directly implement or emulate the sharing of memory, data can be communicated through the shared memory. One node can write data into a shared data structure in the shared memory, and one or more other nodes can then read that data. Some computers directly implement shared memory in hardware. Hardware-based shared memory is very difficult to implement in computers having more than about 64 processors, because the performance of existing cache coherency technologies does not scale well.
Larger computers of hundreds and thousands of processors almost exclusively use distributed memory. Messaging is used to implement low latency communication between processors. In these systems, shared memory is sometimes emulated on top of messaging to provide an alternative for applications that were developed to use shared memory for communication.
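The idea of emulating shared memory on top of messaging can be sketched as follows. This is a simplified, single-process illustration and not a description of any particular system: the class names, the message format, and the rule for mapping a global address to an owning node are all hypothetical, and an ordinary method call stands in for the communication network.

```python
class Node:
    """A compute node with its own private memory (offset -> value)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.local_memory = {}

    def handle(self, message):
        """Service one request message against local memory and return a reply."""
        kind, offset, value = message
        if kind == "write":
            self.local_memory[offset] = value
            return ("ack", offset, None)
        if kind == "read":
            return ("data", offset, self.local_memory.get(offset))
        raise ValueError("unknown message kind: %r" % (kind,))


class EmulatedSharedMemory:
    """Offers read/write on a global address; each access becomes messages."""

    def __init__(self, nodes, words_per_node):
        self.nodes = nodes
        self.words_per_node = words_per_node
        self.messages_sent = 0

    def _locate(self, global_address):
        # Assumed mapping: consecutive blocks of addresses belong to consecutive nodes.
        node_index, offset = divmod(global_address, self.words_per_node)
        return self.nodes[node_index], offset

    def _exchange(self, node, message):
        # A direct call stands in for sending a request and receiving a reply.
        self.messages_sent += 2
        return node.handle(message)

    def write(self, global_address, value):
        node, offset = self._locate(global_address)
        self._exchange(node, ("write", offset, value))

    def read(self, global_address):
        node, offset = self._locate(global_address)
        _kind, _offset, value = self._exchange(node, ("read", offset, None))
        return value


nodes = [Node(i) for i in range(4)]
shared = EmulatedSharedMemory(nodes, words_per_node=1024)
shared.write(2050, "intermediate result")   # lands on node 2, offset 2
print(shared.read(2050))
print(shared.messages_sent)
```

Even in this toy form, every emulated access costs a request and a reply, which is why the latency of the underlying messaging largely determines the performance of the emulated shared memory.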
One issue in emulating shared memory concerns the addressability of memory. High performance multiprocessor computer systems can incorporate large amounts of physical memory. For example, the inventors have designed a computer system which can incorporate 96 TB of physical memory. Memory density is anticipated to grow and costs will decrease. In the next few years, similar computer systems will probably incorporate in excess of 256 TB of physical memory. Directly addressing such large amounts of memory requires long addresses. For example, 48 bit addresses would be needed to directly address 256 TB of memory.
Unfortunately, some addressing systems which might be convenient to use within nodes of a computer 10 do not permit such long addresses. CPUs vary in their ability to support large address spaces. 32 bit CPUs only support 32 bit addressing. Some 64 bit CPUs (e.g. the AMD Opteron™) support 64 bit addressing inside the CPU, but only 40 bit addressing on the address bus external to the CPU. These CPUs are not capable of directly addressing 256 TB of physical memory.
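The address-width figures discussed above can be checked with a few lines of arithmetic. The sketch below assumes that "TB" means 2**40 bytes and that memory is byte-addressed. The function names are illustrative only.

```python
TB = 2 ** 40  # assumption: the text's "TB" is taken as 2**40 bytes


def address_bits_needed(num_bytes: int) -> int:
    """Smallest address width (in bits) able to address `num_bytes` bytes."""
    if num_bytes < 1:
        raise ValueError("num_bytes must be at least 1")
    # The highest address is num_bytes - 1; its bit length is the width needed.
    return (num_bytes - 1).bit_length()


def addressable_bytes(address_bits: int) -> int:
    """Number of bytes reachable with addresses of the given width."""
    return 2 ** address_bits


print(address_bits_needed(96 * TB))    # the 96 TB system
print(address_bits_needed(256 * TB))   # the anticipated 256 TB system
print(addressable_bytes(40) // TB)     # reach of a 40 bit external address bus, in TB
print(addressable_bytes(32) // 2**30)  # reach of 32 bit addressing, in GB (2**30 bytes)
```

Under these assumptions, 96 TB needs 47 bit addresses and 256 TB needs 48 bit addresses, whereas a 40 bit external address bus reaches only 1 TB and 32 bit addressing reaches only 4 GB.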
Another issue is that of access rights to memory. In a 12,000 processor system, more than one application may be executing simultaneously. The ability of one application to access the memory assigned to another application must be carefully controlled. Applications must be prevented from accessing memory used by an operating system.
Another issue relates to control over communications used to emulate shared memory. Unless the system provides a global address space that spans all nodes, implementing shared memory may involve mapping memory from one node into the address space of another node. Since operating systems normally manage memory resources and the mapping of virtual addresses to physical addresses, it would be natural to make the operating system responsible for the communications that implement shared memory. This may have the undesirable side effect of making it practically necessary for the operating system to have a role in all communications, including supporting message-based communication. In this case, an application would have to make a system call to the operating system to send or receive a message.
Unfortunately, system calls significantly increase latency. A system call causes a software interrupt CPU instruction to be executed. When the software interrupt instruction is executed, the CPU is forced to execute an interrupt routine. To execute the interrupt routine, a typical CPU must switch to privileged execution mode. The memory management unit in the CPU must be flushed and reloaded with the virtual address to physical address mappings for operating system memory. The CPU caches will be invalidated and flushed, because operating system code is now executing. The interrupt routine must determine which system call was made. If the system call is simple, the interrupt routine may execute the necessary code and return the results directly. If not (message sending and receiving is typically not simple), the interrupt routine adds the system call parameters to an internal work queue to be processed at some later time when the kernel acquires extended use of the CPU.
All of this complexity leads to excessive latency. Current practice in high performance computing is to bypass the operating system for message sending and receiving. This exacerbates the previous access privileges issue, because now applications are directly accessing memory to send and receive messages. This potentially allows applications to interfere with the operating system messages that implement shared memory.
There is a need for multi-node computer systems which have mechanisms for providing low-latency messaging between nodes and which address some or all of the above-noted problems.