Multiprocessor, high performance computers are often used to solve large complex problems. FIG. 1 shows a typical multiprocessor computer system 10 which has a number of compute nodes 12 connected by a communication network 14. In the example embodiment shown in FIG. 2, each compute node (e.g. 12A) includes a CPU 15, a memory 17, and a network interface 18 joined together by a system interconnect (or “system bus”) 16.
To expedite the completion of computational problems, most applications designed to run on such computers split large problems up into smaller sub-problems. Each sub-problem is assigned to one of the compute nodes. Since there are a large number of compute nodes, many sub-problems can be worked on simultaneously. A program is executed on each of CPUs 15 to solve the part of the large problem assigned to that CPU. Each instance of the executing program may be referred to as a process. All of the processes execute concurrently and may communicate with each other.
Some problems cannot be split up into sub-problems which can be completed independently of one another. For example, the completion of some sub-problems may require intermediate results from other sub-problems. In such cases an application process must communicate with other application processes that are solving related sub-problems to exchange intermediate results.
Communication between processes solving related sub-problems often requires the repeated exchange of data. Such data exchanges occur frequently, and communication performance, in terms of bandwidth and especially latency, is a concern. The overall performance of many high performance computer applications is highly dependent on communication latency.
Low latency communication between CPUs is implemented using one of two paradigms: messaging and shared memory. Messaging is used in computer systems having distributed memory architectures. In such computer systems each compute node has its own separate memory. A communication network connects the compute nodes together. For example, multiprocessor computer 10 in FIGS. 1 and 2 has a distributed memory architecture. Messaging involves sharing data by sending messages from one compute node to another by way of the communication network.
If a computer system directly implements, or emulates, memory sharing between compute nodes, data can be communicated by way of the shared memory. Some computers directly implement shared memory in hardware. Hardware-based shared memory systems are very difficult to implement in computer systems having more than about 64 compute nodes. Larger computer systems, which have hundreds or thousands of CPUs, almost exclusively use distributed memory. In these systems, shared memory can be emulated on top of messaging, but performance is only marginally satisfactory.
Low-latency messaging can be implemented in a variety of ways. The “rendezvous protocol” is well suited for large messages. To avoid computationally expensive memory-to-memory copies, the rendezvous protocol copies messages directly from an application buffer in the sender's memory to an application buffer in the receiver's memory. To achieve this, the sender must learn the address of the receiver's application buffer. The sender engages in an interaction (referred to as a rendezvous) with the receiver. The sender sends a short message indicating that it wants to send a large message to the receiver. The receiver identifies an application buffer and responds with a short message indicating it is ready to receive the large message and the address of a suitable application buffer. The sender sends the large message to the receiver where it is stored in the receiving application's buffer. The sender finishes by sending another short message to the receiver indicating that it has completed the message transmission.
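By way of illustration only, the four-step rendezvous exchange described above might be sketched as follows. The sketch models each node's memory as a dictionary of named buffers; the class `Node`, the function `rendezvous_send`, and the message labels `RTS`, `CTS`, and `DONE` are hypothetical names chosen for this example and are not taken from any particular messaging library.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical compute node; memory is modeled as named application buffers."""
    name: str
    buffers: dict = field(default_factory=dict)
    log: list = field(default_factory=list)   # short control messages received


def rendezvous_send(sender: Node, receiver: Node, src: str) -> str:
    """Move sender.buffers[src] into a receiver application buffer, with no temporary copy."""
    payload = sender.buffers[src]

    # 1. Sender -> receiver: short message announcing a large message.
    receiver.log.append(("RTS", len(payload)))

    # 2. Receiver identifies an application buffer and replies with its "address"
    #    (here, a dictionary key standing in for a memory address).
    dst = f"app_buf_{len(receiver.buffers)}"
    receiver.buffers[dst] = bytearray(len(payload))
    sender.log.append(("CTS", dst))

    # 3. Sender -> receiver: the large message, stored directly in the
    #    receiving application's buffer.
    receiver.buffers[dst][:] = payload

    # 4. Sender -> receiver: short message signalling completion.
    receiver.log.append(("DONE", dst))
    return dst


node_a = Node("12A", {"result": bytearray(b"x" * 4096)})
node_b = Node("12B")
where = rendezvous_send(node_a, node_b, "result")
```

In this sketch the large payload crosses the network once and is written once, at the cost of three short control messages.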
The “eager protocol” is suited for small messages and avoids the interaction overhead of the rendezvous protocol. Using the eager protocol, the sender sends the message to the receiver. The message is received into a temporary buffer. When the receiver is ready to receive the message, and an appropriate application buffer has been identified, the received message is copied from the temporary buffer to the application buffer. The eager protocol has the disadvantage of requiring a memory-to-memory copy at the receiver. For short messages the computational cost of this copy is less than the overhead of the rendezvous protocol.
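The eager path, and the choice between the two protocols, might be sketched as follows. This is a simplified model under stated assumptions: the class `EagerReceiver`, the function `choose_protocol`, and the cut-over size `EAGER_LIMIT` are hypothetical, and the value 1024 is a placeholder rather than a measured threshold. The application buffer is assumed to be at least as large as the message.

```python
class EagerReceiver:
    """Hypothetical receiver that stages unexpected messages in temporary buffers."""

    def __init__(self):
        self.temp_buffers = []   # messages that arrived before the receiver was ready
        self.copies = 0          # memory-to-memory copies performed so far

    def on_arrival(self, message: bytes) -> None:
        # The sender transmits without any prior handshake.
        self.temp_buffers.append(bytes(message))

    def receive(self, app_buffer: bytearray) -> int:
        # The receiver is now ready: copy from the temporary buffer to the
        # application buffer. This is the copy the rendezvous protocol avoids.
        message = self.temp_buffers.pop(0)
        app_buffer[:len(message)] = message
        self.copies += 1
        return len(message)


EAGER_LIMIT = 1024  # hypothetical cut-over size in bytes (an assumption)


def choose_protocol(size: int) -> str:
    """Pick the protocol whose overhead is expected to be lower for this size."""
    return "eager" if size <= EAGER_LIMIT else "rendezvous"


receiver = EagerReceiver()
receiver.on_arrival(b"hello")
buffer = bytearray(16)
length = receiver.receive(buffer)
```

In practice the cut-over size would be tuned so that, below it, the cost of the extra copy is less than the cost of the rendezvous handshake.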
To appreciate this invention, it is useful to understand how messaging is implemented at the sender and receiver. In a sending compute node 12A, network interface 18A is used to communicate with receiving compute node 12B. At the receiving compute node 12B, network interface 18B is used for communication with sending compute node 12A. Network interfaces 18A and 18B each provide control and data registers that are mapped into the memory address spaces of CPUs 15A and 15B respectively. The CPUs use the control and data registers to control communication.
Suppose that a process running on CPU 15A needs to make some data, which is in memory 17A, available to a process running on CPU 15B in the typical prior art computer system of FIGS. 1 and 2. Sending CPU 15A writes into the control and data registers of network interface 18A to send a message. There are two methods of doing this. In either method, CPU 15A writes the identity of the receiving compute node 12B into the control registers. If CPU 15A knows the destination address in receiving memory 17B, the destination address is written to the control registers of network interface 18A. Under the first method of sending a message, CPU 15A reads the message out of memory 17A under software control and writes the message into the data registers of network interface 18A.
Under the second method of sending a message, CPU 15A writes the address of the message in sending memory 17A into the control registers of network interface 18A. Network interface 18A uses a direct memory access (DMA) capability to transfer the message from sending memory 17A to network interface 18A. In both methods, network interface 18A constructs one or more packets containing the message and sends the packets via communication network 14 to receiving compute node 12B.
In modern high performance computers, the second method is used. This is predominantly because it allows CPU 15A to proceed with other work while the message is being transferred from memory 17A to network interface 18A. Under both methods, sending a message requires one or more writes to control registers of network interface 18A and the transfer of the message from memory to the network interface.
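The difference between the two sending methods can be illustrated with a toy model of a sending network interface. Everything here is an assumption made for the example: the class `NetworkInterface`, the register names `DEST_NODE`, `SRC_ADDR`, and `LENGTH`, and the 8-byte data register width do not describe any real device.

```python
class NetworkInterface:
    """Toy model of a sending network interface (all register names are hypothetical)."""

    def __init__(self, memory: bytearray):
        self.memory = memory        # sending node's memory, reachable by DMA
        self.control = {}           # control registers
        self.staged = bytearray()   # message bytes held for packetization
        self.cpu_data_writes = 0    # data-register writes performed by the CPU

    def write_control(self, reg: str, value) -> None:
        self.control[reg] = value

    def write_data(self, chunk: bytes) -> None:
        self.staged += chunk
        self.cpu_data_writes += 1

    def dma_fetch(self) -> None:
        # The interface itself reads the message out of memory.
        addr, length = self.control["SRC_ADDR"], self.control["LENGTH"]
        self.staged += self.memory[addr:addr + length]


def send_programmed_io(nic, dest, addr, length, word=8):
    """First method: the CPU copies the message into the data registers."""
    nic.write_control("DEST_NODE", dest)
    end = addr + length
    for off in range(addr, end, word):
        nic.write_data(bytes(nic.memory[off:min(off + word, end)]))


def send_dma(nic, dest, addr, length):
    """Second method: the CPU writes only the message address and length."""
    nic.write_control("DEST_NODE", dest)
    nic.write_control("SRC_ADDR", addr)
    nic.write_control("LENGTH", length)
    nic.dma_fetch()


memory = bytearray(range(256))
pio_nic, dma_nic = NetworkInterface(memory), NetworkInterface(memory)
send_programmed_io(pio_nic, "12B", 16, 64)   # 8 data-register writes by the CPU
send_dma(dma_nic, "12B", 16, 64)             # 0 data-register writes by the CPU
```

Both calls stage the same 64 bytes; only the amount of CPU involvement differs, which is why the second method leaves the CPU free for other work.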
In a prior art computer, receiving CPU 15B is either interrupted by network interface 18B when a message arrives or CPU 15B continuously polls network interface 18B to detect when a message has arrived. Once CPU 15B learns that a message has arrived, it may write and read the control registers of network interface 18B to determine the size of the received message. CPU 15B can use either of two methods to transfer the received message to memory 17B.
In the first method, CPU 15B reads the message out of the data registers of network interface 18B and copies the message to a message buffer in memory 17B. In the second method, CPU 15B writes the address of a message buffer in receiving memory 17B to the control registers of network interface 18B. Network interface 18B uses a DMA capability to transfer the message to memory 17B. It can be seen that receiving a message requires one or more writes and possibly reads to control registers of network interface 18B and the transfer of the message from network interface 18B to memory 17B.
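The polled, DMA-based receive path described above might be sketched as follows, counting the control register accesses that the receiving CPU performs. The class `ReceivingInterface`, its method names, and the function `poll_and_receive` are hypothetical; a real interface would expose status, size, and buffer-address registers in a device-specific layout.

```python
class ReceivingInterface:
    """Toy model of a receiving network interface (method names are hypothetical)."""

    def __init__(self, memory: bytearray):
        self.memory = memory          # receiving node's memory
        self.pending = []             # messages that have arrived from the network
        self.register_accesses = 0    # control register reads and writes by the CPU

    def deliver(self, message: bytes) -> None:
        self.pending.append(bytes(message))

    def read_status(self) -> bool:
        self.register_accesses += 1
        return len(self.pending) > 0

    def read_size(self) -> int:
        self.register_accesses += 1
        return len(self.pending[0])

    def start_dma(self, buffer_addr: int) -> None:
        # Writing the buffer address triggers the transfer into memory.
        self.register_accesses += 1
        message = self.pending.pop(0)
        self.memory[buffer_addr:buffer_addr + len(message)] = message


def poll_and_receive(nic, buffer_addr, max_polls=1000):
    """Poll for a message, then have the interface place it in memory by DMA."""
    for _ in range(max_polls):
        if nic.read_status():
            size = nic.read_size()
            nic.start_dma(buffer_addr)
            return size
    return None   # nothing arrived within the polling budget


memory = bytearray(64)
nic = ReceivingInterface(memory)
nic.deliver(b"payload")
size = poll_and_receive(nic, buffer_addr=8)
```

Even in this best case, where the message is already waiting, the receive costs three register accesses before any payload reaches memory.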
Until recently, most computer systems used system interconnects consisting of parallel address and data buses (e.g. PCI, PCI-X) to provide communication among CPUs, memory and peripherals. In such interconnects, the address buses typically have 32 or 64 signal lines. The data buses typically have 32, 64, or 128 signal lines. In some lower-performance systems, the address and data buses share the same signal lines. Providing such address and data buses requires the provision of 64 to 192 signal traces on circuit boards between the CPU, memory, and peripherals.
To read a data value from memory or a peripheral, a CPU drives an address value on to the address bus, waits for a short period of time, and reads a data value off the data bus. To write a data value to memory or a peripheral, a CPU simultaneously drives an address value on to the address bus and a data value on to the data bus.
Over the years, performance gains have been achieved by increasing the speed of the address and data buses. However, it is increasingly difficult to operate parallel buses at higher speeds. Signal skew and signal reflections on the various signal lines of the bus and crosstalk between signal lines are limiting the speeds at which parallel buses can be driven. Signal skew results from signals traveling on unequal signal trace lengths, signal interference, etc. Signal reflections result from the presence of imperfectly impedance-matched connectors located part way along the signal lines.
Because the signal lines of traditional buses are used in a half duplex mode to transfer data in both directions, it is necessary to insert wasted clock cycles to allow signals in one direction to die down before the bus is used in the other direction. Many such buses also have a bus mastership component which provides entities on the bus with the ability to request and be granted the privilege of initiating read or write operations on the bus.
In the last few years, parallel address and data buses have been supplanted by serial interconnects and by parallel interconnects having a reduced number of signal lines. Examples of such interconnects are HyperTransport™ (see, for example, HyperTransport I/O Link Specification, available from the HyperTransport Consortium, http://www.hypertransport.org/), RapidIO™ (see, for example, RapidIO Interconnect Specification; RapidIO Interconnect GSM Logical Specification; RapidIO Serial Physical Layer Specification; and, RapidIO System and Device Interoperability Specification, available from the RapidIO Trade Association, http://www.rapidio.org/), and PCI Express™ (see, for example, PCI Express Base Specification; PCI Express Card Electromechanical Specification; and, PCI Express Mini Card Specification, available from PCI-SIG, http://www.pcisig.com/). Such interconnects use fewer signal lines, careful matching of signal line lengths, and other improvements to drive signals further at speeds higher than are readily practical on wide parallel buses. To avoid signal reflections, such interconnects are configured as properly terminated point-to-point links that are not shared. To avoid the delays associated with bus reversal of a half duplex bus, these interconnects use separate signal lines for the two directions of data transfer. Both types of interconnect operate at data rates that exceed 300 MBps (megabytes per second). The serial interconnects use Low Voltage Differential Signaling (LVDS) to achieve higher data rates and reduced electromagnetic interference (EMI).
Because the number of signal lines is typically less than the width of data being transferred, it is not possible to transfer a complete block of data in a single clock cycle. Instead, both types of interconnect package and transfer data in the form of packets.
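A short calculation makes the point concrete. The function name `transfers_needed` and the example widths are assumptions made for illustration, and the sketch counts payload bits only, ignoring packet headers and any encoding overhead.

```python
def transfers_needed(block_bytes: int, link_width_bits: int) -> int:
    """Number of link transfers needed to move a block over a narrow link.

    Uses integer ceiling division; header and encoding overhead are ignored.
    """
    return -(-(block_bytes * 8) // link_width_bits)


# A 64-byte block (512 bits) cannot cross an 8-bit or 16-bit link in one transfer.
over_8_bit_link = transfers_needed(64, 8)     # 64 transfers
over_16_bit_link = transfers_needed(64, 16)   # 32 transfers
```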
The term “packetized interconnect” is used herein to refer collectively to interconnects which package and transfer data in the form of packets. Packetized interconnects may use parallel data paths which have fewer signal lines than a width of data being transferred or serial data paths.
Despite being packetized, packetized interconnects base data transfer on memory-access semantics. “Packetized interconnects” as used herein are distinct from communication links which use packet-based data communication protocols (e.g. TCP/IP) that lack memory access semantics.
Read request packets contain an address and number of bytes to be fetched. Read response packets return the requested data. Write request packets contain an address and data bytes to be written. Write confirmation packets optionally acknowledge the completion of a write. Beyond the basic operations of reading and writing, most packetized interconnects include more advanced operations. These include the atomic read-modify-write operation amongst others. Terminology differs between the various interconnect technologies.
Packetized interconnects use memory address ranges associated with memory and peripherals. Address ranges assigned to peripherals are used to access control and data registers. Unlike parallel buses, packetized interconnects use assigned address ranges to route packets to memory or a peripheral where the read, write, or other operation will be performed.
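The read and write packets and the address-range routing described above might be modeled as follows. This is a minimal sketch rather than the packet format of any particular interconnect: the classes `ReadRequest`, `WriteRequest`, and `Target`, the function `route`, and the address ranges are all hypothetical, and write confirmations are omitted because they are optional.

```python
from dataclasses import dataclass


@dataclass
class ReadRequest:
    address: int
    length: int      # number of bytes to be fetched


@dataclass
class WriteRequest:
    address: int
    data: bytes      # bytes to be written


class Target:
    """Memory or a peripheral that owns one assigned address range."""

    def __init__(self, name: str, base: int, size: int):
        self.name, self.base, self.size = name, base, size
        self.store = bytearray(size)

    def handle(self, pkt):
        off = pkt.address - self.base
        if isinstance(pkt, WriteRequest):
            self.store[off:off + len(pkt.data)] = pkt.data
            return None                                   # no confirmation sent
        return bytes(self.store[off:off + pkt.length])    # read response payload


def route(targets, pkt):
    """Deliver a packet to whichever target owns its address.

    Only the start address is checked; accesses that run past the end of a
    range are not handled in this sketch.
    """
    for target in targets:
        if target.base <= pkt.address < target.base + target.size:
            return target.name, target.handle(pkt)
    raise ValueError(f"address {pkt.address:#x} is not mapped")


targets = [Target("memory", 0x0000, 0x1000), Target("nic", 0x8000, 0x100)]
route(targets, WriteRequest(0x8004, b"\x01\x02"))     # lands in a NIC register
name, data = route(targets, ReadRequest(0x8004, 2))   # read the same register back
```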
Memory of types commonly available does not directly support packetized interconnects. Instead a packetized interconnect terminates at a memory controller which places data from packets received by way of the packetized interconnect into a traditional parallel bus (e.g. SDRAM, DDR, RDRAM bus) for communication to the memory.
Because of the high speeds at which packetized interconnects operate, they are usually restricted to a physical operating region that is not much greater than a few meters in length. A signal propagating over signal lines longer than this length will degrade too much to be useful. As typically used, a packetized interconnect ties together the CPU(s), memory, and peripherals in a single compute node of a multiprocessor computer. Other communication technologies (e.g. Ethernet and TCP/IP, InfiniBand™) are used to communicate between compute nodes.
The inventors have realized that InfiniBand™ and similar technologies have many of the attributes of a packetized interconnect, but can carry data over distances which are somewhat longer (e.g. tens of meters) than can packetized interconnects. InfiniBand™ would be an undesirably complicated protocol to implement directly in a CPU, memory, and peripherals. InfiniBand™ is capable of acting either as a packetized interconnect or as a communication technology between compute nodes.
To send or receive a message in a modern multiprocessor computer that uses a packetized interconnect, the CPU issues read or write request packets to read or write network interface control registers. The network interface returns read response packets and possibly write confirmation packets. Message transfers between memory and the network interface similarly involve the use of packetized read or write operations.
CPU 15A uses a packetized interconnect to pass a message to network interface 18A. CPU 15A typically executes software which includes a driver for network interface 18A. The driver software may at least partially prepare some communication network packet headers (e.g. TCP and IP packet headers) and pass those headers with the application payload in the message. Upon receipt of the message, network interface 18A strips the packetized interconnect packet headers from the message, adds any additional packet headers required by communication network 14 (e.g. an Ethernet header), and may update the communication network packet headers provided by CPU 15A (e.g. compute and write an appropriate checksum in the IP header). At network interface 18B, the process is reversed. Interface 18B inserts some or all of the received communication network packet into a packetized interconnect packet and forwards the packetized interconnect packet to memory 17B.
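The header conversion performed at each network interface might be sketched as follows. The sketch treats headers as opaque byte strings: the 8-byte interconnect header length `IC_HDR_LEN`, the placeholder contents of `ETH_HDR`, and the function names are assumptions made for illustration, and checksum updates are omitted. Only the 14-byte length of an untagged Ethernet header reflects a real format.

```python
IC_HDR_LEN = 8          # hypothetical packetized-interconnect header length, in bytes
ETH_HDR = b"E" * 14     # placeholder standing in for a 14-byte Ethernet header


def nic_transmit(interconnect_packet: bytes) -> bytes:
    """Sending side: strip the interconnect header, add the network header."""
    message = interconnect_packet[IC_HDR_LEN:]   # driver-prepared headers + payload
    return ETH_HDR + message


def nic_receive(network_packet: bytes, dest_addr: int) -> bytes:
    """Receiving side: wrap the received message in an interconnect write packet."""
    message = network_packet[len(ETH_HDR):]
    ic_header = dest_addr.to_bytes(IC_HDR_LEN, "big")   # destination in memory 17B
    return ic_header + message


message = b"<ip+tcp headers>" + b"application payload"
outgoing = (0).to_bytes(IC_HDR_LEN, "big") + message
on_the_wire = nic_transmit(outgoing)
delivered = nic_receive(on_the_wire, dest_addr=0x2000)
```

Each direction involves removing one header and constructing another, which is the conversion work referred to in the discussion of latency.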
In high performance computing, communication latency is such a concern that it is important to reduce latency as much as possible. End to end latencies are typically 1 to 5 microseconds in modern high performance computers. Reducing end to end latency by 50-100 nanoseconds has a measurable impact on application performance. The inventors have discovered that control register operations and the conversion of messages between packetized interconnect packets and communication network packets can cause delays of these magnitudes.
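The scale of such a saving follows directly from the figures above. The function name `saving_fraction` is hypothetical; the inputs are the ranges stated in the preceding paragraph.

```python
def saving_fraction(saving_ns: float, end_to_end_us: float) -> float:
    """Fraction of the end to end latency removed by a given saving."""
    return saving_ns / (end_to_end_us * 1000.0)


smallest = saving_fraction(50, 5)    # 50 ns out of 5 microseconds: about 1%
largest = saving_fraction(100, 1)    # 100 ns out of 1 microsecond: about 10%
```

A saving of 50 to 100 nanoseconds therefore corresponds to roughly 1% to 10% of a 1 to 5 microsecond end to end latency.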
There is a need to provide computer systems which achieve low-latency communications between compute nodes.