Multiprocessor computer systems comprise a number of processing element nodes connected together by an interconnect network. Typically, each processing element node includes at least one processor, a local memory, and an interface circuit connecting the processing element node to the interconnect network. The interconnect network is used for transmitting packets of information or messages between the processing element nodes.
Distributed shared memory multiprocessor systems include a number of processing element nodes which share a distributed memory and are located within a single machine. By increasing the number of processing element nodes, or the number of processors within each node, such systems can often be scaled to handle increased demand. In such a system, each processor can directly access all of memory, including its own local memory and the memory of the other (remote) processing element nodes. Typically, the virtual address used for all memory accesses within a distributed shared memory multiprocessor system is translated to a physical address in the requesting processor's translation-lookaside buffer (“TLB”). Thus, the requesting processor's TLB will need to contain address translation information for all of the memory that the processor accesses within the machine, which includes both local and remote memory. This amount of address translation information can be substantial, and can result in much duplication of translation information throughout the multiprocessor system (e.g., if the same page of memory is accessed by 64 different processors, the TLB used by each processor will need to contain an entry for that page).
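The duplication described above can be illustrated with a short worked example. This is a minimal sketch under simplifying assumptions: it counts only the translations that must be present across all processor TLBs and ignores TLB capacity and eviction. The names `total_tlb_entries` and `distinct_pages` are hypothetical and chosen for illustration only.

```python
def total_tlb_entries(pages_by_processor):
    """Count translations needed across all per-processor TLBs.

    Each processor needs its own entry for every page it accesses,
    so a page shared by N processors is counted N times.
    """
    return sum(len(pages) for pages in pages_by_processor.values())


def distinct_pages(pages_by_processor):
    """Count the pages actually in use, each counted once."""
    return len(set().union(*pages_by_processor.values()))


# The case from the text: the same page accessed by 64 processors.
SHARED_PAGE = 0x2A  # hypothetical virtual page number
pages_by_processor = {cpu: {SHARED_PAGE} for cpu in range(64)}

entries = total_tlb_entries(pages_by_processor)
pages = distinct_pages(pages_by_processor)
```

In this example one page of memory requires 64 separate TLB entries, one in each processor that accesses it.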
Some multiprocessor systems employ block transfer engines to transfer blocks of data from one area of memory to another area of memory. Block transfer engines provide several advantages, such as asynchronous operation (i.e., by operating without further processor involvement after being initially started by the processor, block transfer engines free up the processor to perform other tasks) and faster transfer performance than could be achieved by the processor (e.g., since block transfer engines do not use processor-generated cacheable references, there is less overhead on the coherence protocol of the read-modify-write cycle, and cache blowouts can be avoided).
Unfortunately, existing block transfer engines suffer from problems that limit their utility. For example, since address translations are performed in on-chip TLBs at the requesting processors, external block transfer engines cannot be programmed using virtual addresses. Instead, with existing block transfer engines, user software makes an operating system (OS) call to inform the OS that it wants to transfer a particular length of data from a particular source (specified by its virtual address) to a particular destination (also specified by its virtual address). In response, the OS first checks whether it has address translations for all of the virtual addresses, and then generates a separate block-transfer request for each physical page. If the virtual address range spans 15 physical pages, for instance, the OS may have to generate 15 separate queued block-transfer requests to cause 15 separate physical transfers. The overhead of such OS intervention means that much of the advantage of performing the block transfer in the first place is lost.
Clustered multiprocessor systems include collections of processing machines, with each processing machine including a single processor system or distributed shared memory multiprocessor system. Clustering advantageously limits the scaling required of a single OS, and provides fault containment if one of the machines should suffer a hardware or OS error. In a clustered system, however, memory accesses to remote machines are typically performed via a network interface I/O device that requires OS intervention to send messages, and can target only specific memory buffers that were reserved for this communication at the remote machine. Thus, memory must be specifically “registered” by a user process on the remote machine, which prevents the memory on the remote machine from being accessed arbitrarily. Also, state must be set up on the remote machine to direct the incoming data, or the OS on the remote machine must intervene to handle the data, copying the data at least once. More recently, some network interface cards have been designed to support user-level communication using the VIA, ST or similar “OS bypass” interface. Such approaches, while successful in avoiding OS intervention on communication events, do not unify local and remote memory accesses. Thus, programs must use different access mechanisms for intra-machine and inter-machine communication.
Thus, there is a need for a node translation mechanism for communicating over virtual channels in a clustered system that supports user-level communications without the need for OS intervention on communication events. There is also a need for a node translation mechanism that unifies local and remote memory accesses, thus allowing user programs to use the same access mechanisms for both intra-machine and inter-machine communications. Such a mechanism would allow communication with other nodes in a local machine to be handled in the same way as communications with nodes in remote machines. There is also a need for a node translation mechanism which supports low overhead communications in scalable, distributed memory applications that seamlessly span machine boundaries, provides protection, and supports remote address translation.