The present invention is directed to an efficient mechanism for message passing by avoiding the use of unnecessary message copies so as to enable zero copy transport. More particularly, the present invention is directed to the use of an interface mechanism which efficiently implements zero copy transport protocols. A protocol with two separate implementations is included which efficiently maps to two different network adapters with varying capabilities on parallel or distributed computer systems like the RS/6000 SP.
DMA (Direct Memory Access): Refers to a mechanism by which hardware engines (also known as DMA engines) are programmed to move data across a system without the CPU (central processing unit) being used in the movement of data. For instance, a DMA engine could be used to move data to/from system memory from/to a peripheral device such as a network adapter. This mechanism helps offload the CPU from the movement of data thus freeing it up for other work. In general, DMA refers to those techniques employed in computer systems in which certain clock cycles on the system bus are used by the memory system. This is generally known by those skilled in the art as “cycle stealing.”
Zero Copy transfers: This refers to the mechanism where a user buffer is directly transported to an intended target (such as a network device or a target buffer) by eliminating the use of the CPU to effect the transport. Typically, zero copy transfers use DMA engines which avoid staging the transfers through any intermediate buffers on the node. Note that zero copy transfers do not have to be symmetric with respect to the source and target buffers. For instance, one could have a zero copy transfer on the source, where data is moved out of a user buffer into the network, while the receiver uses the CPU to stage data through intermediate buffers before moving it to the target buffer, or vice versa. However, from a CPU utilization perspective, it is preferable to avoid copies both on the source and on the target.
Reliable Transport: Refers to a transport mechanism where the transport protocol guarantees that messages submitted for sending are received by the target systems transparently to the application and where recovery from transient network failures is provided (including transient network adapter failures). This is typically accomplished in the art by ensuring that every packet sent is acknowledged by the receiver, and that the sender retransmits the packet if an acknowledgment is not received within a well defined interval of time. The interval of time is a function of the efficiency parameters of the system (node, processor, network, etc.) and is known as the retransmit timer interval.
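The acknowledgment and retransmit behavior described above can be illustrated with a minimal sketch. It is a simulation only, not the protocol of the present invention: the names LossyLink and send_reliably, the fixed retransmit interval, and the retry limit are all assumptions introduced for illustration, and time is modeled as a simple counter rather than a real timer.

```python
from dataclasses import dataclass, field


@dataclass
class LossyLink:
    """Hypothetical link that loses the transmissions whose ordinal is in `drop`."""
    drop: set
    sent: int = 0
    delivered: list = field(default_factory=list)

    def transmit(self, seq, payload):
        """Return True if an acknowledgment comes back for this transmission."""
        self.sent += 1
        if self.sent in self.drop:
            return False  # packet (or its acknowledgment) was lost
        if all(seq != s for s, _ in self.delivered):
            self.delivered.append((seq, payload))  # receiver ignores duplicates
        return True


def send_reliably(link, packets, retransmit_interval=1.0, max_tries=5):
    """Send packets in order, retransmitting whenever no acknowledgment arrives.

    Returns the simulated time spent waiting for retransmit timers to expire.
    """
    waited = 0.0
    for seq, payload in enumerate(packets):
        for _ in range(max_tries):
            if link.transmit(seq, payload):
                break  # acknowledged, move on to the next packet
            waited += retransmit_interval  # timer expired, send again
        else:
            raise TimeoutError(f"packet {seq} unacknowledged after {max_tries} tries")
    return waited
```

In this sketch a transient loss costs one retransmit interval per lost transmission, which is why the interval is normally tuned to the node, processor, and network characteristics mentioned above.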
Pinning Buffers: In virtual memory systems, the operating system is able to change the physical location of a buffer (especially if it is moved out to disk and moved back to memory). DMA engines typically work using physical/real addresses. In order to ensure that the physical addresses do not change (for instance to ensure that DMA engines refer to the same logical buffer intended by the initiation of the DMA operation) operating systems provide “pinning” services to ensure that physical pages corresponding to the buffers in consideration are marked as non-pageable and thus to maintain the same real memory addresses until the buffer is unpinned.
Mapping Buffers: This operation refers to the establishment of mapping host buffers for use by a DMA engine. The operation typically consists of loading physical addresses corresponding to buffers in consideration onto a network device so that the network device may move incoming data associated with the buffers directly into and out of the physical memory addresses specified by the address map.
Posting Buffers: This operation is used to make an association between a communication layer tag and a user buffer (which has been mapped). The tag is usable by any entity (for instance the network adapter) to determine the buffer into which an incoming message should be moved. Note that the mapping operation is a purely local operation and determines how a user buffer is accessed by a DMA engine. The post operation assumes that the communicating agents have agreed to use a specific tag value to refer to a specific user buffer.
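The distinction between mapping (a local operation) and posting (an association agreed between the communicating agents) can be sketched as follows. The Adapter class and its method names are hypothetical and stand in for a network device; the slice assignment stands in for the DMA transfer that real hardware would perform without the CPU.

```python
class Adapter:
    """Hypothetical network device holding an address map and a tag table."""

    def __init__(self):
        self.mapped = {}  # handle -> writable view of a mapped user buffer
        self.posted = {}  # tag -> handle of the buffer posted under that tag

    def map_buffer(self, buf):
        """Local operation: make a user buffer addressable by the device."""
        handle = len(self.mapped)
        self.mapped[handle] = memoryview(buf)
        return handle

    def post(self, tag, handle):
        """Associate a tag with a buffer that has already been mapped."""
        if handle not in self.mapped:
            raise KeyError("buffer must be mapped before it is posted")
        if tag in self.posted:
            raise ValueError("tag is already in use")
        self.posted[tag] = handle

    def deliver(self, tag, offset, data):
        """Place incoming data directly into the buffer posted under `tag`.

        Returns False, leaving user memory untouched, if no buffer matches
        the tag or the data would run past the end of the buffer.
        """
        handle = self.posted.get(tag)
        if handle is None:
            return False
        view = self.mapped[handle]
        if offset < 0 or offset + len(data) > len(view):
            return False
        view[offset:offset + len(data)] = data
        return True
```

A receiver would map its buffer once, post it under the agreed tag, and then find the incoming data already in place; data arriving with an unknown tag is refused rather than written anywhere.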
Parallel and distributed computer systems are becoming increasingly powerful in terms of rapid increases in the speed of each CPU (along with the number of CPUs on each node), and the speed of the network which interconnects the various nodes on the system. However, the memory bandwidth (the rate at which the CPU can move data from one location in memory to another) has not kept pace with the improvements in CPU and interconnect speeds. The protocol overhead in moving data from one node to another node in the system is therefore increasingly dominated by the CPU copy cost of moving data (typically from a user buffer into a pre-pinned and pre-mapped network buffer which is used in a FIFO (First In First Out) fashion). This is becoming especially true since the size of databases and file systems has increased considerably, causing the size of the data blocks which need to be transported to increase as well. It is therefore highly desirable for continued scaling of capacity in such systems to ensure that CPU overhead in the movement of data is eliminated. Offloading the CPU from data movement allows the CPU to be freed up to process other workloads and thereby increases the capacity of the entire system.
The present invention is employable in any circumstance that involves data transport which can be offloaded. Further, the source and target systems do not have to be symmetric (for instance the source can be a compute node and the target could be a storage node). Additionally, the present invention supports asymmetric zero copy transfers. For instance, one could employ zero copy transfer on the source, where data is moved out of a user buffer into the network, while the receiver uses the CPU to stage data through intermediate buffers before moving it to the target buffer, or vice versa. However, it is preferable to employ zero copy transfer on both the source and target systems.
Copy overhead is eliminated in the present invention through the use of a reliable copy-avoiding message passing protocol. Support for copy avoidance includes active messages [U.S. Pat. No. 6,038,604], where the target memory is not known. Communication software is invoked so that the CPU is not engaged in staging the message through intermediate buffers in the communication software. The interfaces of the present invention allow the user to prepare (pin and map) the source/target buffer and enable the network device (adapter) to DMA (direct memory access) directly from the user buffer into the network. On the target/receiving side, the receiver posts user buffers to the network device so as to allow the network device to transfer, via a DMA mechanism, incoming data directly into the user target buffer. Matching of incoming messages to the appropriate buffer is done based on an algorithmically generated unique tag returned by the post. The posted tag has to be unique to ensure that the incoming data is moved to the intended target buffer. In order to make certain that the tags are unique, the communication software algorithmically generates tags that depend on the message number, the source of the message, the packet within the message and an index where the buffer mapping is found. Note that the posting of the buffer may architecturally occur before the arrival of a message into the network device or after the arrival of a message into the network device. However, from an implementation standpoint, whether posting is done before or after message arrival depends on the adapter design. The present communication protocols address both these scenarios. Reliability of the transfer is also provided through an acknowledgment based flow control and retransmit mechanism, which further includes several safety mechanisms to ensure that the state maintenance required to accomplish a reliable zero copy transport is efficient.
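One way a tag could be derived from the four quantities named above (source, message number, packet within the message, and mapping index) is to pack them into disjoint bit fields. The sketch below is illustrative only: the field order, the field widths, and the names make_tag and split_tag are assumptions and are not taken from the present description.

```python
# Assumed field layout, most significant field first (widths are illustrative).
FIELDS = (("source", 12), ("message", 16), ("packet", 12), ("index", 8))


def make_tag(source, message, packet, index):
    """Pack the four identifying values into one integer tag."""
    tag = 0
    for (name, width), value in zip(FIELDS, (source, message, packet, index)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        tag = (tag << width) | value
    return tag


def split_tag(tag):
    """Recover the four values from a tag produced by make_tag."""
    values = {}
    for name, width in reversed(FIELDS):
        values[name] = tag & ((1 << width) - 1)
        tag >>= width
    return values
```

Because each value occupies its own bit range, two tags are equal only if all four values are equal. Under this assumed layout, uniqueness holds only while the message number does not wrap around within the lifetime of a posted buffer.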
In the present invention there are two implementations both of which enable efficient zero copy transfer of messages from the memory of one machine into the memory of another machine. These mechanisms include (a) establishment of an association between the user buffer and an algorithmically generated unique tag name on the network device for matching purposes, (b) management of message fragmentation at the source, (c) transport of the message, and (d) reassembly of the message at the target in network memory without staging through an intermediate host buffer.
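Steps (b) through (d) can be sketched as follows, under the assumption that each packet carries the offset at which its payload belongs in the target buffer. The function names fragment and reassemble and the fixed packet size are hypothetical. The slice assignment in reassemble stands in for the DMA transfer into the target buffer; in this sketch it is performed by the interpreter.

```python
def fragment(message, packet_size):
    """Split a message into (offset, payload) packets without copying it.

    Each payload is a view onto the original message, not a new buffer.
    """
    view = memoryview(message)
    return [(off, view[off:off + packet_size])
            for off in range(0, len(view), packet_size)]


def reassemble(target, packets):
    """Place each payload at its offset in the target buffer.

    Packets may arrive in any order; no intermediate staging buffer is used.
    Returns the number of bytes placed.
    """
    out = memoryview(target)
    placed = 0
    for offset, payload in packets:
        out[offset:offset + len(payload)] = payload
        placed += len(payload)
    return placed
```

Since every packet is self-describing, the target needs no per-message reordering queue: a packet that arrives early is simply written to its final location.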
Another key aspect of the present invention is that communication is performed in a reliable fashion over a possibly unreliable network via an acknowledgment/retransmit protocol (i.e. messages submitted to be transferred are guaranteed to reach the target even in the presence of transient network failures (adapter or switch interconnect) as long as the source/target do not fail and there is some communication path available between the corresponding network devices).
This aspect of the present invention is more particularly illustrated in patent application Ser. No. (PO-9-2000-0116) which is being submitted concurrently with the present application and which is assigned to the same assignee as the present invention.
Also, the present invention offers the choice of indicating if the buffer preparation/setup (pinning and mapping) is static or dynamic. It allows implementations to offer appropriate services based on user requirements. The choice of static vs. dynamic depends on multiple factors such as the size of the buffer, available resources to prepare the buffer, the time period the buffer is to be held, the “cost” of preparing the buffer each time as compared to the “cost” of preparing the buffer just once (and also as compared to the “cost” of copying with respect to the cost of preparing the buffer). The “cost” of preparing a buffer for zero copy transfer includes the cost of pinning the buffer and mapping its corresponding translation table onto a network device.
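The trade-off among static preparation, dynamic preparation, and plain copying can be made concrete with a simple cost model. The model below is an assumption made for illustration and is not part of the present description: it charges a fixed pin-and-map cost per page, a fixed copy cost per byte, and optionally rules out static preparation when the buffer would exceed a budget of pinned memory. All parameter names are hypothetical.

```python
def choose_strategy(size_bytes, transfers, pin_map_cost_per_page,
                    copy_cost_per_byte, page_size=4096,
                    pinned_budget_bytes=None):
    """Return (cheapest strategy, cost of each candidate strategy).

    "copy"    stages every transfer through a pre-pinned network buffer.
    "dynamic" pins and maps the user buffer anew for every transfer.
    "static"  pins and maps the user buffer once and keeps it prepared.
    """
    pages = -(-size_bytes // page_size)  # ceiling division
    prepare = pages * pin_map_cost_per_page
    costs = {
        "copy": transfers * size_bytes * copy_cost_per_byte,
        "dynamic": transfers * prepare,
        "static": prepare,
    }
    # Static preparation holds the pages pinned for the buffer's lifetime,
    # so it is excluded when that would exceed the pinned memory budget.
    if pinned_budget_bytes is not None and pages * page_size > pinned_budget_bytes:
        del costs["static"]
    return min(costs, key=costs.get), costs
```

Under this model a small buffer reused many times favors static preparation, while a large buffer used once favors dynamic preparation when pinned memory is scarce, and very small messages may be cheapest to copy.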
Accordingly, it is an object of the present invention to define an interface architecture that lends itself to efficient zero copy transport implementation.
It is a further object of this invention to provide a method for transmitting data directly from an address space of one process into the address space of another process via direct memory access and with zero copy transfer.
It is a further object of the present invention to ensure extremely rapid data transfer rates between systems.
It is also an object of the present invention to ensure that any and all direct memory access occurring from a different system occurs correctly without corrupting data.
It is also an object of the present invention to provide a mechanism to algorithmically generate a unique tag to be associated with the physical page addresses that constitute the buffer.
It is still a further object of the present invention to ensure address space integrity in a receiving data processing system by ensuring that a unique tag is matched with the posted setup before any data is transferred into user memory.
It is yet another object of the present invention to take full advantage of direct memory access procedures and techniques available on network adapters.
It is also an object of the present invention to ensure that data is not transmitted from one system to another without the second system being fully prepared for its receipt and in particular fully prepared for DMA operations to users' buffers.
It is still a further object of the present invention to enable users running applications in their own address spaces on one data processing system to be able to transfer data accurately and efficiently into a user's address space in another data processing system whether or not that data processing system is remote or in fact contained within the same physical package or frame.
It is also an object of the present invention to facilitate the protocol that is known as reliable zero copy transport.
It is yet another object of the present invention to ensure that network device memory is efficiently used and that data structures exist and are organized to minimize the code and data footprint used on network devices.
It is yet another object of the present invention to ensure that the tag table stored in network adapter memory is never overrun and that flow control is managed to guarantee progress without deadlocking.
It is yet another object of the present invention to ensure that the zero copy protocol provides hooks by which the application is notified on completion of communication events via the use of callback functions.
It is yet another object of the present invention to ensure that the zero copy protocol provides for both pull and push models of communication.
It is yet another object of the present invention to minimize control traffic across nodes employing zero copy transfers when there is special hardware on the network device.
It is yet another object of the present invention to allow dynamic pinning and mapping of user buffers based on when they are to be used for transport so as to ensure that the amount of memory pinned at any given time is small, if not even minimal.
It is yet another object of the present invention to ensure that the CPU utilization for protocol processing to effect data transfer is minimized so that the CPU can be freed up to do other work.
It is yet another object of this invention to ensure that communication bandwidth is effectively utilized.
Lastly, but not limited hereto, it is an object of the present invention to establish a data communication protocol that takes full advantage of direct memory access capabilities.
The recitation herein of a list of desirable objects which are met by various embodiments of the present invention is not meant to imply or suggest that any or all of these objects are present as essential features, either individually or collectively, in the most general embodiment of the present invention or in any of its more specific embodiments.