The present invention is directed to an efficient method and system for transmitting messages from a user's address space on one system directly into a user's address space in a second system using direct memory access. In addition, the present invention is also directed to a data transmission protocol to eliminate unnecessary retransmission of data packets.
The present invention is employed in a number of different circumstances. It is employed in data processing systems which are remote from one another and which communicate by means of data packet transmission from a source system to a receiving system. Additionally, the present invention is employed in SAN (system area network) systems, which represent nodes or clusters of processors packaged as a single unit (frame/rack in a frame). The receiving application, running on a possibly different physical system, may belong to the same user as the sending application or to a different user.
The present application is not, however, directed to the usual protocols for message transmission of data from one system to another. In particular, the present invention is specifically directed to protocols that utilize direct memory access (DMA) hardware and techniques for directly (zero copy) transferring information from the address space of a user's application running on one system directly to the address space of another or the same user running on a second system. Direct memory access provides the most efficient mechanism for transferring messages and information into specific memory locations of another process. In particular, direct memory access avoids passage of data through central processing units, which makes it an exceedingly fast mode of operation. When DMA is used to go directly to a user buffer rather than a system buffer in a message passing system, it is referred to as a zero-copy protocol.
The zero-copy protocol requires that the receiving system be fully prepared to receive the amount of data sent and that it have a mechanism for identifying the exact location for data storage. Since the transmission contemplated in the present invention is directly into the memory locations of another processing system, the right amount of data must be supplied to exactly the correct memory locations. If this is not the case, data may be lost or corrupted, and this could conceivably be data belonging to a different user than the one who is transmitting the message packets. Clearly, corruption or loss of data in this manner is an unacceptable operating condition.
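The preparedness requirement described above can be illustrated with a minimal sketch. It assumes a software stand-in for the receiving adapter; the names `ReceiveAdapter`, `register`, and `deliver` are hypothetical and are not taken from the present application. The sketch only shows the two checks the text calls for: a known target location, and an exact match between the amount of data and the prepared buffer.

```python
class ReceiveAdapter:
    """Hypothetical software model of a receiving adapter (illustration only)."""

    def __init__(self):
        # tag -> writable view of the user buffer prepared for that tag
        self._tag_table = {}

    def register(self, tag, user_buffer):
        """Receiver prepares a buffer and associates it with a tag."""
        self._tag_table[tag] = memoryview(user_buffer)

    def deliver(self, tag, payload):
        """Write payload into the prepared buffer; refuse if not prepared."""
        target = self._tag_table.get(tag)
        if target is None:
            return False  # no prepared location for this tag: drop, write nothing
        if len(payload) != len(target):
            return False  # wrong amount of data: drop, write nothing
        target[:] = payload  # the only write lands in the registered buffer
        return True
```

In this sketch a packet that fails either check is dropped without touching memory, which is the behavior that keeps one user's data from overwriting another's.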
Because data transmission via DMA procedures is directed to specified real memory locations which are not statically associated with fixed virtual memory locations, error conditions which arise can be particularly difficult to handle. In particular, if the sender never receives an acknowledgment from the receiver that a particular data packet has been received, it is undesirable for the sender simply to resend the same packet. If the sender were to wait for a given period of time (an elapsed “time out” amount) without receiving an acknowledgment and then retransmit, the retransmission could result in a wasted transmission and/or the insertion of incorrect data into inappropriate memory locations (or real memory which is not associated with the intended virtual memory target location). Accordingly, in accordance with the present invention the sender negotiates retransmission with the receiver. This avoids unnecessary transmission of data, particularly in the event that the only error that has occurred is a loss of the acknowledgment on its return from the receiver to the sender. Such circumstances do not warrant the retransmission of another data packet but rather only require an indication that the data has indeed been received. Nonetheless, if there is a more significant problem than the mere loss of an acknowledgment returned to the sender, the sender must ensure that the receiver has prepared the zero copy buffers for DMA access prior to retransmitting the packet.
In clustered systems (SANs), the processing power of each node in the cluster (CPU speed) is increasing very fast, and so are the speeds of the interconnects linking the various nodes in the cluster. However, the memory bandwidth (a function of how fast the CPU can move data from one region of its memory to another) is not keeping pace with the CPU speeds and the interconnect speeds. As a result, the cost of protocol processing in reliable messaging systems is increasingly dominated (bottlenecked) by the copy cost in the protocol path. The copy cost also increases with the size of the message being transported. In clustered systems, with the emergence of technologies like clustered file systems, the size of data that needs to be transported from one node to another has been consistently increasing, hence the need for zero copy protocols. In order to reliably transport data in a zero copy fashion, acknowledgments are used to guarantee delivery. It is wasteful to have to retransmit large data packets if only the acknowledgment was lost. Hence the motivation for this invention, in which a small control message is used to negotiate with the receiver and establish whether a retransmit of the zero copy transported packet is actually required. We limit ourselves to the design of an efficient retransmission mechanism for a reliable zero copy transport mechanism.
Reliable Transport: A transport mechanism where the transport protocol guarantees that messages submitted to be sent will be received by the target, recovering from transient network and network adapter failures transparently to the application. This is typically accomplished in the art by ensuring that every packet sent is acknowledged by the receiver and that the sender retransmits the packet if an acknowledgment is not received within a well-defined interval of time. The interval of time is a function of the efficiency parameters of the system (node, processor, network, etc.).
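The conventional scheme in this definition, in which the sender resends the whole packet whenever the timer expires without an acknowledgment, can be sketched as follows. This is a simplified simulation under stated assumptions: the function name `send_with_timeout_retransmit` is hypothetical, and the sequence `ack_arrives` stands in for waiting out the timeout on each attempt rather than using a real timer or network.

```python
def send_with_timeout_retransmit(payload, ack_arrives, max_attempts=5):
    """Conventional timeout-driven retransmission (simulation only).

    ack_arrives: one boolean per attempt, True if the acknowledgment for that
    attempt reaches the sender before the timeout expires (an assumption that
    replaces a real timer).
    Returns (delivered_and_acknowledged, total_payload_bytes_sent).
    """
    bytes_sent = 0
    for _, acked in zip(range(max_attempts), ack_arrives):
        bytes_sent += len(payload)  # the full packet goes out on every attempt
        if acked:
            return True, bytes_sent
    return False, bytes_sent
```

For example, `send_with_timeout_retransmit(b"x" * 1000, [False, True])` sends 2000 payload bytes even if the first packet was in fact delivered and only its acknowledgment was lost, since the sender cannot tell the two cases apart.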
Zero Copy Transport: A mechanism for message passing where the DMA (direct memory access) engines (possibly on the network adapter connecting the node to the network) are programmed to move data directly from system (node) memory into the network on the sending side and from the network directly into system memory on the receiving side, without the involvement of the CPU (central processing unit) on the node in the movement of data at either end. This mechanism frees up the CPU on the node from the data movement aspects of protocol processing. This is also sometimes loosely referred to as the Direct Memory Access method.
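The difference between a staged receive path and a zero copy receive path can be illustrated with the following sketch. It is only a software analogy: real zero copy transport relies on DMA engines and does not involve the CPU, whereas here ordinary memory writes stand in for data movement. The function names and the returned write counts are assumptions made for illustration.

```python
def receive_with_staging(wire_data, system_buffer, user_buffer):
    """Staged path: data lands in a system buffer, then is copied to the user buffer."""
    n = len(wire_data)
    memoryview(system_buffer)[:n] = wire_data                       # write 1: into system buffer
    memoryview(user_buffer)[:n] = memoryview(system_buffer)[:n]     # write 2: extra copy
    return 2  # number of times the payload was written to memory


def receive_zero_copy(wire_data, user_buffer):
    """Zero copy path: data is placed directly in the user buffer."""
    n = len(wire_data)
    memoryview(user_buffer)[:n] = wire_data                         # single write
    return 1  # number of times the payload was written to memory
```

The second write in the staged path is the copy cost that grows with message size; the zero copy path removes it by making the user buffer the first and only destination.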
The present application also hereby incorporates by reference the entire contents of application Ser. No. 09/619,053 filed concurrently herewith.
In accordance with a preferred embodiment of the present invention, a method for transmitting a data packet stored in a first data processing system directly into a list of addresses in a second data processing system comprises a plurality of steps, starting with providing the data packet in the first processor (sender) with a header which includes a tag which is associatable with a real address (possibly a list of real addresses) within the second processor (receiver). This data packet is transmitted with its header, via an adapter coupled to the sender, to the network adapter which is coupled to the receiver. This network adapter is provided with the mapping between the tag in the header and a real address (or possibly a list of real addresses) within the memory of the receiving system. Data in this data packet is transferred from the adapters to real address locations in the memory of the second system via direct memory access (DMA) (i.e., by programming the DMA engines, typically on the network adapter, to effect the movement of data). An acknowledgment is then transmitted back to the sender indicating that successful receipt of the data packet has occurred. If the first process (or system, the sender) does not detect that an acknowledgment has been received, it transmits to the receiving process (or system) a data packet which includes a retransmit flag bit which is set so as to indicate the sender's willingness to resend the data packet. Upon receipt of the data packet with the retransmit flag bit set, the second process (or system) does one of two things. If the second system detects that the data had already been received and that it had already sent an acknowledgment, then it is only necessary that the second system resend a data packet with a header indicating acknowledgment of the original receipt (this is the case when the original acknowledgment packet was lost in the network). Otherwise the receiving system never received the zero copy packet.
In that case, and only in that case, does the second process (or system) transmit to the sending first process a request for retransmission. Most importantly, this request for retransmission is not sent by the second process until it has established that tag association in the adapter can still take place with respect to the data packet which is anticipated to be received a second time from the sending process. Thus, in accordance with the present invention, a lost acknowledgment from the second process does not necessarily result in the retransmission of the same data. Furthermore, retransmission of data now occurs only at a time when DMA transfer is possible. This ensures that most of the bandwidth between processors is used only for necessary communications. It further assures that the retransmission of the data packet is going to be successful and, in particular, that data will be written into the correct address space and into the correct memory locations within that address space.
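The negotiated retransmission described above can be sketched as a small simulation. It is a minimal illustration under stated assumptions, not the implementation of the preferred embodiment: the names `Receiver`, `Reply`, `send_reliably`, `on_data`, and `on_retransmit_query` are hypothetical, direct method calls stand in for the network, and the `ack_lost` and `data_lost` parameters are test switches that simulate the two failure cases.

```python
from dataclasses import dataclass, field
from enum import Enum


class Reply(Enum):
    ACK = "ack"
    REQUEST_RETRANSMIT = "request_retransmit"


@dataclass
class Receiver:
    buffers: dict = field(default_factory=dict)   # tag -> buffer prepared for DMA
    completed: set = field(default_factory=set)   # tags whose data has arrived

    def prepare(self, tag, size):
        """Establish the tag association (buffer ready for DMA)."""
        self.buffers[tag] = bytearray(size)

    def on_data(self, tag, payload):
        """Zero copy packet arrives; returns ACK, or None if it must be dropped."""
        buf = self.buffers.get(tag)
        if buf is None or len(buf) != len(payload):
            return None
        buf[:] = payload
        self.completed.add(tag)
        return Reply.ACK

    def on_retransmit_query(self, tag, size):
        """Small control packet with the retransmit flag set arrives."""
        if tag in self.completed:
            return Reply.ACK  # data was received; only the acknowledgment was lost
        if tag not in self.buffers:
            self.prepare(tag, size)  # re-establish the tag association first
        return Reply.REQUEST_RETRANSMIT


def send_reliably(receiver, tag, payload, ack_lost=False, data_lost=False):
    """Returns the number of payload bytes put on the (simulated) wire."""
    payload_bytes_sent = len(payload)
    reply = None if data_lost else receiver.on_data(tag, payload)
    if ack_lost:
        reply = None  # the acknowledgment never reaches the sender
    if reply is Reply.ACK:
        return payload_bytes_sent
    # Timeout: negotiate with a small control message instead of resending data.
    reply = receiver.on_retransmit_query(tag, len(payload))
    if reply is Reply.REQUEST_RETRANSMIT:
        receiver.on_data(tag, payload)
        payload_bytes_sent += len(payload)
    return payload_bytes_sent
```

In this sketch a lost acknowledgment costs one small control exchange and no second copy of the payload, while a lost data packet is resent only after the receiver has confirmed that a buffer is associated with the tag.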
Accordingly, it is an object of the present invention to provide a method for transmitting data directly from an address space of one process into the address space of another process via direct memory access (zero copy).
It is a further object of the present invention to ensure extremely rapid data transfer rates between systems.
It is also an object of the present invention to ensure that any and all direct memory access occurring from a different system occurs correctly without corrupting data.
It is a still further object of the present invention to ensure address space integrity in a receiving data processing system.
It is also an object of the present invention to minimize retransmission of data packets, so that the same zero copy data packet is never received twice by the receiver.
It is yet another object of the present invention to take full advantage of direct memory access procedures and techniques typically available on the network adapters.
It is also an object of the present invention to ensure that data is not transmitted from one system to another without the second system being fully prepared for its receipt and in particular fully prepared for DMA operations to user's buffers.
It is a still further object of the present invention to most efficiently handle the problems associated with lost acknowledgment transmissions.
It is a still further object of the present invention to enable users running applications in their own address spaces on one data processing system to be able to transfer data accurately and efficiently into a user's address space in another data processing system whether or not that data processing system is remote or in fact contained within the same physical package or frame.
It is also an object of the present invention to facilitate the protocol that is known as reliable zero copy transport.
It is yet another object of the present invention to ensure that memory locations where data is to be received in a receiving process are fully available for the intended data transfer when retransmission of data is undertaken.
Lastly, but not limited hereto, it is an object of the present invention to establish a data communication protocol that takes full advantage of direct memory access capabilities.
The recitation herein of a list of desirable objects which are met by various embodiments of the present invention is not meant to imply or suggest that any or all of these objects are present as essential features, either individually or collectively, in the most general embodiment of the present invention or in any of its more specific embodiments.