1. Field of the Invention
The present invention relates to a method, system, and program for managing data transmission through a network.
2. Description of Related Art
In a network environment, a network adaptor on a host computer, such as an Ethernet controller, Fibre Channel controller, etc., will receive Input/Output (I/O) requests or responses to I/O requests initiated from the host. Often, the host computer operating system includes a device driver to communicate with the network adaptor hardware to manage I/O requests to transmit over a network. The host computer further includes a transport protocol driver which packages data to be transmitted over the network into packets, each of which contains a destination address as well as a portion of the data to be transmitted. Data packets received at the network adaptor are often stored in an available allocated packet buffer in the host memory. The transport protocol driver processes the packets received by the network adaptor that are stored in the packet buffer, and accesses any I/O commands or data embedded in the packet.
For instance, the transport protocol driver may implement the Transmission Control Protocol (TCP) and Internet Protocol (IP) to encode and address data for transmission, and to decode and access the payload data in the TCP/IP packets received at the network adaptor. IP specifies the format of packets, also called datagrams, and the addressing scheme. TCP is a higher level protocol which establishes a connection between a destination and a source. A still higher level protocol, Remote Direct Memory Access (RDMA), establishes a higher level connection and permits, among other operations, direct placement of data at a specified memory location at the destination.
A device driver can utilize significant host processor resources to handle network transmission requests to the network adaptor. One technique to reduce the load on the host processor is the use of a TCP/IP Offload Engine (TOE) in which TCP/IP protocol related operations are implemented in the network adaptor hardware as opposed to the device driver, thereby saving the host processor from having to perform some or all of the TCP/IP protocol related operations. The transport protocol operations include packaging data in a TCP/IP packet with a checksum and other information, and unpacking a TCP/IP packet received from over the network to access the payload or data.
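By way of illustration only, the checksum step mentioned above may be sketched as follows. The sketch assumes the 16-bit ones' complement checksum used by TCP and IP; the function name internet_checksum is hypothetical, and the sketch omits the pseudo-header that a real TCP checksum also covers.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum over data (illustrative helper)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(payload)

# A receiver recomputes the checksum over payload plus checksum field;
# a result of zero indicates that no error was detected.
verified = internet_checksum(payload + checksum.to_bytes(2, "big")) == 0
```

Performing this per-packet arithmetic in the network adaptor rather than in the host is one example of the work that an offload engine may take over.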
FIG. 1 illustrates a stream 10 of TCP/IP packets which are being sent from a source host to a destination host in a TCP connection. In the TCP protocol as specified in the industry accepted TCP RFC (request for comment), each packet is assigned a unique sequence number. As each packet is successfully sent to the destination host, an acknowledgment is sent by the destination host to the source host, notifying the source host by packet sequence number of the successful receipt of that packet. Accordingly, the stream 10 includes a portion 12 of packets which have been both sent and acknowledged as received by the destination host. The stream 10 further includes a portion 14 of packets which have been sent by the source host but have not yet been acknowledged as received by the destination host. The source host maintains a TCP Unacknowledged Data Pointer 16 which points to the sequence number of the first unacknowledged sent packet. The TCP Unacknowledged Data Pointer 16 is stored in a field 17a, 17b . . . 17n (FIG. 2) of a Protocol Control Block 18a, 18b . . . 18n, each of which is used to initiate and maintain one of a plurality of associated TCP connections between the source host and one or more destination hosts.
The capacity of the packet buffer used to store data packets received at the destination host is generally limited in size. In accordance with the TCP protocol, the destination host advertises how much buffer space it has available by sending a value referred to herein as a TCP Window indicated at 20 in FIG. 1. Accordingly, the source host uses the TCP Window value to limit the number of outstanding packets sent to the destination host, that is, the number of sent packets for which the source host has not yet received an acknowledgment. The TCP Window value for each TCP connection is stored in a field 21a, 21b . . . 21n of the Protocol Control Block 18a, 18b . . . 18n which controls the associated TCP connection.
For example, if the destination host sends a TCP Window value of 128 KB (kilobytes) for a particular TCP connection, the source host will, according to the TCP protocol, limit the amount of data it sends over that TCP connection to 128 KB until it receives an acknowledgment from the destination host that it has received some or all of the data. If the destination host acknowledges that it has received the entire 128 KB, the source host can send another 128 KB. On the other hand, if the destination host acknowledges receiving only 96 KB, for example, the source host will send only an additional 96 KB over that TCP connection until it receives further acknowledgments.
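The arithmetic of the foregoing example may be sketched as follows. The helper name sendable_bytes is hypothetical, and the sketch assumes that the advertised TCP Window value remains at 128 KB throughout.

```python
KB = 1024  # bytes per kilobyte


def sendable_bytes(window: int, outstanding: int) -> int:
    """Return how many more bytes may be sent before the window is full."""
    return max(window - outstanding, 0)


window = 128 * KB       # TCP Window value advertised by the destination host
outstanding = 128 * KB  # the source host has already sent a full window
blocked = sendable_bytes(window, outstanding)    # nothing more may be sent

outstanding -= 96 * KB  # the destination host acknowledges 96 KB
after_ack = sendable_bytes(window, outstanding)  # 96 KB may now be sent
```

The remaining 32 KB of sent data stays outstanding, which is why only 96 KB, rather than a full 128 KB, becomes available.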
A TCP Next Data Pointer 22, stored in a field 23a, 23b . . . 23n of the associated Protocol Control Block 18a, 18b . . . 18n, points to the sequence number of the next packet to be sent to the destination host. A portion 24 of the stream 10 between the TCP Next Data Pointer 22 and the end 28 of the TCP Window 20 represents packets which have not yet been sent but are permitted to be sent under the TCP protocol without waiting for any additional acknowledgments because these packets are still within the TCP Window 20 as shown in FIG. 1. A portion 26 of the stream 10, which is outside the end boundary 28 of the TCP Window 20, is not permitted to be sent under the TCP protocol until additional acknowledgments are received.
As the destination host sends acknowledgments to the source host, the TCP Unacknowledged Data Pointer 16 moves to indicate the acknowledgment of additional packets for that connection. The beginning boundary 30 of the TCP Window 20 shifts with the TCP Unacknowledged Data Pointer 16, and the TCP Window end boundary 28 shifts with it, so that additional packets may be sent for the connection.
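The relationship among the TCP Unacknowledged Data Pointer 16, the TCP Next Data Pointer 22 and the TCP Window 20 may be sketched as follows. This is a minimal illustration only: the class and field names are hypothetical, sequence numbers are counted per packet as in the description above, and sequence number wraparound is ignored.

```python
from dataclasses import dataclass


@dataclass
class ProtocolControlBlock:
    """Per-connection send state (illustrative field names)."""
    unacked_ptr: int  # TCP Unacknowledged Data Pointer 16 (window start 30)
    next_ptr: int     # TCP Next Data Pointer 22
    window: int       # TCP Window 20 advertised by the destination host

    @property
    def window_end(self) -> int:
        # End boundary 28 follows the start boundary 30.
        return self.unacked_ptr + self.window

    def sendable(self) -> int:
        # Portion 24: not yet sent, but still inside the window.
        return max(self.window_end - self.next_ptr, 0)

    def send(self, count: int) -> int:
        # Send at most what the window permits; return the number sent.
        granted = min(count, self.sendable())
        self.next_ptr += granted
        return granted

    def acknowledge(self, ack_seq: int) -> None:
        # Advance the window start only for packets actually sent.
        if self.unacked_ptr < ack_seq <= self.next_ptr:
            self.unacked_ptr = ack_seq


pcb = ProtocolControlBlock(unacked_ptr=0, next_ptr=0, window=8)
sent = pcb.send(10)             # only 8 packets fit inside the window
stalled = pcb.sendable()        # nothing more may be sent
pcb.acknowledge(3)              # packets 0 through 2 are acknowledged
reopened = pcb.sendable()       # the window has slid forward by 3
```

In this sketch, each acknowledgment moves both window boundaries by the same amount, so the number of newly sendable packets equals the number of newly acknowledged packets.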
FIG. 3 illustrates a plurality of RDMA connections 50a, 50b . . . 50n, between various software applications of a source host and various storage locations of one or more destination hosts through a network. Each RDMA connection 50a, 50b . . . 50n runs over a TCP connection. In the RDMA protocol, as defined in the industry accepted RDMA RFC (request for comment), each RDMA connection 50a, 50b . . . 50n includes a queue pair 51a, 51b . . . 51n comprising a queue 52a, 52b . . . 52n which is created by a software application which intends to send messages to be stored at a specified memory location of a destination host. Each application queue 52a, 52b . . . 52n stores the messages to be sent by the associated software application. The size of each queue 52a, 52b . . . 52n may be quite large or relatively small, depending upon the number of messages to be sent by the associated application.
Each queue pair 51a, 51b . . . 51n of each RDMA connection 50a, 50b . . . 50n further includes a network interface queue 60a, 60b . . . 60n which is paired with the associated application queue 52a, 52b . . . 52n of the software applications. The network interface 62 includes various hardware, typically a network interface card, and various software including drivers which are executed by the host. The network interface may also include various offload engines to perform protocol operations.
In response to a request from an application to send messages to be stored at a specified destination host memory address at the other end of one of the RDMA connections 50a, 50b . . . 50n, a network interface 62 obtains a message credit designated an “empty message” from a common pool 64 of empty messages. The size of the pool 64, that is, the number of messages which the network interface 62 can handle, is typically a function of the hardware capabilities of the network interface 62. If an empty message is available from the pool 64, a message is taken from the application queue 52a, 52b . . . 52n of the requesting application and queued in the corresponding network interface queue 60a, 60b . . . 60n of the queue pair 51a, 51b . . . 51n of the particular RDMA connection 50a, 50b . . . 50n. The messages queued in the network interface queues 60a, 60b . . . 60n are sent over the network to the specified memory addresses of the destination hosts which acknowledge each message which is successfully received and stored at the specified memory address. Messages sent but not yet acknowledged are referred to herein as “uncompleted sent messages.” Once a message is acknowledged as successfully received and stored by the destination host, an empty message is restored or replenished in the pool 64 of empty messages.
In accordance with the RDMA protocol, the total number of messages queued in all the network interface queues 60a, 60b . . . 60n plus the total number of uncompleted messages sent by all the RDMA connections 50a, 50b . . . 50n typically is not permitted to exceed the size of the pool 64 of empty messages. Once one of the RDMA connections 50a, 50b . . . 50n reaches the limit imposed by the pool 64 of empty messages, no more RDMA messages from any of the connections 50a, 50b . . . 50n may be queued and sent until additional acknowledgments are received.
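The shared use of the pool 64 of empty messages by the queue pairs 51a, 51b . . . 51n may be sketched as follows. The class and method names are hypothetical, and the sketch models only the accounting of message credits, not the transmission of data over a network.

```python
from collections import deque


class EmptyMessagePool:
    """Common pool 64 of message credits shared by all connections."""

    def __init__(self, size: int) -> None:
        self.available = size

    def take(self) -> bool:
        if self.available == 0:
            return False
        self.available -= 1
        return True

    def restore(self) -> None:
        self.available += 1


class QueuePair:
    """Application queue 52 paired with network interface queue 60."""

    def __init__(self, pool: EmptyMessagePool) -> None:
        self.pool = pool
        self.application_queue = deque()
        self.interface_queue = deque()
        self.uncompleted = 0  # sent but not yet acknowledged

    def post(self, message: str) -> None:
        self.application_queue.append(message)

    def queue_for_send(self) -> bool:
        # A message moves to the interface queue only if a credit is free.
        if not self.application_queue or not self.pool.take():
            return False
        self.interface_queue.append(self.application_queue.popleft())
        return True

    def transmit(self) -> None:
        if self.interface_queue:
            self.interface_queue.popleft()
            self.uncompleted += 1

    def complete(self) -> None:
        # An acknowledgment replenishes one empty message in the pool.
        if self.uncompleted:
            self.uncompleted -= 1
            self.pool.restore()


pool = EmptyMessagePool(size=2)
conn_a, conn_b = QueuePair(pool), QueuePair(pool)
for text in ("a1", "a2", "a3"):
    conn_a.post(text)
conn_b.post("b1")

first = conn_a.queue_for_send()    # takes one credit
second = conn_a.queue_for_send()   # takes the last credit
blocked = conn_b.queue_for_send()  # pool exhausted by the other connection

conn_a.transmit()
conn_a.complete()                  # acknowledgment restores one credit
resumed = conn_b.queue_for_send()  # the second connection may now proceed
```

In this sketch, a single connection with a long application queue consumes every credit, and the other connection cannot queue any message until an acknowledgment is received.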
Notwithstanding, there is a continued need in the art to improve the performance of connections.