1. Technical Field
The present invention relates generally to the field of computer systems and, more specifically, to a data processing system, method, and product for managing data transfers in a network.
2. Description of Related Art
Many existing computer systems use a shared-bus architecture, such as Peripheral Component Interconnect (PCI), as a means of transmitting data internally within the computer system among the system's various processors and I/O devices. These existing shared-bus architectures have not kept pace with the increase in the performance of typical processors. Thus, a new architecture, commonly called “Infiniband”, has been developed for transmitting data among processors and I/O devices internally within a computer system. This new architecture is capable of providing greater bandwidth and increased expandability.
The new architecture provides a system-area network which includes a channel-based, switched-fabric technology. In such a system-area network (SAN), data is transmitted via messages which are made up of packets. Each device, whether it is a processor or I/O device, includes a channel adapter. The messages are transmitted from one device's channel adapter to another device's channel adapter via switches. Each channel adapter may also be referred to as an “end node”.
FIG. 1 depicts two end nodes, each including a queue pair, in accordance with the prior art. When end node A 100 needs to transmit data to end node B 108, a logical connection is established between a queue pair included in end node A and a queue pair included within end node B. Data is then transmitted from the send queue of the queue pair in end node A to the receive queue of the queue pair in end node B. Responses are transmitted from the send queue of the queue pair in end node B to the receive queue of the queue pair in end node A. End node A 100 includes a queue pair 102. Queue pair 102 includes a send queue 104 and a receive queue 106. End node B 108 includes a queue pair 110. Queue pair 110 includes a receive queue 112 and a send queue 114. Requests are sent from send queues to receive queues and responses are sent from receive queues back to send queues. Request 116 is acknowledged by response 118. Request 120 is acknowledged by response 122.
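The queue pair arrangement of FIG. 1 can be pictured with a small data model. The following is only an illustrative sketch: the class names `QueuePair` and `EndNode` and the helper `transmit` are hypothetical and do not correspond to any real channel adapter interface. It models a single request moving from a send queue in end node A to a receive queue in end node B.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueuePair:
    """Hypothetical model of a queue pair: one send queue and one receive queue."""
    send_queue: deque = field(default_factory=deque)
    receive_queue: deque = field(default_factory=deque)


@dataclass
class EndNode:
    """Hypothetical end node that owns a single queue pair."""
    name: str
    qp: QueuePair = field(default_factory=QueuePair)


def transmit(src: EndNode, dst: EndNode) -> None:
    """Move the oldest message from src's send queue to dst's receive queue."""
    dst.qp.receive_queue.append(src.qp.send_queue.popleft())


node_a = EndNode("A")
node_b = EndNode("B")
node_a.qp.send_queue.append("request 116")
transmit(node_a, node_b)
```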
Messages, and thus packets, may be transmitted utilizing one of five different transport types: Reliable Connected (RC), Reliable Datagram (RD), Unreliable Connected (UC), Unreliable Datagram (UD), or Raw Datagram (RawD). When the Reliable Connected transport type is used, sequence numbers are included in each packet, and packet transfers are acknowledged.
Starting sequence numbers are established when a logical connection is established between two end points. Each time a packet is transmitted, the sequence number is incremented and included within the packet. Thus, a packet's sequence number is used to identify the position of the packet within a sequence of packets.
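The numbering scheme described above can be sketched as follows. This is a minimal illustration, not an actual transport implementation: the function name `packetize` and the `mtu` parameter (the maximum payload carried per packet) are assumptions introduced here, and counter wrap-around is ignored for brevity.

```python
def packetize(message: bytes, start_psn: int, mtu: int):
    """Split a message into (psn, payload) packets with consecutive PSNs.

    Returns the list of packets and the PSN to use for the next packet.
    """
    packets = []
    psn = start_psn
    for offset in range(0, len(message), mtu):
        packets.append((psn, message[offset:offset + mtu]))
        psn += 1  # the sequence number advances once per transmitted packet
    return packets, psn


packets, next_psn = packetize(b"abcdefghij", start_psn=7, mtu=4)
```

Each packet's PSN identifies its position in the sequence, so a receiver that knows the starting sequence number can tell where in the sequence a given packet belongs.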
In the prior art, a particular set of bits, or a field, is included in each packet to indicate the sequence number. Thus, the sequence number is this entire set of bits.
The packet sequence number (PSN) that is included in request 116 is the same as the PSN that is included in response 118. The PSN that is included in request 120 is the same as the PSN that is included in response 122. The PSN included in request 116 and response 118 has no relationship to the PSN included in request 120 and response 122 even though they are all using the same set of queue pairs.
Normally, the requester node increments the PSN by one in each request packet transmitted. The responder node compares the PSN in each received request to its own expected PSN, which the responder also increments by one each time a request packet is received. If the PSNs match, the responder may then send a response to the request (an acknowledgment) using the same PSN that was included in the request packet being acknowledged. Back at the requester, the PSN in the response packet is compared to the requester's own expected response PSN, which the requester also increments by one each time a response packet is received.
The requester is allowed to send multiple packets without receiving a response packet. The response packets may be received by the requester some time later, and the PSNs in these response packets are then compared to the requester's response PSN counter. If all of the request packet PSNs match the responder's internal PSN and all of the response packet PSNs match the requester's internal PSN, all of the packets have been successfully transferred from one end node to another (from a send queue to a receive queue).
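The normal, error-free exchange described above can be sketched with three counters: the requester's next request PSN, the requester's expected response PSN, and the responder's expected PSN. The class and method names below are hypothetical, and wrap-around of the counters is omitted to keep the example short.

```python
class Responder:
    def __init__(self, start_psn: int):
        self.expected_psn = start_psn

    def receive(self, request_psn: int):
        """Return the acknowledgment PSN, or None if the PSN does not match."""
        if request_psn != self.expected_psn:
            return None
        self.expected_psn += 1
        return request_psn  # the acknowledgment carries the request's PSN


class Requester:
    def __init__(self, start_psn: int):
        self.next_psn = start_psn
        self.expected_response_psn = start_psn

    def send(self) -> int:
        """Return the PSN placed in the next request packet."""
        psn = self.next_psn
        self.next_psn += 1
        return psn

    def receive_response(self, response_psn) -> bool:
        """Return True if the response PSN matches the expected response PSN."""
        if response_psn != self.expected_response_psn:
            return False
        self.expected_response_psn += 1
        return True


requester = Requester(start_psn=10)
responder = Responder(start_psn=10)

# Three requests are sent before any response is examined.
outstanding = [requester.send() for _ in range(3)]
acks = [responder.receive(psn) for psn in outstanding]
all_ok = all(requester.receive_response(psn) for psn in acks)
```

In this sketch, `all_ok` is true only when every request PSN matched at the responder and every response PSN matched at the requester.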
There are two abnormal conditions that must be detected and resolved at the responder to ensure reliable operation. The first condition is the duplicated packet, and the second condition is the invalid packet.
Duplicated packets are detected at the responder when the requester sends a request packet more than once. The requester will send packets more than once when it detects that the packet may have been lost. FIG. 2 illustrates a ladder diagram which depicts the transmission of duplicate packets in accordance with the prior art. Request packet 204, which includes a PSN=1, is transmitted by end node 200 and is received by the responder, end node 202. The response, acknowledgment 206, which includes a PSN=1, is either lost or delayed. In this case the requester, end node 200, detects a time-out condition and resends the same request as request 208, which includes the same PSN (PSN=1). The responder, end node 202, determines that the PSN is a duplicate (i.e., it has a PSN ‘earlier’ than the internal count of end node 202), and the responder sends the response again as acknowledgment 210 with the same PSN (PSN=1).
An invalid packet is detected at the responder when the responder receives a packet with a PSN ‘ahead’ of its internal count. FIG. 3 illustrates a ladder diagram which depicts the receipt of an invalid packet in accordance with the prior art. The requester, end node 300, transmits a request 304 which includes a PSN=1, a request 308 which includes a PSN=2, and a request 310 which includes a PSN=3. Request 304 is properly acknowledged by acknowledgment 306, which includes a PSN=1. Request 308, which includes a PSN=2, is lost in the fabric. Thus, the responder, end node 302, sees request 304, having PSN=1, followed by request 310, having PSN=3. Request 310 is therefore an invalid packet. In this case, the responder, end node 302, resends the acknowledgment for the request packet with PSN=1 as acknowledgment 312, and the requester resends all packets starting with request 314 having PSN=2.
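The responder behavior in the duplicate and invalid cases of FIG. 2 and FIG. 3 can be summarized in a short sketch. The names `classify` and `Responder` are hypothetical, the tuple `("ack", psn)` is merely a stand-in for an acknowledgment packet, and wrap-around of the PSN counter is deliberately ignored here.

```python
def classify(psn: int, expected: int) -> str:
    """Classify a request PSN relative to the responder's internal count."""
    if psn == expected:
        return "valid"
    return "duplicate" if psn < expected else "invalid"


class Responder:
    def __init__(self, expected_psn: int):
        self.expected_psn = expected_psn
        self.last_acked = None

    def receive(self, psn: int):
        kind = classify(psn, self.expected_psn)
        if kind == "valid":
            self.expected_psn += 1
            self.last_acked = psn
            return ("ack", psn)
        if kind == "duplicate":
            # Resend the acknowledgment with the same PSN as the request.
            return ("ack", psn)
        # Invalid: resend the acknowledgment for the last good request.
        return ("ack", self.last_acked)


# FIG. 2: the request with PSN=1 arrives twice.
fig2 = Responder(expected_psn=1)
fig2_acks = [fig2.receive(psn) for psn in (1, 1)]

# FIG. 3: PSN=2 is lost, so PSN=3 arrives early; PSN=2 and PSN=3 are then resent.
fig3 = Responder(expected_psn=1)
fig3_acks = [fig3.receive(psn) for psn in (1, 3, 2, 3)]
```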
At the requester, similar rules apply to response packets. A duplicate response detected by the requester is discarded. This case can only occur when a request packet is not lost but is only delayed in the fabric long enough for the requester to resend it; the second response, which carries the duplicate PSN, is discarded. An invalid PSN can occur at the requester when one or more packets in a multiple-packet response are lost in the fabric. In this case, the requester resends the request.
PSNs occupy a fixed, finite number of bits in the transport header that is included in each packet. Therefore, PSNs are continually reused as the counters generating them wrap from their maximum value back to zero. By using a PSN space that is much larger than the number of packets that may be outstanding, requesters and responders can treat the PSNs on either side of the expected PSN as distinct duplicate and invalid ranges.
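Because the counters wrap, ‘earlier’ and ‘ahead’ have to be judged modulo the size of the PSN space. The sketch below shows one way this might be done. The 24-bit width and the even split of the space into a duplicate half and an invalid half are assumptions made for illustration; the text above states neither value.

```python
PSN_BITS = 24                # assumed field width, for illustration only
PSN_MOD = 1 << PSN_BITS      # number of distinct PSN values
HALF = PSN_MOD // 2          # assumed boundary between the two ranges


def next_psn(psn: int) -> int:
    """Advance a PSN counter, wrapping from the maximum value back to zero."""
    return (psn + 1) % PSN_MOD


def classify(psn: int, expected: int) -> str:
    """Classify a PSN relative to the expected PSN using modular distance."""
    delta = (psn - expected) % PSN_MOD
    if delta == 0:
        return "valid"
    # A small forward distance means the packet is ahead of the expected PSN;
    # a large one means it is actually behind the expected PSN.
    return "invalid" if delta < HALF else "duplicate"


wrapped = next_psn(PSN_MOD - 1)
just_behind = classify(PSN_MOD - 1, expected=0)
just_ahead = classify(0, expected=PSN_MOD - 1)
```

With this comparison, a PSN of zero received when the expected PSN is the maximum value is treated as one step ahead rather than far behind, which is what keeps the two ranges meaningful across a wrap.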
A problem arises with PSNs when a logical connection between two end nodes is terminated (torn down) and then reestablished while packets are in flight. In this case, a packet from the old, stale connection may arrive at the responder. The responder may interpret this packet as a valid packet when it is actually a stale packet from the old connection.
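The stale packet hazard can be reproduced in a few lines. In this sketch the `Packet` and `Responder` names are hypothetical, and the starting PSNs are chosen deliberately so that the delayed packet's PSN coincides with the expected PSN of the new connection; a PSN check alone then cannot tell the two connections apart.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    psn: int
    payload: str


class Responder:
    def __init__(self):
        self.expected_psn = 0
        self.delivered = []

    def connect(self, start_psn: int) -> None:
        """Establish (or reestablish) a connection with a starting PSN."""
        self.expected_psn = start_psn

    def receive(self, packet: Packet) -> bool:
        """Accept the packet only if its PSN equals the expected PSN."""
        if packet.psn != self.expected_psn:
            return False
        self.delivered.append(packet.payload)
        self.expected_psn += 1
        return True


responder = Responder()

# Old connection: one packet is delivered, a second is delayed in the fabric.
responder.connect(start_psn=100)
responder.receive(Packet(psn=100, payload="old packet 1"))
in_flight = Packet(psn=101, payload="old packet 2")

# The connection is torn down and reestablished between the same queue pairs.
responder.connect(start_psn=101)

# The delayed packet from the old connection now arrives and is accepted.
accepted = responder.receive(in_flight)
```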
One of the solutions described in the prior art is to add wait states between tearing down a logical connection established between two particular sets of queue pairs and then reestablishing the logical connection between these same two sets of queue pairs. Thus, the end nodes wait long enough for all possible stale packets from the old connection to expire. Although this solution does solve the problem, it can significantly affect the end nodes' performance, especially when connections are often torn down and then reestablished.
Therefore, a need exists for a method, system, and product for efficiently managing data transfers in a network.