1. Field of the Invention
The present invention relates generally to data transfers in data processing network systems, and in particular to transfer of data blocks over the Internet and similar networks. Still more particularly, the present invention relates to improved TCP network communications during retransmission.
2. Description of the Related Art
The Internet has become an important conduit for transmission and distribution of data (text, code, image, video, audio, or mixed) and software. Users connect to the backbone with broadly divergent levels of performance, ranging from 14.4 Kb/s to more than 45 Mb/s. Moreover, Transmission Control Protocol/Internet Protocol (TCP/IP) has become a widely implemented standard communication protocol in Internet and intranet technology, enabling broad heterogeneity between clients, servers, and the communications systems coupling them. Transmission Control Protocol (TCP) is the transport layer protocol and Internet Protocol (IP) is the network layer protocol. TCP builds a connection-oriented transport-level service to provide guaranteed, sequential delivery of a byte stream between two IP hosts. Application data is sent to TCP, broken into segments ordered by sequence numbers, and packetized into TCP packets before being sent to IP. IP provides a “datagram” delivery service at the network level. Reliability in data transmission can be compromised by three events: data loss, data corruption, and reordering of data.
Data loss is managed in TCP/IP by a time-out mechanism. TCP maintains a timer (retransmission timer) to measure the delay in receiving an acknowledgment (ACK) of a transmitted segment from the receiver. When an ACK does not arrive within an estimated time interval (retransmission time-out (RTO)), the corresponding segment is assumed to be lost and is retransmitted. Further, because TCP is traditionally based on the premise that packet loss is an indication of network congestion, TCP will back off its transmission rate by entering “slow-start,” thereby drastically decreasing its congestion window to one segment.
TCP manages data corruption by verifying a checksum on segments as they arrive at the receiver. The TCP sender computes the checksum over the packet data and places this 2-byte value in the TCP header. The checksum algorithm is a 16-bit one's complement of the one's complement sum of all 16-bit words in the TCP header and data. The receiver computes the checksum on the received data (excluding the 2-byte checksum field in the TCP header) and verifies that it matches the checksum value in the header. The checksum computation also covers a 12-byte pseudo header that contains information from the IP header (a 4-byte “src ip” address, a 4-byte “dest ip” address, a 2-byte payload length, a 1-byte protocol field, and a 1-byte zero pad).
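By way of illustration only, the checksum procedure described above may be sketched as follows. The function names, port numbers, and header field values are hypothetical and chosen solely for the example; the sketch assumes a 20-byte TCP header without options, with the checksum field at byte offsets 16 and 17.

```python
import struct

TCP_PROTOCOL = 6  # IP protocol number for TCP


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's complement sum; odd-length input is padded with a zero byte."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def tcp_checksum(src_ip: bytes, dst_ip: bytes, segment: bytes) -> int:
    """Checksum over the 12-byte pseudo header plus the TCP header and data.

    The checksum field inside `segment` must be zero when this is called.
    """
    # Pseudo header: src (4) + dst (4) + zero (1) + protocol (1) + length (2)
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, TCP_PROTOCOL, len(segment))
    return ~ones_complement_sum(pseudo + segment) & 0xFFFF


def build_segment(src_ip: bytes, dst_ip: bytes, payload: bytes) -> bytes:
    """Sender side: build a minimal segment (hypothetical field values)."""
    header = struct.pack("!HHIIBBHHH", 1234, 80, 0, 0, 5 << 4, 0x18, 65535, 0, 0)
    csum = tcp_checksum(src_ip, dst_ip, header + payload)
    return header[:16] + struct.pack("!H", csum) + header[18:] + payload


def verify_segment(src_ip: bytes, dst_ip: bytes, segment: bytes) -> bool:
    """Receiver side: recompute with the checksum field excluded and compare."""
    (received,) = struct.unpack("!H", segment[16:18])
    zeroed = segment[:16] + b"\x00\x00" + segment[18:]
    return tcp_checksum(src_ip, dst_ip, zeroed) == received
```

In this sketch the receiver zeroes the checksum field before recomputing, which corresponds to excluding that field from the computation as described above.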
TCP manages reordering of data or out-of-order arrival of segments by maintaining a reassembly queue that queues incoming packets until they are rearranged in sequence. Only when data in this queue gets in sequence is it moved to the user's receive buffer where it can be seen by the user. When the receiver observes a “hole” in the sequence numbers of packets received, the receiver generates a duplicate acknowledgement (DACK) for every “out-of-order” packet it receives. Until the missing packet is received, each received data packet with a higher sequence number is considered to be “out-of-order” and will cause a DACK to be generated.
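By way of illustration only, the receiver behavior described above may be sketched as follows. The function name and the representation of a segment as a (sequence number, length) pair are hypothetical. The sketch assumes that segments never partially overlap and that the receiver sends one cumulative ACK per arriving segment.

```python
def receiver_acks(arrivals, first_seq=0):
    """Return the cumulative ACK number sent in response to each arrival.

    arrivals: list of (seq, length) pairs in the order they reach the receiver.
    A repeated ACK number in the result corresponds to a duplicate
    acknowledgement (DACK) caused by an out-of-order segment.
    """
    expected = first_seq      # next in-sequence byte the receiver needs
    reassembly_queue = {}     # seq -> length of segments held out of order
    acks = []
    for seq, length in arrivals:
        if seq == expected:
            expected += length
            # Drain queued segments that are now in sequence.
            while expected in reassembly_queue:
                expected += reassembly_queue.pop(expected)
        elif seq > expected:
            # A hole exists below this segment: queue it.
            reassembly_queue[seq] = length
        # Segments below `expected` are old duplicates and are ignored.
        acks.append(expected)
    return acks


# Four 100-byte segments; the second one arrives last.
acks = receiver_acks([(0, 100), (200, 100), (300, 100), (100, 100)])
# acks == [100, 100, 100, 400]
```

In this example the two repeated values of 100 are the DACKs, and the final value of 400 acknowledges all of the queued data together once the hole is filled.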
Packet reordering is a common occurrence in TCP networks given the prevalence of parallel links and other causes of packet reordering. For instance, on Ether-channel® provided by Cisco Systems, Inc., a number of real adapters are aggregated to form a logical adapter, whereby packet reordering is commonly caused when packets are sent in parallel over these multiple adapters. In TCP, any data packets following one that has been lost or reordered are queued at the receiver until the missing packet arrives. The receiver then acknowledges all the queued packets together.
Because TCP will wrongly infer that network congestion has caused a packet loss after the sender receives a few DACKs, some TCP implementations have adopted a “fast retransmit and recovery” algorithm to improve network performance in the event packet reordering occurs. The “fast retransmit and recovery” algorithm is generally intended to improve TCP throughput by avoiding a time-out, because a time-out results in the dramatic reduction of the congestion window to one segment. Instead of waiting for the time-out, fast retransmit resends the presumed-lost segment and cuts the congestion window in half.
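By way of illustration only, the difference between the time-out response and the fast retransmit response may be sketched as follows. The class and method names are hypothetical, and the threshold of three DACKs is an assumption commonly used in TCP implementations rather than a value stated above. The sketch is deliberately simplified and omits the slow-start threshold and window inflation that actual implementations maintain.

```python
DUP_ACK_THRESHOLD = 3  # assumption: "a few DACKs" taken to mean three


class SenderWindow:
    """Simplified congestion window bookkeeping for a TCP sender."""

    def __init__(self, mss: int, cwnd: int):
        self.mss = mss
        self.cwnd = cwnd          # congestion window in bytes
        self.last_ack = None
        self.dup_acks = 0

    def on_ack(self, ack: int):
        """Process an ACK; return "fast_retransmit" when the threshold is hit."""
        if ack == self.last_ack:
            self.dup_acks += 1
            if self.dup_acks == DUP_ACK_THRESHOLD:
                # Halve the window instead of collapsing it to one segment.
                self.cwnd = max(self.cwnd // 2, self.mss)
                return "fast_retransmit"
        else:
            self.last_ack = ack
            self.dup_acks = 0
        return None

    def on_timeout(self):
        """Retransmission time-out: fall back to one segment (slow-start)."""
        self.cwnd = self.mss
```

For a sender with a 16000-byte window and a 1000-byte MSS, three DACKs leave a window of 8000 bytes in this sketch, whereas a time-out leaves a window of 1000 bytes.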
Although fast retransmit does provide some protection against throughput reduction caused by congestion control mechanisms, multiple packet losses can have a catastrophic effect on TCP throughput. TCP is generally a cumulative acknowledgment scheme in which received segments not at the edge of the receive window are not explicitly acknowledged. This forces the sender to either wait for a round trip time to find out about each lost packet, or to unnecessarily retransmit segments that have been correctly received. Selective Acknowledgement (SACK) is a TCP mechanism devised to overcome this problem. SACK permits the data receiver to inform the sender about all segments that have arrived successfully, so the sender need retransmit only the segments that have actually been lost. Moreover, a single SACK can report multiple noncontiguous blocks of received data, from which the sender can identify multiple blocks of missing data.
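By way of illustration only, the manner in which a sender may derive the missing blocks from a SACK can be sketched as follows. The function name and the representation of SACK blocks as half-open (left, right) byte ranges are hypothetical. The sketch assumes that all data up to `highest_sent` that is neither cumulatively acknowledged nor SACKed is treated as missing.

```python
def missing_blocks(cum_ack, sack_blocks, highest_sent):
    """Return [start, end) byte ranges not covered by the ACK or any SACK block.

    cum_ack: cumulative acknowledgment number (next byte the receiver expects).
    sack_blocks: list of (left, right) ranges the receiver reports as received.
    highest_sent: one past the last byte the sender has transmitted.
    """
    holes = []
    cursor = cum_ack
    for left, right in sorted(sack_blocks):
        if left > cursor:
            holes.append((cursor, left))   # gap below this SACK block
        cursor = max(cursor, right)
    if cursor < highest_sent:
        holes.append((cursor, highest_sent))  # unacknowledged tail
    return holes


# Receiver has everything below 1000 plus two isolated blocks.
holes = missing_blocks(1000, [(2000, 3000), (4000, 5000)], 5000)
# holes == [(1000, 2000), (3000, 4000)]
```

Only the two ranges returned here would need to be retransmitted in this example; the SACKed ranges are skipped.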
While SACK has been effective in reducing unnecessary retransmissions, its use creates its own inefficiencies. For example, the packets of the original transmission that were lost may be much smaller than the Maximum Segment Size (MSS) of the TCP connection (for instance, when TCP's Nagle algorithm for data coalescing at the transport layer is turned off by the application), and the holes created by the missing packets may be small (i.e., the packets dropped are not contiguous). In that case, the retransmissions will be sent with less data than the MSS, because a TCP segment can contain only contiguous data. Thus, where multiple noncontiguous packets are to be retransmitted in response to a SACK, the TCP payload of each retransmission will contain only one of the small packets, leaving the remaining portion of the MSS unused, even though the sender has data exceeding the MSS to resend. This forces the sender to send multiple IP packets with under-utilized payload space in response to SACKs. As will be appreciated, this negatively impacts performance by increasing network traffic and IP/TCP processing at both the sender and receiver.
This problem is demonstrated in the example shown in FIGS. 9A and 9B. FIG. 9A shows a series of contiguous packets A1-A4 transmitted from a sender to a receiver. In this exemplary system, each of the packets of the original transmission is 4096 bytes, but the MSS for the TCP connection is 60 K bytes, so the receiver receives all four packets in a single IP packet. Now suppose that, due to data loss in the connection, the receiver has not received packets A2 and A4, and consequently has SACKed packets A1 and A3. The sender will send a first IP packet containing a payload B1, which is the retransmission of A2. Although the MSS is 60 K bytes, the packet includes only the 4096 bytes of contiguous data contained in A2. The sender then sends a second IP packet containing a payload B2 of 4096 bytes, which is the retransmission of packet A4. Thus, even though the MSS is 60 K bytes for each IP packet, two IP packets must be transmitted to retransmit 8192 bytes in response to the SACK. It would therefore be desirable to reduce this negative impact on throughput when performing retransmissions in response to a SACK.
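By way of illustration only, the arithmetic of the foregoing example may be worked through as follows. The function name and the byte offsets assigned to packets A1-A4 are hypothetical, and the sketch assumes that “60 K bytes” means 60 × 1024 = 61440 bytes and that packet A1 begins at byte offset 0.

```python
MSS = 60 * 1024          # assumption: "60 K bytes" read as 61440 bytes
PACKET_SIZE = 4096       # size of each original packet A1-A4


def retransmission_segments(holes, mss):
    """Split each hole into segments of at most `mss` bytes.

    A TCP segment can carry only contiguous data, so each hole is
    retransmitted separately. Returns a list of (start, size) pairs.
    """
    segments = []
    for start, end in holes:
        while start < end:
            size = min(mss, end - start)
            segments.append((start, size))
            start += size
    return segments


# A2 occupies bytes [4096, 8192) and A4 occupies bytes [12288, 16384).
holes = [(1 * PACKET_SIZE, 2 * PACKET_SIZE), (3 * PACKET_SIZE, 4 * PACKET_SIZE)]

segments = retransmission_segments(holes, MSS)          # B1 and B2
total_bytes = sum(size for _, size in segments)         # 8192
utilization = total_bytes / (len(segments) * MSS)       # fraction of MSS used
packets_if_contiguous = -(-total_bytes // MSS)          # ceiling division

print(len(segments), total_bytes, round(utilization, 4), packets_if_contiguous)
```

Under these assumptions, two IP packets carry 8192 bytes in total, each using roughly one fifteenth of the available MSS, whereas the same number of bytes would fit within a single MSS if the data were contiguous.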