This invention relates to digital packet transmission, and particularly to a method for fast, reliable byte stream transport in communication environments.
A computer network ties a number of computers, terminals and wireless devices together for the exchange of information. These computers, terminals and wireless devices are also called nodes of the network. The main protocol suite in use in computer networks, including the Internet, is TCP/IP. TCP stands for Transmission Control Protocol and IP stands for Internet Protocol. These protocols are defined by the Internet Engineering Task Force (IETF), with specifications available at www.ietf.org. IP provides point-to-point datagram delivery and is potentially unreliable. TCP runs on top of IP and implements reliable, in-order, end-to-end delivery of byte streams between nodes.
TCP's View of the Network
TCP is designed to cope with networks that are potentially unreliable. In fact, TCP makes the following assumptions about the network:
The network can drop packets because of intermittent faults or because of congestion, which leads to buffer overruns or long routing delays.
The packets that make up the byte stream may be delivered out of the order in which they were transmitted at the source.
Part or all of the packet data can be corrupted; a packet to which this happens is dropped.
The amount of buffer space available at routers on the way to the destination, or at the destination itself, is unknown; the sender has to discover this dynamically and adjust its sending rate to avoid packet losses.
These assumed characteristics of the network have driven the features that are in today's TCP standard.
TCP's Artifacts for Coping with the Assumed Network Characteristics
To cope with these assumed characteristics of the network, TCP employs the following mechanisms to guarantee end-to-end reliable byte stream transport:
(a) A retransmission mechanism based on the use of acknowledgments from the receiver and a timeout facility for a transmitted packet at the sender. The duration of this timeout period is dynamically updated to reflect the recently perceived delay in the network.
(b) A window-based flow control mechanism to limit the number of packets a sender can transmit without receiving acknowledgments. The net effect is to limit the number of packets in transit.
(c) A congestion control mechanism that is integrated into the window mechanism to throttle the sender when the sender persistently perceives packet losses. The congestion control mechanism of TCP also allows the sender's transmission rate to ramp up again when the congestion eases.
(d) Checksumming to protect data integrity. Two checksums, a TCP checksum and an IP checksum, are computed for every packet. The TCP checksum is computed over the TCP pseudoheader (made up of the IP addresses of the endpoints, the protocol number and the TCP length), the TCP header (which carries the port numbers) and the packet data. The independent IP checksum protects the integrity of the IP header.
(e) A packet reassembly facility to collate received packets in proper order of the byte sequence.
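The arithmetic behind the checksums in (d) can be sketched as follows. This is a minimal illustration of the 16-bit ones'-complement Internet checksum, not the code of any particular TCP stack; the function name `internet_checksum` is an assumption made for this example. Note that every byte of the input passes through the processor, which is the source of the data-movement cost discussed under Performance Implications.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """Return the 16-bit ones'-complement of the ones'-complement sum of data.

    Illustrative helper (hypothetical name); the same arithmetic is used for
    the IP header checksum and the TCP checksum.
    """
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with one zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):  # big-endian 16-bit words
        total += word
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


payload = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(payload)
# A receiver recomputes the checksum over data plus the transmitted checksum;
# a result of zero means no error was detected.
verified = internet_checksum(payload + struct.pack("!H", checksum)) == 0
```

A receiver that obtains a nonzero result on verification treats the packet as corrupted and drops it.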
Performance Implications
These mechanisms do not come for free: using them incurs substantial protocol overhead, which manifests as high end-to-end delay.
Specifically, the overhead results from the following:
1. The cost of implementing the retransmission timer (as well as other timers not explicitly connected with the timeout mechanism). This cost has several components:
(i) The bookkeeping overhead for the timers: linking each transmitted packet into the queue of packets that have not been acknowledged (i.e., packets that may need to be retransmitted) and unlinking it when it is acknowledged.
(ii) The overhead of dynamically computing accurate estimates of the round trip time (RTT), whose value determines the duration of the retransmission timer.
(iii) The overhead of hardware timer interrupts. All of the timers used by TCP are implemented in software, using a hardware timer for the ticks. To implement the software timers, each of the locations implementing such a timer has to be decremented whenever the hardware “tick” timer generates an interrupt. The interrupt handling time is usually quite high. Notice that these timer manipulations, triggered by hardware “tick” timer interrupts, are performed even when transmitted packets are acknowledged.
2. The bookkeeping overhead for the windowing mechanism. During routine transfers without errors or packet dropping, additional code is executed to monitor and update the current state of the connection.
3. The checksum computations (for the TCP and IP checksums) typically involve repeated movement of part or all of the packet data through the processor cache and the memory, resulting in serious performance degradation due to cache pollution. This is particularly detrimental on modern CPUs, where clock rates continue to increase dramatically while memory system speeds remain practically flat. In some implementations, the situation is aggravated when the checksums are computed incrementally in a distributed fashion.
4. On a packet loss, TCP initiates the retransmission of packets starting with the one that was not acknowledged. This results in unnecessary retransmission of packets that may already have been received correctly.
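To make the per-acknowledgment cost in item 1(ii) concrete, the following sketch shows a smoothed RTT estimator in the style specified by RFC 6298. It is an illustration under stated assumptions, not the code of any particular TCP implementation: the class name `RtoEstimator` and the `min_rto` parameter are invented for this example, and the clock-granularity term of the RFC is omitted.

```python
class RtoEstimator:
    """Smoothed RTT and retransmission timeout, RFC 6298 style (illustrative)."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variation
    K = 4          # variation multiplier in the timeout formula

    def __init__(self, min_rto: float = 1.0):
        self.srtt = None      # smoothed round trip time, in seconds
        self.rttvar = None    # round trip time variation, in seconds
        self.min_rto = min_rto

    def update(self, sample: float) -> float:
        """Fold in one RTT sample (seconds) and return the new timeout."""
        if self.srtt is None:
            # First measurement initializes both state variables.
            self.srtt = sample
            self.rttvar = sample / 2
        else:
            # The variation is updated first, using the previous smoothed RTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - sample)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * sample
        return max(self.min_rto, self.srtt + self.K * self.rttvar)


estimator = RtoEstimator(min_rto=0.0)
first_rto = estimator.update(0.100)   # first sample of 100 ms
second_rto = estimator.update(0.100)  # a second, identical sample
```

The computation itself is small, but it runs for every RTT sample on every connection, in addition to the timer queue manipulation described in item 1(i).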
This protocol overhead severely degrades the latency and limits the bandwidth of networks. When TCP was originally developed, the software overhead was very small compared to the overall communication time because network speeds were slow. Today this has changed: the speed of modern networks has increased dramatically relative to the processing power of networking nodes. Thus TCP software overhead is now a significant portion of the overall end-to-end communication delay. This relative increase in software overhead severely restricts the performance of modern networks and prevents the full potential of networking hardware from being realized. Even with its poor latency characteristics, TCP remains the networking protocol of choice because of its support for client-server applications, its large installed base and its compatibility with legacy code. In fact, compatibility is often even more important than performance. For example, modern low-latency technologies such as ATM implement TCP on top of their native protocols just to gain compatibility with existing networking software. To exploit the capabilities of modern high-end networking hardware, it is essential to reduce the overhead in the TCP protocol.
Over the years, some of the inefficiencies of TCP have been recognized and a variety of improvements to the protocol have been suggested. Some of the techniques proposed for speeding up TCP have appeared as Requests for Comments (RFCs) published by the Internet Engineering Task Force (IETF, accessible at http://www.ietf.org) and are fairly well known. What follows is a summary of the more common approaches taken to improve TCP performance.
1) SACKs: One well-known technique is selective acknowledgments (SACKs), described in RFC 2018, “TCP Selective Acknowledgment Options”, by Mathis et al. Here a single SACK acknowledges the reception status of a group of consecutive packets. By using a bit vector within the SACK, the sender is told explicitly which packets in the group have been received properly and which have been lost. The sender then (selectively) retransmits only the lost packets. Thus this technique improves the retransmission response time for lost packets. However, this technique has two main inefficiencies: first, the bit vector has to be scanned to determine the identity of the lost packets; second, ACKs are explicitly sent and processed, with the associated timer management and bookkeeping overhead.
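The scan named above as the first inefficiency can be sketched as follows. The sketch follows the bit-vector description given in this summary rather than the exact option layout of RFC 2018; the function name `lost_packets`, its parameters and the bit ordering (least significant bit covers the first packet of the group) are assumptions made for illustration.

```python
def lost_packets(first_seq: int, received_bits: int, group_size: int) -> list[int]:
    """Return the packet numbers whose bit is 0 in the acknowledgment vector.

    Bit i of received_bits (least significant bit first) is assumed to
    describe packet first_seq + i; a 1 means the packet was received.
    """
    return [
        first_seq + i
        for i in range(group_size)
        if not (received_bits >> i) & 1
    ]


# Group of eight packets starting at 100; bits 1 and 4 are clear.
to_retransmit = lost_packets(100, 0b11101101, 8)
```

The cost of the scan grows with the group size, and it is paid by the sender for every such acknowledgment it processes.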
2) Negative Acknowledgements (NAKs or NACKs) and Larger Windows: In RFC 1106, “TCP Big Window and NAK Options”, by R. Fox, the use of NAKs and larger windows has been proposed to enhance the efficiency of connections that have a large bandwidth-delay product (such as satellite links). NAKs improve the retransmission response time for lost packets, but do not reduce overhead because ACKs are still used. Additionally, the NAKs used here are “advisory”, meaning that implementations can ignore them with no impact.
3) Delayed ACKs: In RFC 1122, “Requirements for Internet Hosts -- Communication Layers”, edited by R. Braden, delayed ACKs reduce processing demands by reducing the total number of ACKs sent. However, this has limited effect because the timer management and bookkeeping overhead remains the same.
4) Reduced Number of ACKs: U.S. Pat. No. 6,038,606, “Method and Apparatus for Scheduling Packet Acknowledgements”, by Brooks et al., reduces the number of ACKs needed in TCP. During the initial slow start phase of TCP, an ACK is sent for every two packets. Once the connection is running at full speed, an ACK is sent only for every W-2 consecutive packets, where W is the number of packets that fit in one window. The sender's timeouts must be set long enough that they do not expire during the transmission of a full window's worth of packets. If congestion occurs, the normal TCP ACK technique is used. This technique has limited impact on performance since timers for all packets are still maintained.
5) Delayed Processing: U.S. Pat. No. 5,442,637, “Reducing the Complexities of the Transmission Control Protocol for a High-Speed Networking Environment”, by M. Nguyen, cuts back on processing at the receiver by deferring processing until N packets have been received. The receiver then processes all control information in these packets at once. This cuts down on the number of timers needed for each packet and improves performance. On the downside, it causes the system to start up more slowly than usual. To counter this, rate-based flow control is added to the system.
6) Smart ACKs: U.S. Pat. No. 5,245,616, “Technique for Acknowledging Packets”, by G. Olson, describes an ACK that contains a bit vector indicating the status of the current packet and the seven previous packets. If an ACK is lost due to an error on the line, it is very likely that a subsequent ACK will carry the status of the packets that the lost ACK covered. This redundant information prevents the sender from retransmitting when it is not needed. In addition, the vector is used to indicate that a packet was dropped and must be retransmitted. This reduces the amount of time needed to trigger a retransmission, but it does not reduce timer overhead.
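The redundancy described here can be illustrated with a short sketch. It is not taken from the cited patent: the function names `build_ack` and `apply_ack`, the tuple representation of an ACK and the bit ordering (bit k describes the packet k positions before the current one) are all assumptions made for this example.

```python
def build_ack(current_seq: int, received: set[int]) -> tuple[int, int]:
    """Build an ACK whose bit k is set if packet current_seq - k was received (k = 0..7)."""
    vector = 0
    for k in range(8):
        if current_seq - k in received:
            vector |= 1 << k
    return current_seq, vector


def apply_ack(ack: tuple[int, int], unacked: set[int]) -> set[int]:
    """Remove acknowledged packets from unacked and return those reported missing."""
    seq, vector = ack
    missing = set()
    for k in range(8):
        packet = seq - k
        if (vector >> k) & 1:
            unacked.discard(packet)   # status learned, even if earlier ACKs were lost
        elif packet in unacked:
            missing.add(packet)       # explicitly reported as not received
    return missing


# The receiver got packets 0, 1, 2, 4 and 5; the ACKs for packets 0..4 were all lost.
receiver_has = {0, 1, 2, 4, 5}
sender_unacked = {0, 1, 2, 3, 4, 5}
needs_retransmit = apply_ack(build_ack(5, receiver_has), sender_unacked)
```

In this run a single surviving ACK clears five outstanding packets and identifies packet 3 as the only one to retransmit. The sender still has to keep a timer for every outstanding packet, which is the overhead the technique leaves in place.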
7) Sliding Window Adjustment Techniques: U.S. Pat. No. 6,219,713, “Method and Apparatus for Adjustment of TCP Sliding Window with Information about Network Conditions”, by J. Ruutu et al., describes a technique to modify TCP's sliding window based on load condition and traffic congestion for the network. Additionally, U.S. Pat. No. 6,205,120, “Method and Apparatus for Transparently Determining and Setting an Optimal Minimum Required TCP Window Size”, by Packer et al., transparently modifies a receiver window size based on network latency. These methods provide some performance improvement under certain conditions but they are still bound by the inefficiencies of TCP's windowing mechanism.
All of these mechanisms are piecemeal fixes for the inefficiencies associated with the windowing mechanism of TCP and thus have had limited success. None of these techniques reduces the overhead from TCP's windowing mechanism or retransmission timers. This overhead severely degrades the latency and limits the bandwidth of modern LANs. Thus there is a significant opportunity to design a reliable byte stream transport system that has significantly less overhead than TCP. In so doing, the full potential of modern low-latency network technologies can be attained.
One of the main reasons for TCP's significant overhead is that its design is based on older, unreliable network technology. Today's networking technologies are more reliable than TCP assumes. This is particularly the case in local area networks (LANs). In modern networking technologies, the following conditions hold:
Packets are rarely dropped.
Packets are not delivered out of sequence.
Packets are rarely corrupted.
Many of these conditions also hold for quality-conscious switched networks larger than LANs. Thus it would be advantageous to take a more optimistic approach, consistent with the above observations for modern networks, and provide a reliable byte stream transport system with less software overhead. This in turn would greatly improve end-to-end latency and effective bandwidth within modern networks.
It would also be advantageous to make this new transport system fully compliant with the current application programming interface (API) of TCP. This would allow all current client-server networking applications to run without any change or recompilation.
It would also be advantageous to provide a mechanism that can distinguish between packets meant for standard TCP and packets meant for the new byte stream transport system, and forward the data to the corresponding transport system. This would allow full interoperability among hosts running traditional TCP implementations and hosts running the new byte stream transport system.