In recent years there has been a proliferation in the networking of computer systems. The recent expansion of the Internet is just one example of the trend toward distributed computing and information sharing. In most forms of computer or communication networking there are communication paths between the computers in the networks. These paths may include multiple links or hops between intermediate equipment in a path. Thus, a communication may be originated by a first computer and pass through several links before reaching the destination computer. The control over these communications is typically carried out under a networking architecture. Many networking architectures exist for defining communications between computers in a network. For example, Systems Network Architecture (SNA) and Transmission Control Protocol/Internet Protocol (TCP/IP) are two existing network architectures.
One existing network architecture for controlling communications between computers is known as Advanced Peer to Peer Networking (APPN). APPN, like many networking architectures, is based upon the transmission of data packets where a communication is broken into one or more “packets” of data which are then transmitted from the source to the destination over the communication path. Packet based communication allows for error recovery of less than an entire communication, which improves communication reliability, and allows for packets to take multiple paths to an end destination, thus improving communication availability.
One error condition which many networks attempt to correct for is packet loss. Packet loss in a network may be broadly characterized as resulting from congestion on the path from the source to the destination or from loss of data (bit error) by links in the path. Congestion may result from too high a data packet rate for a path. Bit error may, however, result from any number of failures in a communication link. For example, sun spots may adversely impact microwave transmissions and cause loss of data. However, bit error occurrences are generally highly correlated. As a result, a time averaged bit error rate (BER) alone may not accurately describe line quality. Line quality is, therefore, usually described using a combination of an average BER over some time period along with the number of seconds in the time period in which one or more bit errors occur.
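The point about correlated bit errors can be illustrated with a minimal sketch. The function name, the per-second error traces and the 1,000,000 BPS link rate below are hypothetical values chosen for illustration only. The two traces have the same average BER but very different counts of errored seconds, which is why the average alone may not describe line quality.

```python
def line_quality(errors_per_second, bits_per_second):
    """Return (average BER, errored seconds) for one measurement period.

    errors_per_second: bit errors observed in each second of the period.
    bits_per_second: link rate, assumed constant over the period.
    """
    total_bits = bits_per_second * len(errors_per_second)
    average_ber = sum(errors_per_second) / total_bits
    # An errored second is a second in which one or more bit errors occur.
    errored_seconds = sum(1 for e in errors_per_second if e > 0)
    return average_ber, errored_seconds


# Two hypothetical 10-second traces, each with 10 bit errors in total.
spread = [1] * 10        # one error in every second
bursty = [0] * 9 + [10]  # all errors concentrated in a single second

spread_ber, spread_es = line_quality(spread, 1_000_000)
bursty_ber, bursty_es = line_quality(bursty, 1_000_000)
```

Both traces yield the same average BER, yet the first has ten errored seconds and the second only one.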
While APPN has proven to be a reliable networking architecture, increasing computer networking demands have created a need for network architectures which utilize the higher performance communication systems and computer systems currently available. In part because of these demands, High Performance Routing (HPR), which is an enhancement to APPN, was developed. Processing capability has increased and become less expensive. This has driven the need for larger peer-to-peer networks. Link technology has advanced by several orders of magnitude over the past decade. Advances in wide area links have dramatically increased transmission rates and decreased error rates. Thus, to take advantage of these advances, HPR provides high speed data routing which includes end-to-end recovery (i.e., error recovery is performed by the sending and receiving systems) and end-to-end flow and congestion control where the flow of data is controlled by the sending and receiving systems.
HPR consists of two main components: the Rapid Transport Protocol (RTP) and automatic network routing (ANR). RTP is a connection-oriented, full-duplex transport protocol designed to support high speed networks. One feature of RTP is to provide end-to-end error recovery, with optional link level recovery. RTP also provides end-to-end flow/congestion control. Unlike TCP's reactive congestion control, RTP provides an adaptive rate based mechanism (ARB).
ARB provides end-to-end flow control to prevent buffer overrun at the RTP endpoints, a rate based transmission mechanism that smooths input traffic and a preventive congestion control mechanism that detects the onset of congestion and reduces the RTP send rate until the congestion has cleared. The ARB preventive congestion control mechanism attempts to operate the network at a point below the “cliff” (shown in FIG. 1) and to prevent congestion. A reactive mechanism, on the other hand, detects when the network has entered the region of congestion and reacts by reducing the offered load.
In RTP, the ARB mechanism is implemented at the endpoints of an RTP connection. Each endpoint has an ARB sender and an ARB receiver. The ARB sender periodically queries the receiver by sending a rate request to the ARB receiver, which responds with a rate reply message. The sender adjusts its send rate based on information received in the rate reply message.
The mechanism used to control the send_rate is as follows. A burst_size parameter sets the maximum number of bytes a sender can send in a given burst at a given send_rate. During each burst_time, defined by burst_size/send_rate, a sender is allowed to send a maximum of burst_size bytes. The receiver continuously monitors network queuing delay looking for the initial stages of congestion. Based on this assessment and also based on the current state of the receiver's buffers, the receiver sends a message to the sender instructing it to either increment the send_rate by a rate increment, keep the send_rate the same, decrement the send_rate by 12.5%, decrement the send_rate by 25%, or decrement the send_rate by 50%.
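The send_rate control described above may be sketched as follows. This is an illustrative sketch only and not the RTP implementation: the enumeration and function names are hypothetical, and the units (bytes and bytes per second) are assumed.

```python
from enum import Enum


class RateReply(Enum):
    """The five instructions a receiver may send (names are hypothetical)."""
    INCREMENT = "increment"
    HOLD = "hold"
    DEC_12_5 = 0.125
    DEC_25 = 0.25
    DEC_50 = 0.50


def burst_time(burst_size, send_rate):
    """Interval in which at most burst_size bytes may be sent."""
    return burst_size / send_rate


def apply_rate_reply(send_rate, reply, rate_increment):
    """Return the new send_rate after processing a rate reply."""
    if reply is RateReply.INCREMENT:
        return send_rate + rate_increment
    if reply is RateReply.HOLD:
        return send_rate
    # The remaining replies carry a fractional decrement of the current rate.
    return send_rate * (1.0 - reply.value)


# Hypothetical values: 8000-byte bursts at 100000 bytes per second.
interval = burst_time(8000, 100000.0)
slower = apply_rate_reply(100000.0, RateReply.DEC_25, rate_increment=5000.0)
faster = apply_rate_reply(100000.0, RateReply.INCREMENT, rate_increment=5000.0)
```

Note that increases are additive (a fixed rate increment) while decreases are proportional to the current send_rate.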
The receiver initiates error recovery as soon as it detects an out of sequence packet by sending a gap detect message that identifies the packets that need to be resent. When the sender receives a gap detect message, it drops its send_rate by 50% and resends the packets at the next send opportunity. If the sender does not get a response to a rate request within a time-out period, the sender assumes the packet is lost and cuts the send_rate by half, increases the rate request time-out exponentially (exponential back off), and transmits a rate request at the next send opportunity.
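The sender's reaction to these two events may be sketched as below. The class and method names are hypothetical, and doubling the time-out is an assumed form of exponential back off, since the text does not state the factor.

```python
class ArbSender:
    """Illustrative sender state; not the actual RTP data structures."""

    def __init__(self, send_rate, rate_request_timeout):
        self.send_rate = send_rate
        self.rate_request_timeout = rate_request_timeout
        self.resend_queue = []             # packets to resend at next opportunity
        self.rate_request_pending = False  # rate request to send at next opportunity

    def on_gap_detect(self, missing_packets):
        # A gap detect message halves the send_rate and schedules the resend.
        self.send_rate *= 0.5
        self.resend_queue.extend(missing_packets)

    def on_rate_request_timeout(self):
        # No rate reply arrived in time: assume loss, halve the rate,
        # back off the time-out (doubling is an assumption) and ask again.
        self.send_rate *= 0.5
        self.rate_request_timeout *= 2
        self.rate_request_pending = True


sender = ArbSender(send_rate=100000.0, rate_request_timeout=1.0)
sender.on_gap_detect([7, 8])      # hypothetical sequence numbers
sender.on_rate_request_timeout()
```

In this sketch every detected loss halves the send_rate regardless of whether the loss was caused by congestion or by a bit error.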
Thus, like many forms of networking, in RTP packet losses are assumed to result from congestion rather than bit errors. Such an assumption may often be valid for modern digital wide area links which exhibit low loss rates. However, these loss rates may not apply to all communication links around the world or even to high quality links all the time.
Furthermore, as RTP provides end-to-end flow control, the send rate of packets on a path may be limited by the slowest link in the path (i.e., the bottle-neck link). Thus, even if a path includes high-speed links, if a single low-speed link is present, the sender and receiver will pace the transmission of packets to accommodate the low-speed link. Thus, a congestion problem or the presence of one low-speed link in a path may degrade the throughput for the entire path.
One way to improve congestion problems or to compensate for differing transmission rates on a communications path is to provide for multiple links between connection points that may be the bottle-neck in the path. HPR provides for such concurrent links through a Multilink Transmission Group (MLTG). Similarly, TCP/IP provides for multiple links with multi-link Point to Point Protocol (PPP). A transmission group is a logical group of one or more links between adjacent nodes that appears as a single path to the routing layer. An MLTG is a transmission group that includes more than one link. Links in an MLTG are referred to herein as sublinks. An MLTG can include any combination of link types (e.g., token-ring, SDLC, frame relay). MLTGs provide increased bandwidth which may be added or deleted incrementally on demand. Furthermore, the combined full bandwidth is available to a session since session traffic can flow over all sublinks in the group. MLTGs also provide increased availability. An individual sublink failure is transparent to sessions using the MLTG.
One drawback of an MLTG is that packets flowing over an MLTG can arrive at the RTP endpoint out of sequence. Thus, RTP must know if an MLTG is in a path. At connection establishment, RTP learns if there is an MLTG in the path. If an MLTG is not in the path, any data received that is out of sequence causes error recovery (i.e., the receiver sends a gap detect message to the sender). If an MLTG is in the path, error recovery is delayed. When the receiver detects out of sequence packets, it initiates a time-out procedure before sending the gap detect message. The time-out procedure allows enough time for all packets to arrive before initiating recovery.
The addition of an MLTG to a path also requires the endpoints of the MLTG to schedule packets to the sublinks of the MLTG. This distribution of packets among the concurrent links is presently accomplished in a number of ways, including round-robin, weighted round-robin and link metered pacing approaches. In a round-robin approach packets are distributed to sublinks in the MLTG by a simple sequential distribution to the links. This approach, however, does not take into account the possibility of differing link rates as well as possible congestion on a link or bit errors on a link in the MLTG.
In the weighted round-robin scheme, the scheduler maintains a count field for each sublink. Going in a fixed (round robin) order, the scheduler assigns a first group of packets to a first sublink, then assigns a second group of packets to a second sublink and so on through all of the links. The count field for a sublink is incremented each time a packet has been assigned to it. Once the count field equals the weight of the sublink, the scheduler moves on to the next sublink in the list. The weight values determine the relative frequency of use of each sublink by the MLTG scheduler. For example, if an MLTG consists of 2 sublinks with weights of 1 and 2 respectively, then the sublink with weight 2 will be allocated twice as much data as the other sublink. However, if the right mixture of dynamics does not exist, it is possible that the flow distribution over the sublinks will deviate from the optimal flow specified by the weights. For example, if small packets flow over one link while large packets flow over another link, the result will be sub optimal RTP throughput (a similar effect occurs if the sublink weight values are incorrect). Furthermore, if loss occurs on one of the sublinks, there is no mechanism to account for the change in throughput of the sublink.
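The weighted round-robin scheme and its sensitivity to packet size may be illustrated with the following minimal sketch. The function name and the packet sizes are hypothetical, and the weights are assumed to be positive integers.

```python
def weighted_round_robin(packets, weights):
    """Assign packets to sublinks; return one list of packets per sublink."""
    assigned = [[] for _ in weights]
    link = 0   # sublink currently being filled
    count = 0  # count field for the current sublink
    for packet in packets:
        assigned[link].append(packet)
        count += 1
        if count == weights[link]:
            # The count equals the weight: move on to the next sublink.
            link = (link + 1) % len(weights)
            count = 0
    return assigned


# Hypothetical traffic: packet sizes in bytes, one large then two small.
packets = [1500, 100, 100] * 4
by_link = weighted_round_robin(packets, weights=[1, 2])
packets_per_link = [len(p) for p in by_link]
bytes_per_link = [sum(p) for p in by_link]
```

With this traffic pattern the packet counts follow the 1:2 weights (4 and 8 packets), but the byte counts are 6000 and 800, so the sublink with the larger weight carries far less data than the weights intend.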
For example, as seen in FIG. 2, at a sustained BER of 10^-6, an RTP connection over a single 1,500,000 BPS link would have an effective throughput of 100,000 BPS. With a 2 link MLTG, if one 750,000 BPS link experienced a sustained BER of 10^-6, the RTP throughput would be roughly 250,000 BPS. The error free link would be significantly underutilized (less than 25%). The solid “O” curve in FIG. 2 illustrates the results of a simulation of RTP performance over an MLTG with two sublinks. The curve illustrates one of several problems associated with running RTP over MLTG. At some point, in this case at a BER of about 3×10^-7, RTP performs worse than if there were just a single (well behaved) link. This inefficiency follows from each packet loss resulting in a send_rate reduction of 50% to both links in the MLTG.
Furthermore, with any weight based MLTG scheduling system the algorithm is dependent on accurate weight values. A weighted round-robin algorithm requires static weights that must be as close to optimal as possible. The weight values typically are based on link speeds and provide a simple way to load balance the flow over the sublinks. Inaccuracy in weighting may be a significant problem. Given the number of multiprotocol link and subnet technologies (e.g., PPP, X.25, multiprotocol encapsulation over frame relay, multiprotocol encapsulation over ATM AAL5), it may be impossible to know the exact throughput available to a particular protocol over a multiprotocol link layer. Consequently, it may be impossible to know the correct weight values that should be assigned to each sublink.
An incremental extension to weighted round-robin MLTG scheduling adds a simple check before the scheduler assigns a packet to a sublink. If the sublink is in error recovery, it will not be used until the link has recovered. To implement this, the MLTG scheduler must monitor when a sublink goes in and out of error recovery state. If the sublink is in error recovery, the packet is submitted to another available sublink. If all links are in recovery, the packet is queued in an MLTG queue until a sublink is available. However, such error recovery may provide minimal improvement over the simple weighted round-robin method. By the time it is learned that a sublink is in recovery, it is too late. The scheduler might have scheduled many packets to the sublink. Also, when operating over a lossy sublink, the link may toggle in and out of error recovery frequently.
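The recovery check may be sketched as follows. The class, its attribute names and the way the scheduler skips a recovering sublink are assumptions made for illustration. How the MLTG queue is drained once a sublink recovers is not described above and is omitted.

```python
from collections import deque


class RecoveryAwareScheduler:
    """Weighted round robin that skips sublinks in error recovery (a sketch)."""

    def __init__(self, weights):
        self.weights = weights
        self.in_recovery = [False] * len(weights)  # updated by link monitoring
        self.link = 0
        self.count = 0
        self.mltg_queue = deque()

    def _advance(self):
        self.link = (self.link + 1) % len(self.weights)
        self.count = 0

    def schedule(self, packet):
        """Return the chosen sublink index, or None if the packet was queued."""
        for _ in range(len(self.weights)):
            if not self.in_recovery[self.link]:
                chosen = self.link
                self.count += 1
                if self.count >= self.weights[chosen]:
                    self._advance()
                return chosen
            self._advance()  # sublink is in recovery: try the next one
        self.mltg_queue.append(packet)  # all sublinks are in recovery
        return None


sched = RecoveryAwareScheduler(weights=[1, 2])
first = sched.schedule("a")    # sublink 0
sched.in_recovery[1] = True
second = sched.schedule("b")   # sublink 1 is skipped, so sublink 0 again
sched.in_recovery[0] = True
third = sched.schedule("c")    # no sublink available: queued
```

The sketch also shows the weakness noted above: the in_recovery flag only affects packets scheduled after the flag is set, so packets already handed to a sublink before the recovery was learned are unaffected.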
The next MLTG scheduling method, which is referred to as link metered pacing, is based on the SEND_MU signal defined by SNA architecture. The Data Link Control layer (DLC) issues a SEND_MU signal to Path Control when it is ready to accept additional frames for transmission. The mechanism allows component level pacing between the DLC and Path Control layers. An Error Recovery Protocol (ERP) DLC typically issues a SEND_MU after one or more frames have been successfully acknowledged. The SEND_MU signal provides the mechanism by which the MLTG scheduler sends a controlled amount of data to a sublink (call this amount the MAX_TOKENS) and then waits for a request for more data. The idea is to keep enough data queued in the DLC to keep the transmitter busy, but to have an upper bound so that the DLC queue level is controlled. If a link goes into error recovery (ER), the queue buildup occurs in the MLTG queue allowing RTP to quickly detect and react to the congestion. Therefore, link metered pacing avoids the queue explosion that can occur with the round-robin methods.
In one manner of implementing link metered pacing, MLTG maintains a MAX_TOKENS variable for each sublink in the transmission group that represents the maximum number of packets that can be sent to a sublink DLC at any time. A PACING_TOKEN_COUNT variable tracks the number of available tokens at any time. The count is initially set to the MAX_TOKENS value. The MLTG scheduler decrements the count as it assigns packets to a sublink. To ensure even flow over all sublinks, the scheduler implements a simple round robin scheduling policy for sublinks that have not run out of tokens. Once a sublink's PACING_TOKEN_COUNT reaches 0, MLTG stops using the sublink. Once a sublink is out of tokens, any other sublink with tokens is used, even if this means violating the round robin sequence.
The sublink DLC has a DLC_SEND_COUNT variable. Each time a frame is acknowledged, the count is incremented. Once the DLC_SEND_COUNT reaches a threshold (call this the DLC_THRESHOLD), the DLC increments the PACING_TOKEN_COUNT by the DLC_THRESHOLD value. The DLC_SEND_COUNT is then reset to 0. As an alternative to a counting technique, a sublink DLC can implement its part of the link metered pacing mechanism by issuing the SEND_MU after each time it completes transmission of a packet from its transmit queue (rather than from a retransmit queue). If a sublink DLC goes into error recovery, it draws packets from its retransmit queue. Thus, there is a natural pacing mechanism that stops the flow of packets from MLTG to the sublink DLC when the sublink experiences delays due to recovery.
The dashed “+” curve in FIG. 3 illustrates simulation results for a link metered pacing method where bit error loss is present on one of the sublinks. As seen in FIG. 3, RTP throughput collapses at a BER in the range of 10^-5. The results show significant improvement over the round robin method (the solid “+” curve illustrates a round robin scheduling method with error recovery enabled and the solid “0” illustrates a round robin scheduling method where error recovery is disabled). However, the throughput of the MLTG still falls below that of using a single sublink if the bit error rate is large enough. An optimized value of MAX_TOKENS may be utilized to improve performance, but this value still depends on statically configured link speed and propagation delay estimates. Obtaining accurate estimates may be difficult without a dynamic measurement. Also, as link quality deteriorates, the original MAX_TOKENS value is no longer optimal.
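The counting form of link metered pacing described above may be sketched as follows. The class and method names are hypothetical, and the MAX_TOKENS and DLC_THRESHOLD values of 2 are chosen only to keep the example short.

```python
class Sublink:
    """Per-sublink pacing state (attribute names follow the text loosely)."""

    def __init__(self, max_tokens, dlc_threshold):
        self.pacing_token_count = max_tokens  # initially set to MAX_TOKENS
        self.dlc_threshold = dlc_threshold
        self.dlc_send_count = 0

    def on_frame_acknowledged(self):
        # DLC side: count acknowledgements, return tokens in batches.
        self.dlc_send_count += 1
        if self.dlc_send_count >= self.dlc_threshold:
            self.pacing_token_count += self.dlc_threshold
            self.dlc_send_count = 0


class LinkMeteredScheduler:
    def __init__(self, sublinks):
        self.sublinks = sublinks
        self.next_index = 0  # round robin position

    def schedule(self):
        """Return the sublink index for the next packet, or None if no
        sublink has tokens (the packet then waits in the MLTG queue)."""
        n = len(self.sublinks)
        for offset in range(n):
            i = (self.next_index + offset) % n
            if self.sublinks[i].pacing_token_count > 0:
                self.sublinks[i].pacing_token_count -= 1
                self.next_index = (i + 1) % n
                return i
        return None


links = [Sublink(max_tokens=2, dlc_threshold=2) for _ in range(2)]
sched = LinkMeteredScheduler(links)
initial = [sched.schedule() for _ in range(5)]  # fifth packet finds no tokens
# Sublink 0 acknowledges two frames; sublink 1 is in recovery and acknowledges none.
links[0].on_frame_acknowledged()
links[0].on_frame_acknowledged()
after_ack = [sched.schedule() for _ in range(3)]
```

In this run the first four packets alternate between the sublinks, and after sublink 1 stalls all new packets flow to sublink 0 as its tokens are returned, which is the natural pacing effect described above.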