1. Field of the Invention
The present invention relates generally to providing reliable data transmission in a computer network. More particularly, the present invention relates to providing an error recovery scheme that consumes minimal bandwidth.
2. Description of the Related Art
A computer network includes two or more agents (e.g., computers and other communication devices) that are connected with one another so that one agent is able to communicate data electronically with another agent by sending messages or data packets (or frames). In addition to providing individual physical connections between agents, a computer network establishes a cohesive architecture that allows the agents to transmit data in an organized fashion. Examples of computer networks include local-area networks (LANs) used in a typical office setting and wide-area networks such as the Internet.
Logically, the architecture of a computer network can be divided into three functional layers: the physical layer, the link layer, and the protocol layer. The physical layer is responsible for electrical transfer of the data packet; the link layer provides (among other things) error-free message delivery and flow control; and the protocol layer carries out high-level functionalities, examples of which include cache coherence, interrupt delivery, and memory access ordering.
One of the key functions of the link layer is to recover from transmission errors. All data transmissions between agents in the network are vulnerable to corruption by noise in the communication channels. Because data corruption in a computer network is unavoidable, each agent must be able to detect when data in a packet has been corrupted and have a protocol or scheme for recovering from the error. While some error recovery schemes are able to correct errors by using error correction codes, such schemes generally require more overhead. Therefore, it is standard practice to detect and discard the corrupted data packet and have the source agent retransmit the original data packet.
The link layer transforms a communication channel with transmission errors into one that appears free of transmission errors and delivers packets in the order they are sent. It accomplishes this task by having the sending agent organize the data into packets (typically a few hundred bytes) and transmit the data packets sequentially. The receiving agent checks each packet for errors (by checking parity, for example) and sends an acknowledgment (ACK) back to the sending agent if the packet is received error-free. The ACK verifies to the sending agent that the data packet was successfully transmitted. If the sending agent does not receive an ACK for a particular data packet within a certain amount of time (determined by the channel delay), it assumes that an error has occurred and retransmits the packet to the receiving agent.
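The check, acknowledge, and retransmit cycle described above can be illustrated with a minimal Python sketch. It is not taken from the present disclosure: the function names, the single even-parity bit, and the `corrupt_attempts` set that scripts which transmissions are corrupted are all assumptions made for illustration, and ACKs are assumed never to be corrupted.

```python
def parity_bit(payload):
    """Even-parity bit computed over every bit of the payload."""
    return sum(bin(byte).count("1") for byte in payload) % 2


def make_packet(payload):
    """Hypothetical packet format: the payload plus one parity bit."""
    return payload, parity_bit(payload)


def is_error_free(packet):
    """Receiver-side check: recompute the parity and compare."""
    payload, parity = packet
    return parity_bit(payload) == parity


def flip_bit(packet):
    """Model channel noise by flipping one bit of the first payload byte.

    A single parity bit only detects an odd number of flipped bits.
    The payload is assumed to be non-empty.
    """
    payload, parity = packet
    corrupted = bytearray(payload)
    corrupted[0] ^= 0x01
    return bytes(corrupted), parity


def send_with_retransmit(payloads, corrupt_attempts):
    """Send payloads one at a time, retransmitting until each is ACKed.

    corrupt_attempts holds the 0-based transmission numbers that the
    simulated channel corrupts. Returns the payloads accepted by the
    receiver and the total number of transmissions used.
    """
    delivered = []
    attempt = 0
    for payload in payloads:
        while True:
            packet = make_packet(payload)
            if attempt in corrupt_attempts:
                packet = flip_bit(packet)
            attempt += 1
            if is_error_free(packet):
                delivered.append(packet[0])  # receiver accepts, ACK returns
                break
            # Receiver discards the packet and sends no ACK; the sender's
            # timeout expires and the loop retransmits the same payload.
    return delivered, attempt


delivered, transmissions = send_with_retransmit([b"a", b"b", b"c"], {1})
print(delivered, transmissions)
```

In this run the second transmission is corrupted, so three payloads take four transmissions to deliver.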
This very basic protocol is known as stop and wait and, as the name suggests, is highly inefficient. The sending agent may transmit only one data packet at a time to the receiving agent and must wait until it receives an ACK before transmitting the next data packet. If there is an error in either the data packet or the ACK, the original data packet must be re-sent before the next packet can be sent. A much more efficient and commonly used protocol is the sliding window protocol, which pipelines the sending of packets and is thus able to “fill” the communication channel with packets in transit and maximize the transmission throughput.
FIG. 1 illustrates a computer network 10 that sends and receives data as a function of time in accordance with the sliding window protocol. Network 10 includes a sending agent 12 and a receiving agent 14, which are coupled to each other through two uni-directional channels 16 and 18. In this example, channels 16 and 18 have a length and capacity of ten data packets each. Assuming that network 10 has a global clock, a data packet sent by sending agent 12 along channel 16 will be received by receiving agent 14 ten clocks after it was sent. The same is true with an ACK transmitted by receiving agent 14 through channel 18 back to sending agent 12.
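To make the benefit of pipelining concrete, the following Python sketch compares how many clocks each protocol needs before the last ACK returns to the sending agent. It is an illustrative calculation only, under assumptions not stated in the text: no transmission errors, zero processing time at either agent, and at most one packet entering a channel per clock. The function names are hypothetical.

```python
def stop_and_wait_clocks(num_packets, one_way_delay):
    """Clocks until the last ACK arrives when packets are sent one at a time.

    Each packet costs a full round trip: one_way_delay clocks for the
    packet to arrive plus one_way_delay clocks for its ACK to return.
    """
    return num_packets * 2 * one_way_delay


def sliding_window_clocks(num_packets, one_way_delay):
    """Clocks until the last ACK arrives when sending is pipelined.

    Packet i enters the channel at clock i, so the last packet enters at
    clock num_packets - 1 and its ACK returns one round trip later.
    """
    if num_packets == 0:
        return 0
    return (num_packets - 1) + 2 * one_way_delay


# Channel delay of ten clocks in each direction, as in the example above.
print(stop_and_wait_clocks(100, 10))   # 2000
print(sliding_window_clocks(100, 10))  # 119
```

Under these assumptions, 100 packets take 2000 clocks with stop and wait but only 119 clocks when the channel is kept full.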
In the sliding window protocol, sending agent 12 assigns a data sequence number to each data packet to identify the packet, such as packet 0. When data packet 0 arrives at receiving agent 14 without being corrupted, receiving agent 14 transmits an ACK 0 (where in this case, the 0 is an expected sequence number) to communicate to sending agent 12 that data packet 0 has arrived. This simple scenario assumes that neither the data packet nor the ACK was corrupted.
Because data packets and their corresponding ACKs may be corrupted at any point in channels 16 and 18, sending agent 12 must maintain a retry queue that stores the packets it sent. If sending agent 12 does not receive an ACK for a particular packet within an amount of time that is greater than the round-trip delay, the packet is retrieved from the retry queue and re-transmitted. Clearly, sending agent 12 must have a scheme for determining when a packet in the retry queue is no longer needed; otherwise, a retry queue of unbounded capacity would be needed. The scheme that the sliding window protocol uses is simple: when sending agent 12 receives an ACK carrying sequence number k, it knows that receiving agent 14 has received packet k and, because packets are delivered in the order they are sent, every packet before it. Sending agent 12 can therefore remove all packets with sequence numbers no greater than k from its retry queue.
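The retry queue bookkeeping described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of any particular system: the class and method names are hypothetical, and sequence numbers are assumed to increase without wrapping around, which a real link with a finite sequence space would have to handle.

```python
from collections import deque


class RetryQueue:
    """Holds sent packets until an ACK shows they are no longer needed."""

    def __init__(self):
        # (sequence number, payload) pairs, oldest first.
        self._pending = deque()

    def record_sent(self, seq, payload):
        """Store a packet at the moment it is transmitted."""
        self._pending.append((seq, payload))

    def on_ack(self, k):
        """Drop every stored packet whose sequence number is at most k."""
        while self._pending and self._pending[0][0] <= k:
            self._pending.popleft()

    def packets_to_retransmit(self):
        """Packets still unacknowledged, in their original send order."""
        return list(self._pending)


queue = RetryQueue()
for seq in range(5):
    queue.record_sent(seq, f"payload-{seq}")

queue.on_ack(2)  # ACK 2 also covers packets 0 and 1
print([seq for seq, _ in queue.packets_to_retransmit()])  # [3, 4]
```

Because a single ACK releases every earlier packet as well, a lost ACK for packet 1 is harmless once the ACK for packet 2 arrives.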
One major problem with the sliding window protocol is the bandwidth overhead incurred by the presence of two sequence numbers in every data packet. Firstly, a data packet must carry its own sequence number. Secondly, it must carry the sequence number of an ACK for the data traffic in the opposite direction. Therefore, in network 10, these two sequence numbers would consume 2 log₂N bits of the bandwidth in each data packet, where N equals the total number of possible sequence numbers.
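As a rough worked example of this overhead, the Python sketch below computes the bits consumed per packet for a given N. The function names are hypothetical, and the sketch assumes each sequence number occupies a whole number of bits, so the logarithm is rounded up when N is not a power of two.

```python
def sequence_number_bits(n):
    """Bits needed to encode n distinct sequence numbers (log2 n, rounded up)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1).bit_length()


def overhead_bits_per_packet(n):
    """Two sequence numbers per packet: the packet's own and the ACK's."""
    return 2 * sequence_number_bits(n)


print(overhead_bits_per_packet(256))   # 16
print(overhead_bits_per_packet(1024))  # 20
```

With N = 256, for instance, every data packet spends 16 bits on sequence numbers regardless of how much payload it carries.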
If the sequence number overhead in each data packet transmitted between agents could be reduced, it would be possible either to reclaim wasted bandwidth or to reduce the cost of the communication channel by using fewer physical wires. Because data-carrying wires are expensive, reducing the number of wires required to carry 2 log₂N bits is very significant, particularly in long communication channels. Therefore, it is highly desirable to have a link-level retry scheme for error recovery that reduces the overhead caused by sequence numbers.