Network interface controllers (NICs) are communication devices used to send and receive messages between one node (computing system) and another through a communication network. NICs reside in every computer system that accesses a network or the Internet. They may be found in laptop computers, wireless PDAs, enterprise servers, and compute-intensive clustered processors, such as research computer clusters.
An existing flow control protocol, known as Stop and Wait “Automatic Repeat Request” (ARQ), transmits a data packet and then waits for an acknowledgment (ACK) from the destination node before transmitting the next packet. As data packets flow through the network from node to node, latency becomes a problem. Latency accumulates across the large number of links in the fabric because each packet requires an acknowledgment of successful receipt from the receiving node before the next packet can be sent from the transmitting node. Consequently, there is an inherent delay resulting from the transit time for the acknowledgment to travel from the receiver back to the transmitting node.
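The latency cost of Stop and Wait ARQ can be estimated with a rough sketch. It assumes an error-free link, a fixed one-way transit delay, and a negligible acknowledgment size; the function names and parameters are hypothetical and chosen only for illustration.

```python
def stop_and_wait_time(num_packets, transmit_time, one_way_delay):
    """Total time to deliver num_packets when each packet must be
    acknowledged before the next is sent (error-free link assumed)."""
    # Per packet: time to put it on the wire, plus its transit to the
    # receiver, plus the ACK's transit back to the transmitter.
    per_packet = transmit_time + 2 * one_way_delay
    return num_packets * per_packet


def link_utilization(transmit_time, one_way_delay):
    """Fraction of time the transmitter spends actually sending data."""
    return transmit_time / (transmit_time + 2 * one_way_delay)
```

Under these assumptions, ten packets that each take 1 time unit to transmit over a link with a 5-unit one-way delay need 110 units in total, and the transmitter is idle for most of that time.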
One solution, which is known as “Go Back n ARQ”, uses sequentially numbered packets, in which a sequence number is sent in the header of the frame containing the packet. In this case, several successive packets are sent up to the limit of the receive buffer, but without waiting for the return of the acknowledgment. According to this protocol, the receiving node only accepts the packets in the correct order and sends request numbers (RN) back to the transmitting node along with the flow control information, such as the state of the receive buffer. The effect of a given request number is to acknowledge all packets prior to the requested packet and to request transmission of the packet associated with the request number. The go back number n is a parameter that determines how many successive packets can be sent from the transmitter in the absence of a request for a new packet.
Specifically, the transmitting node is generally not allowed to send packet i+n before packet i has been acknowledged (i.e., before packet i+1 has been requested). Thus, if i is the most recently received request number from the receiving node, there is a window of n packets that the transmitter is allowed to send before receiving the next acknowledgment. In this protocol, if there is an error, the entire window must be resent because the receiver accepts packets only in order. Thus, even if the error lies near the end of the window, the entire window must be retransmitted. This protocol is most suitable for large-scale networks having high probabilities of error. In this protocol, the window size n is based on the size of the receive buffer, so the transmitter does not send more data than the receiver can buffer. Consequently, at start-up, the two nodes must exchange information regarding the size of their buffers and default to the smaller of the two buffers during operation.
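The windowing and in-order acceptance rules can be illustrated with a simplified, round-based simulation. This is a sketch under stated assumptions, not a definitive implementation: the transmitter sends one full window, then learns the receiver's request number before sending again; the channel loses only the first transmission of selected packets; and acknowledgments are never lost. The name `go_back_n` and its parameters are hypothetical.

```python
def go_back_n(packets, n, drop_first_attempt=frozenset()):
    """Simulate Go Back n ARQ over a channel that loses the first
    transmission of each sequence number in drop_first_attempt.

    Returns (delivered, transmissions): the payloads accepted by the
    receiver in order, and the number of packets put on the wire.
    """
    if n < 1:
        raise ValueError("window size n must be at least 1")

    delivered = []           # receiver side: packets accepted so far
    rn = 0                   # request number: next packet the receiver wants
    base = 0                 # oldest unacknowledged packet at the transmitter
    already_dropped = set()  # packets whose first transmission was lost
    transmissions = 0

    while base < len(packets):
        # Transmitter: send every packet in the window [base, base + n).
        for seq in range(base, min(base + n, len(packets))):
            transmissions += 1
            if seq in drop_first_attempt and seq not in already_dropped:
                already_dropped.add(seq)  # lost in transit
                continue
            # Receiver: accept only the packet it is waiting for;
            # anything that arrives out of order is discarded.
            if seq == rn:
                delivered.append(packets[seq])
                rn += 1
        # The request number acknowledges every packet before rn,
        # so the window slides forward to start at rn.
        base = rn

    return delivered, transmissions
```

With six packets, a window of three, and the first transmission of packet 1 lost, the sketch delivers all six packets in order but uses eight transmissions: packet 2 arrives intact in the first window, is discarded as out of order, and must be sent again.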
In an architecture that permits large data packets, unnecessarily retransmitting packets that were received correctly can become a significant efficiency concern. For example, retransmitting an entire window of data packets, each on the order of 4 gigabytes, would be relatively inefficient.
Other known flow control protocols require retransmission of only the packet received in error. This requires the receiver to maintain a buffer of the correctly received packets and to reorder them upon successful receipt of the retransmitted packet. While keeping the bandwidth requirements to a minimum, this protocol significantly complicates the receiver design as compared to that required by “Go Back n ARQ”. Many of the network architectures in use today are highly reliable and the risk of a dropped packet is minimal. In these environments, large groupings of computers known as computer clusters share large amounts of data across the network.
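The additional receiver state that such a protocol requires can be sketched as follows. The class name `ReorderingReceiver`, its fields, and the convention that sequence numbers start at 0 are assumptions made for illustration; the text does not specify a particular design.

```python
class ReorderingReceiver:
    """Buffers out-of-order packets and releases them in sequence."""

    def __init__(self):
        self.next_seq = 0    # lowest sequence number not yet delivered
        self.pending = {}    # seq -> payload, received ahead of next_seq
        self.delivered = []  # payloads handed onward, in order

    def receive(self, seq, payload):
        """Accept one packet and return the sequence numbers that are
        still missing below the highest one buffered (these are the
        only packets that would need retransmission)."""
        if seq >= self.next_seq:
            self.pending[seq] = payload  # duplicates of old packets are ignored
        # Release the longest in-order run that is now available.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]
```

If packets 0, 2, and 3 arrive and packet 1 is lost, the receiver holds packets 2 and 3 in `pending` and reports only packet 1 as missing. Once packet 1 is retransmitted, all three are released in order. The buffering and gap tracking shown here are the design complexity that an in-order-only receiver avoids.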
Computer clusters are multiple-node computer systems that may have more than 1000 nodes. All nodes in a computer cluster are networked so that any node can send to or receive from any other node. Techniques such as message passing allow messages to be sent from any node to any other node. A single NIC on a source node can send a message to any NIC on any destination node. Conversely, a single destination NIC might receive a message from any source NIC. The arrival of received messages cannot be easily predicted, and there is substantial risk that a NIC's receive buffer may be too small to hold all received messages. In this case, messages may be lost.
Credit-based flow control is used to prevent remote senders from sending messages to a receiver when there may be insufficient space to store received messages. Credits are associated with free storage. Initially, all storage is unused or free, and the sum of all credits for a buffer should not exceed the total free storage space provided by the buffer. Available credits can be given to any sender, which diminishes the pool of free credits. When a sender sends a message that fits within its available credits, the sender is guaranteed that there is sufficient space in the receiver.
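The credit accounting described above can be sketched as follows. This is a minimal illustration, assuming that one credit corresponds to one unit of storage and that credits return to the free pool when a stored message is consumed; the class and method names are hypothetical.

```python
class CreditedReceiveBuffer:
    """Receive buffer whose free storage is represented by credits."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.free_credits = capacity  # credits not yet granted to any sender
        self.used = 0                 # storage occupied by received messages

    def grant(self, amount):
        """Give a sender up to `amount` credits from the free pool."""
        granted = min(amount, self.free_credits)
        self.free_credits -= granted
        return granted

    def store(self, size):
        """Store a message that a sender paid for with its credits."""
        if self.used + size > self.capacity:
            raise OverflowError("message sent without sufficient credits")
        self.used += size

    def release(self, size):
        """A stored message was consumed; its storage becomes free again."""
        self.used -= size
        self.free_credits += size


class Sender:
    """Sender that transmits only messages covered by its credits."""

    def __init__(self):
        self.credits = 0

    def send(self, receiver, size):
        if size > self.credits:
            return False          # hold the message until more credits arrive
        self.credits -= size
        receiver.store(size)      # guaranteed to fit
        return True
```

In this sketch, the free credits, the credits held by senders, and the storage in use always sum to the buffer capacity, so a message covered by a sender's credits cannot overflow the buffer.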
Prior art credit management systems manage the flow of credits between a sender and a single receive buffer that receives only from that sender. These credit management solutions use connection-based credit management. In this case, a distinct receive buffer is allocated for every potential sender. Credits are exchanged on a per-connection basis between a single sender and a single receiver. For computer clusters consisting of a very large number of nodes, this is a wasteful approach that may require, for example, more than a thousand dedicated receive buffers, most of which are empty at any moment in time.
Thus, a need still remains for a virtual network interface system with memory management. In view of the increasing use of computer clusters to address massive compute problems, it is increasingly critical that answers be found to these problems. In view of the ever-increasing commercial competitive pressures, along with growing consumer expectations and the diminishing opportunities for meaningful product differentiation in the marketplace, it is critical that answers be found for these problems. Additionally, the need to improve efficiencies and performance, and meet competitive pressures, adds an even greater urgency to the critical necessity for finding answers to these problems.
Solutions to these problems have been long sought but prior developments have not taught or suggested any solutions and, thus, solutions to these problems have long eluded those skilled in the art.