1. Field of the Invention
The present invention relates generally to network communications, and more particularly, to efficient transmission of bursts of packets for optimizing network traffic utilization.
2. Description of Related Art
A typical computer system includes a processor subsystem (including one or more processors), a memory subsystem (including main memory, cache memory, etc.; also sometimes referred to herein as “host memory”), and a variety of “peripheral devices” connected to the processor subsystem via a peripheral bus. Peripheral devices may include, for example, keyboard, mouse and display adapters, disk drives and CD-ROM drives, network interface devices, and so on. The processor subsystem communicates with the peripheral devices by reading and writing commands and information to specific addresses that have been preassigned to the devices. The addresses may be preassigned regions of a main memory address space, an I/O address space, or another kind of configuration space. Communication with peripheral devices can also take place via direct memory access (DMA), in which the peripheral devices (or another agent on the peripheral bus) transfer data directly between the memory subsystem and one of the preassigned regions of address space assigned to the peripheral devices.
When large amounts of data are to be transferred between the memory subsystem and a peripheral device, it is usually highly inefficient to accomplish this by having the processor subsystem retrieve the data from memory and write it to the peripheral device, or vice-versa. This method occupies an enormous amount of the processor's time and resources, which could otherwise be used to advance other processing jobs. It is typically much more efficient to offload these data transfers to a data transfer DMA engine, which can control the transfers while the processor subsystem works on other jobs. The processor subsystem controls the data transfer DMA engine by issuing DMA commands to it, the commands identifying in one way or another the starting address in either host memory or the peripheral device or both, and the length of the transfer desired. DMA commands are also sometimes referred to herein as DMA descriptors, and the portion of a DMA command that identifies a starting address is sometimes referred to herein as a pointer. As used herein, “identification” of an item of information does not necessarily require the direct specification of that item of information. Information can be “identified” in a field simply by referring to the actual information through one or more layers of indirection, or by identifying one or more items of different information which are together sufficient to determine the actual item of information.
For example, a pointer “identifying” a starting address in host memory may specify the entire physical host memory address, or it may specify an address in a larger memory address space which is mapped to a physical address, or it might specify a virtual address which is mapped to a physical address, or it might specify an address in a user address space which is mapped to a physical address in further dependence upon a user ID of some kind, or it may identify in any of these ways an address that is one less or one greater than the actual starting address identified, and so on. In addition, the term “indicate” is used herein to mean the same as “identify”.
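By way of illustration only, the following Python sketch models a DMA descriptor as a pointer plus a length, and shows two of the ways in which a pointer might “identify” a starting address: directly, or through one layer of indirection. The names DmaDescriptor, PAGE_SIZE, page_table, resolve_direct and resolve_virtual, as well as the 4 KiB page size and the table contents, are assumptions made for the example and are not taken from any particular device.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size, for illustration only


@dataclass(frozen=True)
class DmaDescriptor:
    pointer: int  # identifies the starting address, possibly indirectly
    length: int   # length of the desired transfer, in bytes


# Hypothetical mapping of virtual page numbers to physical page numbers.
page_table = {0x10: 0x7A, 0x11: 0x03}


def resolve_direct(desc):
    """The pointer specifies the entire physical address."""
    return desc.pointer


def resolve_virtual(desc, table):
    """The pointer specifies a virtual address that is mapped to a physical one."""
    virtual_page, offset = divmod(desc.pointer, PAGE_SIZE)
    return table[virtual_page] * PAGE_SIZE + offset


desc = DmaDescriptor(pointer=0x10 * PAGE_SIZE + 0x20, length=256)
physical_if_direct = resolve_direct(desc)
physical_if_virtual = resolve_virtual(desc, page_table)
```

The same descriptor thus identifies different physical starting addresses depending on the convention agreed between the issuer of the command and the DMA engine.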
In various different computer system arrangements, the data transfer DMA engine may be located across a communication channel from the source of the DMA commands. Often this communication channel is the same as the peripheral bus via which the data itself is transferred, but in some systems it could involve a different bus, either instead of or in addition to the peripheral bus. Often it is advantageous to transfer DMA commands to the data transfer DMA engine in bursts rather than individually, especially where the communication channel supports a burst transfer mode. In a burst transfer mode, multiple data units can be transferred based on only a single starting address identification because the logic on both sides of the communication channel knows and agrees on how to increment the address automatically for the second and subsequent data units. If the communication bus is shared by other agents, then bursts can be advantageous even if there is no special burst transfer mode because arbitration delays are reduced.
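The saving offered by a burst transfer mode can be sketched as follows. The example is a simplified model only: the function names individual_writes and burst_write, the dictionary standing in for the memory on the far side of the channel, and the increment of one address per data unit are all assumptions made for illustration.

```python
def individual_writes(memory, addressed_units):
    """Send every data unit together with its own address identification."""
    for address, unit in addressed_units:
        memory[address] = unit
    return len(addressed_units)  # one address identification per unit


def burst_write(memory, start_address, data_units):
    """Send one starting address; the receiver increments it for later units."""
    address = start_address
    for unit in data_units:
        memory[address] = unit
        address += 1  # increment rule assumed to be known to both sides
    return 1  # a single address identification for the whole burst


units = ["cmd0", "cmd1", "cmd2", "cmd3"]
mem_a, mem_b = {}, {}
sent_individually = individual_writes(
    mem_a, [(0x100 + i, u) for i, u in enumerate(units)]
)
sent_in_burst = burst_write(mem_b, 0x100, units)
```

Both approaches leave the destination memory in the same state, but in this model the burst requires one address identification where the individual transfers require four.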
For the same reasons that it is advantageous to offload data transfers to a data transfer DMA engine, it is often advantageous to also offload DMA command transfers to a command transfer DMA engine. The command transfer DMA engine may be the same as or different from the data transfer DMA engine in different embodiments. In order to use a command transfer DMA engine, the processor subsystem creates a DMA command queue in a memory that is accessible to the processor subsystem without crossing the communication channel. Typically the DMA command queue is created in the memory subsystem. The processor subsystem then programs the command transfer DMA engine to transfer one or more DMA commands, across the communication channel, from the queue to a local memory that is accessible to the data transfer DMA engine without again crossing the communication channel. Typically the programming of the command transfer DMA engine includes, among other things, programming in the host memory address from which the first data transfer DMA command is to be read, the address in the local memory to which the first data transfer DMA command is to be written, and an identification of the length of the transfer. The data transfer DMA engine then reads the DMA commands from the local memory and executes them in a known sequence.
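The two-stage arrangement described above can be sketched in Python as follows. This is a minimal model under stated assumptions, not a description of any actual engine: DmaCommand, command_transfer and data_transfer are hypothetical names, host memory is modeled as a byte array, the queue and the local memory are modeled as lists indexed by entry number, and wrap-around of the queue is ignored.

```python
from collections import namedtuple

# A data transfer DMA command: a pointer into host memory and a length.
DmaCommand = namedtuple("DmaCommand", ["pointer", "length"])


def command_transfer(host_queue, local_store, host_read_addr,
                     local_write_addr, count):
    """Copy `count` commands from the host queue into local memory."""
    for i in range(count):
        local_store[local_write_addr + i] = host_queue[host_read_addr + i]
    # Return the advanced read and write positions.
    return host_read_addr + count, local_write_addr + count


def data_transfer(host_memory, local_store, first, count):
    """Execute commands from local memory in order; return the data moved."""
    moved = []
    for cmd in local_store[first:first + count]:
        moved.append(bytes(host_memory[cmd.pointer:cmd.pointer + cmd.length]))
    return moved


host_memory = bytearray(b"headerpayloadtrailer")
host_queue = [DmaCommand(0, 6), DmaCommand(6, 7), DmaCommand(13, 7)]
local_store = [None] * 8

next_read, next_write = command_transfer(host_queue, local_store, 0, 0, 3)
transferred = data_transfer(host_memory, local_store, 0, 3)
```

The processor subsystem's only involvement in this model is to build host_queue and to supply the three programming values (read address, write address, and length); the engines do the rest.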
One type of peripheral device that often requires the transfer of large amounts of data between the peripheral device and the memory subsystem is a network interface device. Network interface devices were historically implemented on plug-in cards, and therefore are sometimes referred to as network interface cards (NICs). As used herein, though, a NIC need not be implemented on a card. For instance, it could be in the form of integrated circuits (ICs) and connectors fitted directly onto a motherboard, or in the form of macrocells fabricated on a single integrated circuit chip with other components of the computer system. Since a NIC will typically both transmit data onto and receive data from a network, the processor subsystem may set up two DMA command queues in the memory subsystem, a transmit (Tx) DMA command queue identifying data buffers in memory ready for transmission onto the network, and a receive (Rx) DMA command queue identifying data buffers in memory that are available to receive data incoming from the network. Since transmit and receive buffers are not typically used at even rates, the NIC's local memory may maintain separate transmit and receive queues as well.
The command transfer DMA engine needs to know both the host memory address from which the first data transfer DMA command is to be read, and the address in the local memory to which the first data transfer DMA command is to be written. If there is only a single DMA command queue and a single local store for storing the retrieved data transfer DMA commands, then the peripheral device need only have storage for two address pointers to implement the retrieval of data transfer commands by DMA: the host memory address from which the first data transfer DMA command is to be read (a read pointer), and the address in the local memory to which the first data transfer DMA command is to be written (a write pointer). The storage space required to implement these two pointers is not a stretch in modern technologies. In the NIC situation described above there are two host memory queues and two local stores for storing retrieved data transfer DMA commands, so in this situation storage for four address pointers is needed.
Some NICs implement multiple, e.g. up to about 8 or 16, physical network ports. For these NICs, it may be desirable to implement a separate pair of queues in host memory for each physical port, and a corresponding pair of local stores for each physical port. In this situation storage for up to about 16 or 32 address pointers might be needed. This requirement is not exorbitant, but it would still be desirable to reduce the size of this address pointer storage, if possible, to reduce space utilization.
U.S. Patent Application No. 11/050,476, filed Feb. 3, 2005, entitled “Interrupt Management for Multiple Event Queues” and U.K. Patent Application No. GB0408876A0, filed Apr. 21, 2004, entitled “User-level Stack”, both incorporated herein by reference, both describe architectures in which the operating system supports numerous protocol stacks, each with its own set of transmit and receive data structures, and all assisted by functions performed in hardware on the NIC. The transmit and receive data queues can number in the thousands, with a corresponding number of local stores for storing retrieved data transfer DMA commands. Many thousands of address pointers are required in such an architecture, occupying significant space on an integrated circuit chip. For example, with 4 k Tx DMA command queues and 4 k Rx DMA command queues, and a corresponding number (8 k) of local stores for storing retrieved data transfer DMA commands, storage is required on the NIC for 8 k read pointers and 8 k write pointers. If each local store requires 7 bits to uniquely address each entry (i.e. the store can hold 128 entries), then storage for 56 k bits is required just to hold the write pointers.
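The write pointer storage figure given above can be reproduced with a short calculation. The Python sketch below is illustrative only; the function name write_pointer_storage and the convention that “k” denotes 1024 are assumptions made for the example.

```python
K = 1024  # "k" is taken here to mean 1024


def write_pointer_storage(num_tx_queues, num_rx_queues, entries_per_store):
    """Return (local stores, bits per write pointer, total write pointer bits)."""
    stores = num_tx_queues + num_rx_queues  # one local store per queue
    # Bits needed to address every entry of a store uniquely.
    bits_per_pointer = (entries_per_store - 1).bit_length()
    return stores, bits_per_pointer, stores * bits_per_pointer


stores, bits_per_pointer, total_bits = write_pointer_storage(4 * K, 4 * K, 128)
total_kbits = total_bits // K
```

With 4 k transmit queues, 4 k receive queues and 128-entry local stores, the calculation gives 8 k stores, 7 bits per write pointer, and 56 k bits in total, before any storage for the read pointers is counted.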
Transmission Control Protocol (TCP) was designed to operate over a variety of communication links. Advances in different communication media, including high-bandwidth links, wireless, fiber-optic networks, and satellite, present a situation where there may be an increasingly large discrepancy between the bandwidth capacities of receiving stations. TCP compounds the mismatches between the transmission rate and the various receiving links by sending bursts (or windows) of packets, which affects the throughput, fairness, queue size, and drop rate in network communications.
In a network topology, a transmitting station communicates with multiple receiving stations through different communication links. Different communication links may require different transmission rates beyond which a receiving station will start dropping packets. While TCP can handle packet discard and re-transmission, the result is degradation of the overall system performance. The difference in transmission rate between communication links can be attributed to many factors. For example, the receiving station may be located one or more switches or hubs away from the transmitting station and may be on a slow link, which can be as slow as half-duplex at 10 Mbps. There may be many-to-one congestion at a particular link, i.e. too many nodes transmitting to the same node. The receiving station's TCP stack may not be efficient in handling packets.
A transmit port of a fast NIC is likely to be connected to a fast link and capable of sending packets at much faster rates than some downstream links can process. The Institute of Electrical and Electronics Engineers (IEEE) 802.3x standard specifies a port-based flow control, which works only if a switch supports it, to propagate flow control back to the transmitting port. This type of flow control can create head-of-line blocking and cause all nodes receiving from this transmitting port to be slowed down.
Accordingly, it would be greatly desirable to be able to pace the transmission of packets from a transmitting station to different receiving stations based on the bandwidth of each communication link associated with a particular receiving station, thereby optimizing overall network performance.