Low latency of messages in a network is typically linked to high end-to-end performance for applications. However, certain computer applications, such as network memory or file access, require both low latency for large messages and high bandwidth or throughput under load in order to perform optimally. Reconciling these conflicting demands requires careful attention to data movement across data buses, network interfaces, and network links.
One method of achieving high data throughput is to send larger data packets and thus reduce per-packet overheads. On the other hand, a key technique for achieving low latency is to fragment data packets or messages and pipeline the fragments through the network, overlapping transfers on the network links and I/O buses. Since it is not possible to do both at once, messaging systems must select which strategy to use.
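The trade-off described above can be illustrated with a small model. The following is only a sketch under stated assumptions: each pipeline stage (for example, sender I/O bus, network link, receiver I/O bus) is assumed to cost a fixed per-fragment overhead plus a per-byte time, a stage handles one fragment at a time, and a fragment must finish one stage before entering the next. The function name `pipeline_latency` and all numeric values are hypothetical and are not taken from any of the systems discussed here.

```python
def pipeline_latency(message_bytes, fragment_bytes, stages):
    """Return the time at which the last fragment leaves the last stage.

    stages is a list of (overhead_us, us_per_byte) pairs, one per stage.
    The linear cost model and the values passed in are assumptions.
    """
    # Split the message into fragments; the last one may be shorter.
    sizes = []
    remaining = message_bytes
    while remaining > 0:
        size = min(fragment_bytes, remaining)
        sizes.append(size)
        remaining -= size

    # done[i] is the time at which stage i finished its most recent fragment.
    done = [0.0] * len(stages)
    for size in sizes:
        ready = 0.0  # time this fragment becomes available to the next stage
        for i, (overhead, per_byte) in enumerate(stages):
            start = max(ready, done[i])  # wait for the data and for the stage
            done[i] = start + overhead + per_byte * size
            ready = done[i]
    return done[-1]


if __name__ == "__main__":
    # Three identical hypothetical stages: 2 us overhead, 0.01 us per byte.
    example_stages = [(2.0, 0.01)] * 3
    for fragment in (8192, 1024, 64):
        latency = pipeline_latency(8192, fragment, example_stages)
        print(f"fragment={fragment:5d} bytes  latency={latency:.2f} us")
```

Under these assumed numbers, a moderate fragment size gives lower latency than sending the message whole, because the stages overlap, while very small fragments lose again because the fixed overhead is paid once per fragment at every stage. That same per-fragment overhead is what reduces throughput under load.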
It is therefore desirable to automatically adapt a fragmentation policy along the continuum between low latency and high bandwidth based on the characteristics of system hardware, application behavior, and network traffic.
A number of prior art systems have used fragmentation/reassembly to reduce network latency on networks whose architecture utilizes a fixed delimited transmission unit, such as ATM (asynchronous transfer mode). Fixed fragmentation is a common latency reduction scheme used in the following systems: APIC, Fast Messages (FM), Active Messages (AM), and Basic Interface for Parallelism (BIP). PM (discussed below) appears to utilize a form of variable fragmentation only on packet transmission. Variable and hierarchical fragmentation were theoretically explored in Wang et al. In contrast, a technique termed cut-through delivery was developed in the Trapeze Myrinet messaging system. Cut-through delivery is a non-static variable fragmentation scheme, meaning fragment sizes can vary at each stage in a pipeline. The following prior-art discussion describes messaging software developed for a Myrinet gigabit network unless otherwise noted.
The design of the APIC (ATM Port Interconnect Controller) network interface card (NIC) specifies implementation of full AAL-5 segmentation and reassembly (SAR) on-chip. The APIC NIC uses fixed-size fragmentation at the cell granularity (48 bytes of data), so it does not store and forward entire frames. Moreover, APIC does not adapt to host/network architectures or to changing conditions on the host or network. (See generally, Dittia et al., The APIC Approach to High Performance Network Interface Design: Protected DMA and Other Techniques, Proceedings of INFOCOM'97, April, 1997.)
Fast Messages (FM) utilizes fixed-size fragmentation when moving data between the host, NIC, and network link in order to lower the latency of large messages. Though FM uses a streaming interface that allows a programmer to manually pipeline transfers in variably sized fragments to and from host API buffers, API data moves to and from the network interface card in fixed-size fragments. Thus, it is the programmer's task to pipeline packets by making multiple calls to the application programming interface (API). FM lacks the ability to adapt automatically and transparently to changing host and network characteristics. (See generally, Lauria et al., Efficient Layering for High Speed Communication: Fast Messages 2.x, IEEE, July, 1998.)
Active Messages (AM) uses a fixed-size fragmentation scheme to reduce the latency of medium to large packets. Active Messages, however, is non-adaptive and utilizes store-and-forward for non-bulk packets as a means of increasing throughput.
Basic Interface for Parallelism (BIP) performs static fixed-size fragmentation on the adapter. BIP, however, adjusts the fragment size depending on the size of the entire packet. When a packet is sent, the fragment size is determined by a table look-up indexed by the packet's length. BIP, while statically adaptive to packet size, does not adjust dynamically to changing host and network characteristics. (See generally, Prylli et al., Modeling of a High Speed Network Throughput Performance: The Experience of BIP over Myrinet, September 1997.)
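A length-indexed table look-up of the kind attributed to BIP above might be sketched as follows. This is an illustration only, not BIP's actual code: the thresholds, the fragment sizes, and the names `LIMITS`, `FRAGMENT_SIZES`, and `fragment_size_for` are all hypothetical.

```python
import bisect

# Hypothetical table: packets of up to LIMITS[i] bytes use FRAGMENT_SIZES[i].
# The final entry of FRAGMENT_SIZES covers every packet larger than LIMITS[-1].
LIMITS = [1024, 8192, 65536]
FRAGMENT_SIZES = [256, 1024, 4096, 8192]


def fragment_size_for(packet_len):
    """Pick a fragment size from the static table using only the packet length."""
    index = bisect.bisect_left(LIMITS, packet_len)
    return FRAGMENT_SIZES[index]
```

Because the table is fixed in advance, the chosen size depends only on the packet length. Nothing in the look-up observes load on the host or the network, which is the limitation noted above.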
The Real World Computing Partnership has developed a messaging package, also for Myrinet, called PM, which implements fragmentation on the adapter for sending in a technique they term immediate sending. Double buffering is used for receiving. It is unclear from their current documents exactly what form of fragmentation constitutes immediate sending, but it appears to be a form of variable fragmentation. Moreover, their technique is limited, since PM claims it is not possible to perform immediate sending on the reception of a packet. (See generally, Tezuka et al., PM: An Operating System Coordinated High Performance Communication Library, Real World Computing Partnership.)
In a theoretical approach, Wang et al. examine variable-sized and hierarchical fragmentation pipelining strategies. Hierarchical fragmentation is one scheme in which the fragmentation schedule may change in different pipeline stages; it is not a static pipelining method. The theory rests on different assumptions than the present invention, adaptive message pipelining (AMP). Wang et al. assume that the values g_i (fixed transfer overhead) and G_i (time per unit of data) are fixed and known in advance, so that both static and non-static pipeline schedules can be computed beforehand; such schedules therefore cannot adapt to changing conditions. Nor do Wang et al. consider throughput as a goal in any of the pipelining strategies they study. (See generally, Modeling and Optimizing Communication Pipelines, Proceedings of ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS, June 1998.)
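To make the role of the two parameters concrete, the following is an illustrative formulation only, not necessarily the one used by Wang et al. It assumes that stage i takes a fixed overhead plus a time proportional to the fragment size s, that a message of M units is split into k equal fragments, that each of the n stages handles one fragment at a time, and that buffering between stages is unconstrained. The symbols t_i, T, M, k, and n are introduced here for illustration.

```latex
% Assumed linear cost of moving one fragment of size s through stage i
t_i(s) = g_i + G_i \, s

% Completion time for k equal fragments of size M/k through n stages:
% the first fragment traverses every stage, and each later fragment
% adds one service time of the slowest (bottleneck) stage.
T(k) = \sum_{i=1}^{n} t_i\!\left(\frac{M}{k}\right)
       + (k - 1) \max_{1 \le i \le n} t_i\!\left(\frac{M}{k}\right)
```

With g_i and G_i fixed and known, the k that minimizes T(k) can be computed once, ahead of time. If the effective values drift with load, a schedule computed in this way no longer matches the actual pipeline.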
Cut-through delivery, disclosed in a previously published paper, is a variable-sized fragmentation scheme in which the schedules are not static for any stage of a pipeline. Cut-through delivery alone, however, is unable to adjust to inter-packet pipelining and therefore cannot extract the maximum bandwidth from the underlying system hardware. (Yocum et al., Cut-Through Delivery in Trapeze: An Exercise in Low-Latency Messaging, Proc. of Sixth IEEE International Symposium on High Performance Distributed Computing, August, 1997.)
The approach of the present invention differs from the prior art in several ways. First, pipelining is implemented on the network interface card (NIC), transparently to the hosts and host network software. Second, selection of transfer sizes at each stage is automatic, dynamic, and adaptive to congestion conditions encountered within the pipeline. The schedules, therefore, are variable and non-static. Third, the user of the API does not need to know anything about the hardware or network characteristics or load in order to achieve both low latency and high bandwidth.