To date, network interfaces, which transfer network data to and from a computer, have been designed either as add-ons for personal computers (PCs) and workstations or as part of computers specially designed for parallel computation. While such interfaces have been sufficient in the past, the present tremendous increase in the bandwidth of commercial networks is attracting new applications that use networked PCs and workstations. Such commercial networks, PCs and workstations are far more cost effective than specially designed computer systems, e.g., for parallel computing. However, present network interfaces, particularly for PCs and workstations, do not achieve sufficiently small latencies for such applications.
Note that bandwidth refers to the data transfer rate or the rate at which data is transferred from sender to receiver over the network. Also, for the present purposes, latency refers to the time it takes for data from one location to be transferred and used by a processor, the compute engine of a computer, at a second location, i.e. the delay between the time data is sent until the receiver can act on the transmitted data. Note that the ultimate source and sink of data is a process executing on the sending processor and another process executing on the receiving processor, respectively.
As will be appreciated, part of the end-to-end latency is due to receive overhead in which there is a delay between the arrival of data at a receive node and the time that a process can act on this data. This time includes interrupt handling, processing and copying the data, and kernel traps for a process to read the data. Such receive overhead can be a substantial fraction of the end-to-end latency. In fact, receive overhead in some instances is nearly 90% of the end-to-end latency in conventional operating system implementations of local area networking.
For personal computers/workstations, network interfaces are loosely coupled to the computer memory system by an I/O bus which is distinct from the high speed bus used to couple the memory and processor of the computer. These interfaces are relatively slow, with latencies on the order of 1 msec in a LAN environment, a figure which includes the network, hardware and software components of the latency. In general, the network interface itself is a card which plugs into the I/O bus and is connected to a network.
For parallel computers, network interfaces are tightly integrated into the design of the computer memory system from the start and hence achieve much greater performance, with latencies typically on the order of 1 to 100 usec. However, even a latency on the order of 100 usec precludes some real time and parallel processing applications.
Since in either the PC/workstation environment or in parallel computers the receive overhead contributes so significantly to latency, there is a necessity to improve receive overhead by improving the interface. Especially in the workstation environment where operating system overhead is the major contributor to latency, it is desirable to provide an interface which eliminates operating system intervention. In general, to maximize the class of exploitable parallel computing applications, it is desirable to have end-to-end latencies less than 10 usec.
Presently, there are two main techniques for transferring data from the network to a receiving processor. In one technique, the processor reads individual data words from the network and for each word decides whether to immediately act on the word or to store the word in memory for later use. This technique, called programmed I/O, is relatively slow because the processor fetches each word, with data being transferred at the slow single word access rate of the I/O bus.
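By way of illustration only, the programmed I/O technique described above can be sketched as a loop in which the processor fetches one word at a time and decides, word by word, whether to act on it or to store it. This is a minimal simulation under stated assumptions: the function `programmed_io_receive`, the `is_urgent` rule and the sample words are hypothetical and are not taken from any particular interface.

```python
from collections import deque

def programmed_io_receive(network_fifo, is_urgent, handle_now):
    """Drain a simulated network FIFO one word at a time.

    Each word costs one separate single-word I/O bus access, after which
    the processor decides whether to act on the word or to store it.
    """
    memory = []        # words stored for later use
    bus_accesses = 0   # number of single-word I/O bus reads performed
    while network_fifo:
        word = network_fifo.popleft()  # one single-word bus access
        bus_accesses += 1
        if is_urgent(word):
            handle_now(word)           # act on the word immediately
        else:
            memory.append(word)        # keep the word for later use
    return memory, bus_accesses

acted_on = []
stored, accesses = programmed_io_receive(
    deque([3, 10, 4, 12]),
    is_urgent=lambda w: w >= 10,       # hypothetical decision rule
    handle_now=acted_on.append,
)
```

In this sketch the number of bus accesses equals the number of words received, which is the reason the technique is limited by the single word access rate of the I/O bus.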
In the second technique, called direct memory access or DMA, the network device transmits a block of data words in a single operation directly to the computer memory. In addition to bypassing the word by word examination by the processor, this technique transfers data at the burst or peak speed of the I/O bus. While the block transfer of data at burst speed offers an improvement over programmed I/O, the DMA process still suffers from latency problems due to the time it takes to form large blocks of data and copy them into the main memory.
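The trade-off between the two techniques can be made concrete with a simple timing model. The following is a sketch only: the per-word times and the fixed block cost are hypothetical figures chosen for illustration, and the fixed cost `setup_ns` is an assumed stand-in for the time spent forming a block before it is transferred.

```python
def pio_transfer_time_ns(n_words, single_word_ns):
    """Programmed I/O: every word pays the single-word bus access time."""
    return n_words * single_word_ns

def dma_transfer_time_ns(n_words, burst_word_ns, setup_ns):
    """Block transfer: a fixed per-block cost plus burst-rate transfer."""
    return setup_ns + n_words * burst_word_ns

def break_even_words(single_word_ns, burst_word_ns, setup_ns):
    """Smallest block size for which the block transfer is strictly faster."""
    return setup_ns // (single_word_ns - burst_word_ns) + 1

# Hypothetical figures, not measurements of any real bus.
SINGLE_WORD_NS = 400
BURST_WORD_NS = 100
SETUP_NS = 2000

n = break_even_words(SINGLE_WORD_NS, BURST_WORD_NS, SETUP_NS)
```

Under these assumed figures the block transfer only wins for messages of at least `n` words, which illustrates why a fixed block cost hurts the latency of short messages even though the burst rate is higher.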
The designers of present network interfaces have concentrated on improving bandwidth. While present techniques can achieve exceptionally high data transfer rates, the delays in copying and processing the received data can negate the advantages of high data rates.
More particularly, as to PC and workstation network interfaces, with the recent commercial availability of high bandwidth networks such as FDDI, which operates at 100 Mbps, and Asynchronous Transfer Mode (ATM), which operates at 155 Mbps, and with the promise of 1 Gbps bandwidth networks in the near future, the network interface for PCs and workstations has been the focus of much recent research. It is now fairly well understood how to build network interface hardware and construct operating system software to achieve high bandwidth. As described by Druschel et al. in Network Subsystem Design, IEEE Network, pages 8 to 17, July 1993 and Banks and Prudence, A High-Performance Network Architecture for a PA-RISC Workstation, Journal of Selected Areas in Communications, pages 191-202, February 1993, the key to high bandwidth has been careful attention to minimizing the number of data handling steps performed by the operating system during network data transfers.
One method of achieving high bandwidth is exemplified by a recent Direct Memory Access design, called the Afterburner, which puts a substantial message buffer on the network interface card and integrates this message buffer into the memory hierarchy with the goal of originating and terminating messages in the buffer. The Afterburner design is described by Dalton et al. in an article entitled Afterburner, IEEE Network, pages 36-43, July, 1993. The purpose of the Afterburner system is to eliminate transfers between the network interface card and main memory. However, as will be seen, buffering adds to the end-to-end latency.
The problem with Direct Memory Access-based interfaces, such as the Afterburner, is four-fold. First, the network data must be transferred to and from main memory via the I/O bus which is often slower than the main memory bus.
Second, the main memory, where network data is transferred to and from, is significantly removed in hierarchy from the processor. In today's PCs and workstations, the main memory is often two levels in the memory hierarchy below the processor. A typical state-of-the-art PC or workstation today has a primary cache on the processor chip for frequently accessed data, and another cache, the "secondary" cache, between the processor chip and main memory for less frequently used data. In order for an executing process to act on incoming network data, the data must eventually be loaded into a processor register, which means the data must be loaded from main memory to secondary cache to primary cache. In addition, outgoing messages are frequently generated directly by an executing process, in which case the message data must travel the reverse route through the memory hierarchy. The primary cache generally has an access time of about 5 nsec. The secondary cache generally has an access time in the 20 nsec range and main memory generally has an access time in the 200 nsec range. It will be appreciated that transferring the network data to and from main memory by direct memory access virtually guarantees an additional 200 nsec delay per data word in copying data to the caches from main memory and vice versa.
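The effect of the per-word delay can be illustrated with simple arithmetic. The sketch below uses the access times quoted above and assumes, as a simplification, that words are copied one at a time with no overlap or pipelining; the 64-word message size is a hypothetical example, and the 10 usec budget is the latency target mentioned earlier.

```python
# Approximate access times quoted in the text, in nanoseconds.
ACCESS_TIME_NS = {
    "primary_cache": 5,
    "secondary_cache": 20,
    "main_memory": 200,
}

def copy_delay_ns(n_words, level):
    """Delay to move n_words one word at a time from the given level."""
    return n_words * ACCESS_TIME_NS[level]

def max_words_within(budget_ns, level):
    """Largest message, in words, whose copy delay fits in the budget."""
    return budget_ns // ACCESS_TIME_NS[level]

# Hypothetical 64-word message copied from main memory.
main_memory_delay = copy_delay_ns(64, "main_memory")

# Words that fit in a 10 usec (10,000 nsec) end-to-end budget.
words_in_budget = max_words_within(10_000, "main_memory")
```

Under this simplified model, a 64-word message already spends 12.8 usec on main memory accesses alone, and only 50 words can be copied within a 10 usec budget before any network or software time is counted.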
Third, keeping the cache contents consistent with the main memory contents increases the receive overhead. Since direct memory access transfers data to and from main memory, it is possible for a cache to hold a stale copy of the data. There are two ways to solve this problem. The usual way is to manage the caches manually. Before a direct memory access transfer from main memory to the network interface, the operating system must explicitly flush all data to be transferred out of the cache and back to main memory. Likewise, before a direct memory access transfer from the network interface to main memory, the operating system must explicitly flush out of the cache all data that resides at the same addresses as the data that will arrive from the network. Since the operating system is involved, this manual cache consistency imposes significant overhead, in addition to the copying of data to and from the cache. A less common way to maintain cache consistency is to have the direct memory access hardware copy data to and from the cache in parallel with the transfer to and from main memory. However, this requires both extra hardware and stalling the processor during direct memory access activity. It also has the negative side effect of cache dilution.
Fourth, Direct Memory Access-based interfaces typically use an inefficient message protocol and inefficient operating system structures. The typical message protocol is addressless, meaning that a message is inserted in a queue on arrival; consequently the operating system must intervene, which adds significant overhead, and must usually copy the data.
An additional problem, specific to the Afterburner approach, is that the message buffer on the I/O bus is a shared limited resource, which presents resource management and sharing issues. This problem may be mitigated by making the buffer sufficiently large, but this is not cost effective.
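The stale-copy problem noted under the third point above can be shown with a toy model. This is a sketch only: the class `Node`, its methods and the address `0x100` are hypothetical, the cache is modeled as a simple dictionary, and only the receive direction (network to main memory) is shown.

```python
class Node:
    """Toy model of a processor cache in front of main memory."""

    def __init__(self):
        self.main_memory = {}
        self.cache = {}

    def cpu_read(self, addr):
        # A miss fills the cache from main memory; a hit returns the
        # cached copy without consulting main memory.
        if addr not in self.cache:
            self.cache[addr] = self.main_memory[addr]
        return self.cache[addr]

    def dma_write(self, addr, value):
        # The transfer goes straight to main memory and bypasses the cache.
        self.main_memory[addr] = value

    def flush(self, addr):
        # Explicit operating system step: discard the cached copy.
        self.cache.pop(addr, None)

# Without a flush, the processor keeps seeing the old value.
unmanaged = Node()
unmanaged.main_memory[0x100] = "old"
unmanaged.cpu_read(0x100)
unmanaged.dma_write(0x100, "new")
stale_value = unmanaged.cpu_read(0x100)

# With a flush before the transfer, the next read sees the new data.
managed = Node()
managed.main_memory[0x100] = "old"
managed.cpu_read(0x100)
managed.flush(0x100)
managed.dma_write(0x100, "new")
fresh_value = managed.cpu_read(0x100)
```

The extra `flush` call stands for the operating system work that manual cache management adds to every transfer.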
As to parallel computers which are specially designed for parallel computing from the start, there has always been careful attention paid to achieving high bandwidth and low latency. The I/O bus network interface approach, as described for PCs and workstations above, was used in some machines like the Intel IPSC/i860 but is now mostly abandoned due to its high latency. Recent design and implementation work has concentrated on network interfaces higher up the memory hierarchy at either the cache or register level. The main example of the latter is the *T machine, described by Beckerle in an article entitled An Overview of the *T Computer System, COMPCON, 1993, and implemented by Motorola, in which messages are received directly into registers on the processor chip. Although this approach achieves very low latency, it requires extensive modification of the processor chip.
Rather than coupling directly to processor registers, designs at the cache level span a continuum from simple cache level buffers to communication coprocessors. A very simple cache level interface, consisting of a message buffer addressed via the cache bus, is suggested by D. Henry and C. Joerg in an article entitled A tightly-coupled Processor Network Interface, published in Proc. of Fifth Int'l Conf. on Architectural Support for Programming Languages and Operating Systems, pp. 111-122, October 1992. This interface suffers from the same problem as the Afterburner message buffer interface. Since it is a small, globally shared resource, it presents a resource management and sharing problem. In the middle, the Thinking Machines CM-5 maps the network to the memory bus. While straightforward and simple, this approach does not attain particularly low latency. At the other extreme are coprocessor-based approaches, such as those used in the Meiko CS-2 and the Intel Paragon parallel machines. These coprocessors are fully general processors that offload from the main processor message send and receive duties such as message formatting, interrupt handling, protection, and memory mapping. Further, with the exception of protection and memory mapping, the MIT Alewife machine implements similar functions in a hardware finite state machine rather than a full processor. These coprocessor approaches are expensive in terms of hardware.
In order to minimize both cost and latency, the Fujitsu AP1000 attempts to integrate the network interface into the cache: messages are sent from the cache but are received into a separate buffer as described shortly. This technique is described by Shimizu, Horie, and Ishihata in an article entitled Low Latency Message Communication Support for the AP1000, in Int'l Symposium on Computer Architecture, pages 288-297, May 1992.
As to the send operation for the AP1000, a message is composed in the cache and is then sent by transferring the cache line directly to the network by direct memory access. Without changing the processor to support sending directly from registers, there is little one can do to improve on this "cache line sending" technique.
For the receive operation, rather than utilizing traditional Direct Memory Access techniques, the Fujitsu system utilizes a circular FIFO buffer coupled to the network and messages which incorporate a message ID number and the relevant data. The circular buffer is coupled to a cache in an attempt to integrate the network interface into the cache. However, the messages are not retrieved by address but rather by message ID number. The messages arrive and are stored in the circular buffer along with the message ID number. During message retrieval, a message is accessed by the message ID number. Thus it is first necessary for the Fujitsu system to search the buffer for the message ID number. If the message ID number is found, then it is possible to ascertain the buffer position and read out the data from that position. The result is that while it is possible to couple data rapidly to the cache bus, it is indeed an extremely slow process to receive data.
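The cost of retrieval by message ID, as described above, can be sketched as follows. This is an illustrative model rather than the AP1000 implementation: the class `CircularReceiveBuffer`, its capacity and the message IDs are hypothetical, the search is assumed to be a linear scan, and reclaiming of slots is omitted.

```python
class CircularReceiveBuffer:
    """Toy circular buffer holding (message ID, data) entries."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.next_slot = 0  # position where the next arrival is written

    def store(self, msg_id, data):
        # An arriving message is written at the next position, wrapping
        # around (and overwriting the oldest entry) when the end is reached.
        self.slots[self.next_slot] = (msg_id, data)
        self.next_slot = (self.next_slot + 1) % len(self.slots)

    def find(self, msg_id):
        """Search for msg_id; return (data, number of entries examined)."""
        examined = 0
        for entry in self.slots:
            if entry is None:
                continue
            examined += 1
            if entry[0] == msg_id:
                return entry[1], examined
        raise KeyError(msg_id)

buf = CircularReceiveBuffer(capacity=8)
for i in range(6):
    buf.store(100 + i, f"payload-{i}")

first_data, first_cost = buf.find(100)  # found at the first entry
last_data, last_cost = buf.find(105)    # found only after scanning six
```

In this model the work needed to receive a message grows with the number of messages already buffered, since the position of the data is not known until the matching ID has been found.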
In summary, each AP1000 processor node has a circular buffer connected to the cache bus for receiving messages. This is in essence the same concept as the cache level message buffer in the above-mentioned article by Henry and Joerg, except that the AP1000 requires a latency-intensive search through the receive buffer to find a matching message, thus negating any latency gains otherwise achievable.
There are also three additional problems with separate cache-level message buffers:
First, as to buffer management, since the buffer is a finite sized resource shared by all communicating processes, there are the usual problems of reclaiming buffers, ensuring fair distribution of buffer blocks amongst all processes, and buffer overflow. Pressure arises from the need to keep the buffer size rather small in order to be suitably fast and not inordinately expensive.
Second, as to integration with process address space, because of the difficulties in integrating a small shared buffer into a page-based protection scheme, the message buffer has to sit outside the process address space. This poses a number of protection issues, such as how to prevent a process from reading or writing on top of messages for another process.
Third, as to performance, to send or receive data, an application has to transfer the data to or from the message buffer. This means an extra copy step. While the access time of the message buffer is likely to be quite small, the application code must be organized to copy such messages when needed, and the actual copy will require main memory accesses if there are no free cache blocks.
By way of further background, note that parallel computers differ in two very important ways from PCs and workstations. First, they are often single user environments, so no protection is necessary to shield users from the accidental or intentional malice of another user. Second, the network in a parallel computing machine is private to the machine, so the network may be manipulated to control protection issues, as in the CM-5. In contrast, the PC and workstation environment is a multi-user environment and the network is public. Thus the network interface must provide protection that contains the accidental or intentional malice of one user and shields the others from it. It is therefore important that a network interface design improve on the Fujitsu AP1000 cache interface while at the same time guaranteeing a protected, multi-user environment which connects to a public network.
Note that another important direction for achieving low latency in parallel computing is to incorporate the address of the message handler in the message itself, obviating the need for buffering of the message data at the receiver. This technique is described in Active Messages: A Mechanism for Integrated Communication and Computation by von Eicken et al., Proceedings of the 19th Annual International Symposium on Computer Architecture, May 1992. This paper describes an exclusively software approach not involving caches.
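The idea of carrying the handler address in the message can be sketched in a few lines. This is a loose illustration and not the implementation from the cited paper: a Python function reference stands in for the handler address, and the names `make_message`, `receive` and `add_to_sum` are hypothetical.

```python
def make_message(handler, payload):
    """Build a message that names its own handler."""
    return {"handler": handler, "payload": payload}

def receive(message):
    # The receiver does not queue the message or search for a consumer.
    # It invokes the handler named in the message on the payload at once.
    return message["handler"](message["payload"])

totals = {"sum": 0}

def add_to_sum(value):
    """Example handler: fold the payload into the running computation."""
    totals["sum"] += value
    return totals["sum"]

receive(make_message(add_to_sum, 5))
receive(make_message(add_to_sum, 7))
```

Because each message is consumed by its handler on arrival, no receive-side message buffer appears in this sketch.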