In data networks it is important to enable efficient and reliable transfer of data between devices. Data can only reliably be transferred over a connection between two devices at the rate that a bottleneck in the connection can deal with. For example, a switch in a TCP/IP configured connection may be able to pass data at a speed of 10 Mbps while other elements of the connection can pass data at, say, 100 Mbps. The lowest data rate determines the maximum overall rate for the connection which, in this example, would be 10 Mbps. If data is transmitted between two devices at a higher speed, packets will be dropped and will subsequently need to be retransmitted. If more than one link is combined over a connector such as a switch, then the buffering capacity of the connector needs to be taken into account in determining maximum rates for the links, otherwise data loss could occur at the connector.
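By way of illustration only, the bottleneck rule described above can be expressed as a short calculation. The function name and the list of rates below are assumptions chosen to match the example figures given above; they do not form part of any described system.

```python
def max_connection_rate(element_rates_mbps):
    """Return the highest rate (in Mbps) that the whole connection can sustain.

    The slowest element on the path is the bottleneck, so the result is
    simply the minimum of the per-element rates.
    """
    if not element_rates_mbps:
        raise ValueError("a connection needs at least one element")
    return min(element_rates_mbps)


# Hypothetical path: a 10 Mbps switch between two 100 Mbps elements.
path = [100, 10, 100]
bottleneck = max_connection_rate(path)
```

Transmitting above the value of `bottleneck` would, as noted above, lead to dropped packets and retransmission.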
FIG. 1 shows the architecture of a typical networked computing unit 1. Block 6 indicates the hardware domain of the computing unit. In the hardware domain the unit includes a processor 2 which is connected to a program store 4 and a working memory 5. The program store stores program code for execution by the processor 2 and could, for example, be a hard disc. The working memory could, for example, be a random access memory (RAM) chip. The processor is connected via a network interface card (NIC) 2 to a network 10. Although the NIC is conventionally termed a card, it need not be in the form of a card: it could for instance be in the form of an integrated circuit or it could be incorporated into the hardware that embodies the processor 2. In the software domain the computing unit implements an operating system 8 which supports an application 9. The operating system and the application are implemented by the execution by the processor 2 of program code such as that stored in the program store 4.
When the computing unit receives data over the network, that data may have to be passed to the application. Conventionally the data does not pass directly to the application. One reason for this is that it may be desired that the operating system polices interactions between the application and the hardware. As a result, the application may be required to interface with the hardware via the operating system. Another reason is that data may arrive from the network at any time, but the application cannot be assumed always to be receptive to data from the network. The application could, for example, be de-scheduled or could be engaged on another task when the data arrives. It is therefore necessary to provide an input mechanism (conventionally an input/output (I/O) mechanism) whereby the application can access received data.
FIG. 2 shows an architecture employing a standard kernel TCP transport (TCPk). The operation of this architecture is as follows.
On packet reception from the network, interface hardware 101 (e.g. a NIC) transfers data into a pre-allocated data buffer (a) and invokes an interrupt handler in the operating system (OS) 100 by means of an interrupt line (step i). The interrupt handler manages the hardware interface. For example, it can indicate available buffers for receiving data by means of post( ) system calls, and it can pass the received packet (for example an Ethernet packet) and identify protocol information. If a packet is identified as destined for a valid protocol e.g. TCP/IP it is passed (not copied) to the appropriate receive protocol processing block (step ii).
TCP receive-side processing then takes place and the destination port is identified from the packet. If the packet contains valid data for the port then the packet is enqueued on the port's data queue (step iii) and the port is marked as holding valid data. Marking could be performed by means of a scheduler in the OS 100, and it could involve awakening a blocked process such that the process will then respond to the presence of the data.
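The demultiplexing step described above might be sketched as follows. This is a simplified illustration under stated assumptions: the class `PortTable`, its method names and the use of a set to mark ports as holding valid data are all hypothetical, and real TCP receive processing involves considerably more state.

```python
from collections import deque


class PortTable:
    """Hypothetical per-port data queues for received payloads."""

    def __init__(self):
        self.queues = {}    # destination port -> queue of payloads
        self.ready = set()  # ports marked as holding valid data

    def open(self, port):
        self.queues[port] = deque()

    def deliver(self, dst_port, payload):
        """Enqueue a payload on its port's queue and mark the port.

        Returns False if the port is unknown or the payload is empty.
        The payload object is enqueued by reference, not copied.
        """
        queue = self.queues.get(dst_port)
        if queue is None or not payload:
            return False
        queue.append(payload)
        self.ready.add(dst_port)  # a scheduler could wake a blocked process here
        return True
```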
In some circumstances the TCP receive processing may require other packets to be transmitted (step iv), for example where previously transmitted data needs to be retransmitted or where previously enqueued data can now be transmitted, perhaps because the TCP transmit window (discussed below) has increased. In these cases packets are enqueued with the OS Network Driver Interface Specification (“NDIS”) driver 103 for transmission.
In order for an application to retrieve data from a data buffer it must invoke the OS Application Program Interface (API) 104 (step v), for example by means of a call such as recv( ), select( ) or poll( ). These calls enable the application to check whether data for that application has been received over the network. A recv( ) call initially causes copying of the data from the kernel buffer to the application's buffer. The copying enables the kernel of the OS to reuse the buffers which it has allocated for storing network data, and which have special attributes such as being DMA accessible. The copying can also mean that the application does not necessarily have to handle data in the units provided by the network, does not need to know a priori the final destination of the data, and does not have to pre-allocate buffers to be used for data reception.
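The calls mentioned above have counterparts in the Python standard library, which can be used to illustrate the sequence. In the following minimal sketch a locally connected socket pair is assumed to stand in for a network connection, and the 4096-byte limit and one-second timeout are arbitrary illustrative values.

```python
import select
import socket

# A connected pair of local sockets stands in for a network connection.
sender, receiver = socket.socketpair()
sender.sendall(b"hello")

# select() reports whether data is waiting, without consuming it.
readable, _, _ = select.select([receiver], [], [], 1.0)

received = b""
if receiver in readable:
    # recv() copies up to 4096 bytes out of the kernel's buffer into a
    # new bytes object owned by the application.
    received = receiver.recv(4096)

sender.close()
receiver.close()
```

Because the data is copied, the application here does not supply a pre-allocated buffer; the kernel's own buffer is free for reuse once recv( ) returns.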
It should be noted that on the receive side there are at least two distinct threads of control which interact asynchronously: the up-call from the interrupt and the system call from the application (described in co-pending application WO2005/074611). Many operating systems will also split the up-call to avoid executing too much code at interrupt priority, for example by means of “soft interrupt” or “deferred procedure call” techniques.
The send process behaves similarly except that there is usually one path of execution. The application calls the operating system API 104 (e.g. using a send( ) call) with data to be transmitted (step vi). This call copies data into a kernel data buffer and invokes TCP send processing. Here protocol is applied and fully formed TCP/IP packets are enqueued with the interface driver 103 for transmission.
If successful, the system call returns with an indication of the data scheduled (by the hardware 101) for transmission. However, there are a number of circumstances where data does not become enqueued by the network interface device. For example, the transport protocol may queue pending acknowledgements from the device to which it is transmitting, or pending window updates (discussed below), and the device driver 103 may queue in software pending data transmission requests to the hardware 101.
A third flow of control through the system is generated by actions which must be performed on the passing of time. One example is the triggering of retransmission algorithms. Generally the operating system 100 provides all OS modules with time and scheduling services (typically driven by interrupts triggered by the hardware clock 102), which enable the TCP stack to implement timers on a per-connection basis. Such a hardware timer is generally required in a user-level architecture, since then data can be received at a NIC without any thread of an application being aware of that data. In addition to a hardware timer of this type, timers can be provided (typically in software) to ensure that protocol processing advances.
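A per-connection timer service of the general kind described above could, for instance, be modelled as a priority queue of expiry times examined on each clock tick. The sketch below is an assumption for illustration: the class `TimerQueue`, its methods and the use of explicit time values in place of a hardware clock are hypothetical.

```python
import heapq


class TimerQueue:
    """Hypothetical per-connection timeouts examined on each clock tick."""

    def __init__(self):
        self._heap = []  # entries are (expiry_time, connection_id)

    def arm(self, now, timeout, conn_id):
        """Start a timer for conn_id that expires `timeout` seconds after `now`."""
        heapq.heappush(self._heap, (now + timeout, conn_id))

    def tick(self, now):
        """Return, in expiry order, the connections whose timers have expired."""
        expired = []
        while self._heap and self._heap[0][0] <= now:
            _, conn_id = heapq.heappop(self._heap)
            expired.append(conn_id)  # e.g. trigger retransmission for conn_id
        return expired
```

In such an arrangement `tick` would be called from the clock-driven flow of control, independently of the receive and send paths.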
The setting of a software timer for ensuring the advance of protocol processing can impact on the efficiency of data transfer over the network. The timer can for example be instructed by a transport protocol library of the application to start counting when a new packet is delivered from the NIC to the transport protocol library. On expiry of a timeout, the timer causes an event to be delivered to an event queue in the kernel, for example by issuing an event from the NIC 101, the event identifying an event queue in the OS. At the same time as the event is delivered, an interrupt is scheduled to be delivered to the OS. According to the interrupt moderation rules in force, an interrupt is raised and the OS will start to execute device driver code to process events in the event queue. Thus, the software timer can be arranged to trigger protocol processing of data received at the data processor over the network, or to trigger protocol processing of data for transmission over the network. Such a timer preferably causes the kernel to be invoked relatively soon (for example within 250 ms) after the receipt of data at the NIC.
FIG. 3a illustrates a conventional synchronous I/O mechanism. An application 32 running on an OS 33 is supported by a socket 50 and a transport library 36. The transport library has a receive buffer 51 allocated to it. The buffer could be an area of memory in the memory 5 shown in FIG. 1. When data is received by the NIC 31 it writes that data to the buffer 51. When the application 32 wants to receive the data it issues a receive command (recv) to the transport library via the socket 50. In response, the transport library transmits to the application a message that includes the contents of the buffer. This involves copying the contents of the buffer into the message and storing the copied contents in a buffer 52 of the application. In response to obtaining this data, the application may cause messages to be issued, such as an acknowledgement to the device which transmitted the data. A problem with this I/O mechanism is that if the application fails to service the buffer often enough then the buffer 51 can become full, as a consequence of which no more data can be received.
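The synchronous mechanism, including the buffer-full condition, can be modelled in a few lines. The following is a simplified sketch only; the class `ReceiveBuffer` and its method names are assumptions, and a byte array stands in for the receive buffer 51.

```python
class ReceiveBuffer:
    """Hypothetical fixed-capacity receive buffer serviced by recv()."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()

    def nic_write(self, payload):
        """Store received data; return False if the buffer cannot hold it."""
        if len(self.data) + len(payload) > self.capacity:
            return False  # buffer full: no more data can be received
        self.data += payload
        return True

    def recv(self, max_bytes):
        """Copy up to max_bytes out to the application and free the space."""
        chunk = bytes(self.data[:max_bytes])  # the copy into the application's buffer
        del self.data[:max_bytes]
        return chunk
```

If recv( ) is not called often enough, `nic_write` begins to return False, which corresponds to the problem noted above.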
FIG. 3b illustrates a conventional asynchronous I/O mechanism. This mechanism avoids the overhead of copying the data by transferring ownership of buffers between the transport library and the application. Before data is to be received, the application 32 has a set of buffers (B1-B3) allocated to it. It then passes ownership of those buffers to the transport library 36 by transmitting to the transport library one or more post( ) commands that specify those buffers. When data is received it is written into those buffers. When the application wants to access the data it takes ownership of one or more of the buffers back from the transport library. This can be done using a gather( ) command that specifies the buffers whose ownership is to be taken back. The application can then access those buffers directly to read the data. A problem with this I/O arrangement is that the amount of data that is collected when the gather( ) command is executed could be very large, if a large amount of buffer space has been allocated to the transport library, and as a result the application may need considerable time to process that data.
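The transfer of buffer ownership may be sketched as follows. This is an illustrative model under stated assumptions: the class `TransportLibrary` and the method `on_receive` are hypothetical, and for brevity `gather` here returns every filled buffer rather than taking a list of specified buffers.

```python
class TransportLibrary:
    """Hypothetical model of post()/gather() buffer ownership transfer."""

    def __init__(self):
        self.owned = []   # empty buffers currently owned by the library
        self.filled = []  # (buffer, valid_length) pairs awaiting gather()

    def post(self, buffers):
        """The application passes ownership of its buffers to the library."""
        self.owned.extend(buffers)

    def on_receive(self, payload):
        """Write received data directly into the next posted buffer."""
        if not self.owned or len(payload) > len(self.owned[0]):
            return False
        buf = self.owned.pop(0)
        buf[:len(payload)] = payload  # written in place; buffer size unchanged
        self.filled.append((buf, len(payload)))
        return True

    def gather(self):
        """The application takes back ownership of all filled buffers."""
        returned, self.filled = self.filled, []
        return returned
```

The buffers returned by `gather` are the same objects that were posted, so no copy is made; the amount returned grows with the number of buffers that have been filled since the previous call.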
Thus, with both of these mechanisms problems can arise if the application services the buffers at too fast or too slow a rate. If the buffers are serviced too infrequently then they can become full (in which case the reception of data must be suspended) or the amount of data that is returned to the application when the buffers are serviced could be very large. However, if the buffers are serviced too frequently then there will be excessive communication overheads between the application and the transport library as messages are sent between the two. One way of addressing these problems is to arrange for the transport library to set a timer that, on reaching a timeout, triggers the operating system to assist in processing any received data. This is particularly useful in the case of a user-level network architecture, where the transport library is normally driven by synchronous I/O calls from the application. The timer could, for example, run on the NIC. This mechanism can improve throughput but it has the disadvantage that it involves interrupts being set to activate the operating system to process the data. Processing interrupts involves overhead, and there may also only be a limited number of interrupts available in the system.
There is therefore a need for a mechanism which can increase the efficiency with which data can be protocol processed.