The present invention relates generally to computer systems, and more particularly to techniques for implementing communication protocols in operating systems.
FIG. 1 illustrates how a FreeBSD operating system processes Internet protocol (IP) packets received from an Ethernet network. This protocol processing organization is referred to as the "BSD approach" because it is derived from that used in the Berkeley Software Distribution (BSD) Unix operating system. With minor variations, the organization shown in FIG. 1 is found in many other operating systems, including cases of protocol families other than IP and networks other than Ethernet.
In the BSD approach, incoming packets are processed at software interrupt level, at a priority higher than that of any application. Input protocol processing is not scheduled and is charged to the interrupted application, even if that application is unrelated to the received packets. This leads to two undesirable consequences. First, high receive loads, e.g., due to a server "hot spot" or a denial-of-service attack, can leave the system unable to run any application. This is the so-called "receive livelock" problem as described in, e.g., J. Mogul and K. K. Ramakrishnan, "Eliminating receive livelock in an interrupt-driven kernel," Proceedings of Annual Tech. Conf., USENIX, 1996. Second, because protocol processing of received packets is unscheduled, the system cannot enforce CPU allocations and thus cannot provide quality of service (QoS) guarantees to applications.
As shown in FIG. 1, in FreeBSD, arrival of an IP packet causes a hardware interrupt that transfers central processing unit (CPU) control to a network interface driver 10. The driver 10 retrieves the packet from the corresponding network interface hardware, prepares the hardware for receiving a future packet, and passes the received packet to an ether_input routine 12. The ether_input routine 12 places the packet in an IP input queue 14 without demultiplexing, i.e., all IP packets go into the same input queue 14. The ether_input routine 12 then issues a network software interrupt. This software interrupt has a priority higher than that of any application, but lower than that of the hardware interrupt.
FreeBSD handles the network software interrupt by dequeuing each packet from the IP input queue 14 and calling an ip_input routine 15. The ip_input routine 15 performs a checksum on the packet's IP header and submits the packet to preliminary processing operations such as, e.g., firewalling 16 and/or network address translation (NAT) 18, if configured in the system, and IP options 20, if present in the packet header. This preliminary processing may drop, modify, or forward the packet.
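By way of illustration only, the interrupt-level receive path described above may be modeled as follows. This is a minimal sketch in Python, not FreeBSD kernel code. The names ip_input_queue, softint_pending, and network_softint are hypothetical, packets are represented as plain dictionaries, and the preliminary processing performed by the ip_input routine 15 is replaced by a caller-supplied function.

```python
from collections import deque

ip_input_queue = deque()      # models the single shared IP input queue 14
softint_pending = False       # models a posted network software interrupt


def ether_input(packet):
    """Queue the packet without examining its destination socket."""
    global softint_pending
    ip_input_queue.append(packet)
    softint_pending = True    # serviced later, above application priority


def network_softint(ip_input):
    """Software-interrupt handler: drain the queue through ip_input."""
    global softint_pending
    softint_pending = False
    handled = 0
    while ip_input_queue:
        ip_input(ip_input_queue.popleft())
        handled += 1
    return handled


# Packets for unrelated sockets share one FIFO queue.
ether_input({"proto": "udp", "dport": 53})
ether_input({"proto": "tcp", "dport": 80})
seen = []
count = network_softint(seen.append)
```

In this model every queued packet is handled before any application code runs, regardless of which socket, if any, will eventually receive it.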
The ip_input routine 15 then checks the packet's destination IP address. If that address is the same as one of the host's addresses, the ip_input routine 15 jumps to its ip_input_ours label 21, reassembles the packet, and passes the packet to the input routine of the higher-layer protocol selected in the packet header, e.g., transmission control protocol (TCP) input routine 22-1, user datagram protocol (UDP) input routine 22-2, IP in IP tunneling (IPIP) input routine 22-4, resource reservation protocol (RSVP) input routine 22-5, Internet group management protocol (IGMP) input routine 22-6, Internet control message protocol (ICMP) input routine 22-7, or, for other protocols implemented by a user-level application, raw IP (RIP) input routine 22-3. Otherwise, if the destination is a multicast address, the ip_input routine 15 submits the packet to a higher-layer protocol, for local delivery, and to the ip_mforward routine 24, if the system is configured as a multicast router. Finally, if the destination IP address matches neither one of the host's addresses nor a multicast address, and the system is configured as a gateway, the ip_input routine 15 submits the packet to the ip_forward routine 26; otherwise, the ip_input routine 15 drops the packet. The ip_mforward routine 24, ip_forward routine 26, and one or more of the routines 22-1 may make use of the ip_output routine 27.
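The destination check described above may be summarized by the following sketch. It assumes a hypothetical helper, classify_destination, that returns the names of the actions taken rather than performing them, and it uses Python's standard ipaddress module to test for multicast addresses. Reassembly and the selection of a particular higher-layer input routine are omitted.

```python
import ipaddress


def classify_destination(dst, host_addrs, is_gateway=False, is_mrouter=False):
    """Return the actions the ip_input routine would take for address dst."""
    if dst in host_addrs:
        return ["ip_input_ours"]          # reassemble, pass to higher layer
    if ipaddress.ip_address(dst).is_multicast:
        actions = ["local_delivery"]
        if is_mrouter:
            actions.append("ip_mforward")  # only on a multicast router
        return actions
    if is_gateway:
        return ["ip_forward"]
    return ["drop"]
```

For example, with host_addrs = {"192.0.2.1"}, a packet addressed to 198.51.100.7 is forwarded only when is_gateway is true and is otherwise dropped.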
The TCP and UDP input routines 22-1 and 22-2, respectively, checksum the packet and then demultiplex it. These routines find the protocol control block (PCB) that corresponds to the destination port selected in the packet header, append the packet to the respective socket receive queue 28, and wake up receiving processes 29 that are waiting for that queue to be non-empty. However, if the socket receive queue 28 is full, FreeBSD drops the packet.
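The late demultiplexing step described above may be sketched as follows. The Socket class and the transport_input function are hypothetical. The sketch measures queue capacity in packets for simplicity, omits checksum verification, and returns "no_pcb" when no PCB matches, a case the description above does not address.

```python
from collections import deque


class Socket:
    """Hypothetical stand-in for a socket with a bounded receive queue."""

    def __init__(self, capacity):
        self.capacity = capacity      # simplification: counted in packets
        self.recv_queue = deque()
        self.wakeups = 0              # counts wakeups of waiting receivers


def transport_input(packet, pcb_table):
    """Find the PCB by destination port, then enqueue or drop the packet."""
    sock = pcb_table.get(packet["dport"])
    if sock is None:
        return "no_pcb"               # assumption: no matching PCB
    if len(sock.recv_queue) >= sock.capacity:
        return "dropped"              # earlier protocol work is wasted
    sock.recv_queue.append(packet["data"])
    sock.wakeups += 1                 # wake processes waiting on the queue
    return "queued"
```

Note that the "dropped" outcome is reached only after the packet has already traversed the IP and transport input routines.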
Protocol processing of a received packet in FreeBSD is asynchronous relative to the corresponding receiving processes 29. On a receive call, a receiving process 29 checks the socket receive queue 28. If the queue is empty, the receiving process sleeps; otherwise, the receiving process dequeues the data and copies it out to application buffers.
The BSD approach to protocol processing of received packets has two main disadvantages. First, it is prone to the above-mentioned problem of receive livelock. Because demultiplexing occurs so late, packets destined to the host are dropped only after protocol processing has already occurred. Applications only get a chance to run if the receive load is not so high that all CPU time is spent processing network hardware or software interrupts. Second, even at moderate receive loads, process scheduling may be affected by the fact that the CPU time spent processing network interrupts is charged to whatever process was interrupted, even if that process is unrelated to the received packets. Such incorrect accounting of CPU usage may prevent the operating system from enforcing CPU allocations, thus causing scheduling anomalies.
An alternative protocol processing organization, lazy receiver processing (LRP), is illustrated in FIGS. 2A and 2B. LRP is described in detail in P. Druschel and G. Banga, "Lazy receiver processing (LRP): a network subsystem architecture for server systems," Proceedings of OSDI '96, USENIX, 1996. Instead of the single IP input queue 14 of the above-described BSD approach, LRP uses separate packet queues referred to as channels, with one channel 30-i associated with each socket i. LRP employs early demultiplexing, that is, the network interface hardware, or the network interface driver 10 and the ether_input routine 12, examine the header of each packet and enqueue the packet directly in the channel that corresponds to the header, e.g., channel 30-1 in FIG. 2A or channel 30-2 in FIG. 2B. Following a hardware interrupt, LRP wakes up the processes that are waiting for the channel to be non-empty. However, if the given channel is full, the network interface drops the packet immediately, before further protocol processing.
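Early demultiplexing into per-socket channels may be sketched as follows. The Channel class and the early_demux function are hypothetical, the channel key of protocol and destination port is a simplification of real header matching, and the handling of packets that match no channel is an assumption.

```python
from collections import deque


class Channel:
    """Hypothetical per-socket queue of unprocessed packets."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.packets = deque()


def early_demux(packet, channels):
    """Classify the packet at interrupt time and queue it per socket."""
    key = (packet["proto"], packet["dport"])
    channel = channels.get(key)
    if channel is None:
        return False                  # assumption: unmatched packets dropped
    if len(channel.packets) >= channel.capacity:
        return False                  # full: dropped before protocol work
    channel.packets.append(packet)
    return True
```

Because the drop decision is made before any protocol processing, an overloaded socket costs little CPU time and does not affect the channels of other sockets.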
The LRP approach handles TCP and UDP packets differently. In the UDP case, illustrated in FIG. 2B, the receiving process 32-2 performs the following loop while there is not enough data in the socket receive queue 34-2: While the corresponding channel 30-2 is empty, sleep; then dequeue each packet from the channel 30-2 and submit the packet to the ip_input routine 15, which calls the udp_input routine 22-2, which finally enqueues the packet in the socket receive queue 34-2. The receiving process 32-2 then dequeues the data from the socket receive queue 34-2 and copies it out to application buffers. Therefore, for UDP packets, LRP is synchronous relative to the receiving process's receive calls.
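The UDP receive loop described above may be approximated by the following single-pass sketch. The udp_receive function is hypothetical and does not block: it returns None at the point where the receiving process would sleep. The protocol_input argument is a hypothetical stand-in for the chain of the ip_input and udp_input routines, and "enough data" is taken to mean at least one datagram.

```python
from collections import deque


def udp_receive(channel, socket_queue, protocol_input):
    """One pass of the lazy UDP receive loop, run by the receiver itself.

    Returns the next datagram, or None where the process would sleep.
    """
    if not socket_queue:
        if not channel:
            return None               # real system: sleep until woken
        while channel:                # protocol work runs in the caller
            socket_queue.append(protocol_input(channel.popleft()))
    return socket_queue.popleft()
```

Since protocol_input executes inside the receive call, its CPU time is naturally charged to the receiving process.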
In the TCP case, illustrated in FIG. 2A, LRP is asynchronous relative to the receiving process 32-1. LRP cannot be synchronous relative to the receiving process 32-1 in the TCP case because (1) LRP was designed to be completely transparent to applications, and (2) in some applications, synchronous protocol processing could cause large or variable delays in TCP acknowledgements, adversely affecting throughput. In order to process TCP asynchronously without resorting to software interrupts, LRP associates with each process 32-1 an extra kernel thread 33 that is scheduled at the priority of process 32-1 and has its resource utilization charged to process 32-1. The kernel thread 33 continuously performs the following loop: While the process's TCP channels are empty, sleep; then dequeue each packet from a non-empty TCP channel, e.g., channel 30-1, and submit the packet to the ip_input routine 15, which calls the tcp_input routine 22-1, which finally enqueues the packet in the respective socket receive queue 34-1. LRP handles TCP receive calls similarly to FreeBSD: The receiving process simply checks the socket receive queue and, if the queue is empty, sleeps; otherwise, the process dequeues the data and copies it out to application buffers.
Although LRP, illustrated in FIGS. 2A and 2B, can provide advantages over FreeBSD's protocol processing organization, illustrated in FIG. 1, LRP has a number of significant drawbacks. For example, current versions of many operating systems, including FreeBSD, do not support kernel threads, which are necessary for LRP's TCP processing. Another serious drawback is that contemporary operating systems, including FreeBSD and Linux, often provide firewalling, NAT and other features that may drop packets or change packet headers, precluding LRP's early demultiplexing. A further difficulty of the LRP approach is that it requires all protocol processing to be scheduled, but the operating system's scheduling policies may not be appropriate for some of that processing. Most operating systems support time-sharing scheduling, which penalizes processes for their CPU consumption and may be appropriate for host protocol functionality, that is, for processing packets whose source or destination is an application running on the same node. On the other hand, few operating systems provide QoS guarantees, e.g., via proportional-share scheduling. Proportional-share scheduling can guarantee to each process at least a certain share of the CPU and may be desirable for gateway protocol functionality, such as firewalling, NAT, multicast, and IP forwarding, which process packets whose source and destination may both be on other nodes. Finally, LRP's policy of giving to each TCP kernel thread the same priority as the respective receiving application may be appropriate for time-sharing scheduling, but not for scheduling with QoS guarantees. In the latter case, each application may want to give to its protocol processing a certain fraction of the application's CPU allocation, or perhaps not be delayed by protocol processing while handling certain critical events. The LRP approach, due to its same-priority policy, is unable to provide this desirable flexibility.
The invention provides improved protocol processing techniques for use in operating systems. An illustrative embodiment of the invention, referred to herein as signaled receiver processing (SRP), overcomes the above-noted problems associated with the BSD and LRP approaches. In accordance with the invention, packet arrival causes a signal to the receiving process. The default action of this signal is to perform protocol processing. However, the receiving process can catch, block, or ignore the signal and defer protocol processing until a subsequent receive call. Therefore, in accordance with the invention, protocol processing is usually asynchronous with respect to application receive calls, but applications may opt for synchronous protocol processing. Applications may take advantage of the latter option, e.g., to control the fraction of the application's CPU allocation that is spent processing protocols, or to prevent interruptions while processing certain critical events.
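The signal-driven behavior described above may be illustrated by the following toy model. It is a sketch under stated assumptions, not the embodiment itself: the ReceivingProcess class and all of its members are hypothetical, catching, blocking, and ignoring the signal are collapsed into a single defer_signal flag, and the scheduled method stands for the point at which the operating system runs the process and delivers its pending signals.

```python
from collections import deque


class ReceivingProcess:
    """Toy model of a process under signaled receiver processing."""

    def __init__(self, defer_signal=False):
        self.defer_signal = defer_signal  # True: caught, blocked, or ignored
        self.signal_pending = False
        self.unprocessed = deque()        # packets awaiting protocol work
        self.socket_queue = deque()       # data ready for the application

    def _process_protocols(self):
        self.signal_pending = False
        while self.unprocessed:
            self.socket_queue.append(self.unprocessed.popleft()["data"])

    def packet_arrived(self, packet):
        """Packet arrival only queues the packet and signals the process."""
        self.unprocessed.append(packet)
        self.signal_pending = True

    def scheduled(self):
        """Default signal action: process protocols when the process runs."""
        if self.signal_pending and not self.defer_signal:
            self._process_protocols()

    def receive(self):
        """Receive call: complete any deferred protocol processing first."""
        self._process_protocols()
        return self.socket_queue.popleft() if self.socket_queue else None
```

With defer_signal left false, protocol processing occurs asynchronously whenever the process is scheduled. With defer_signal set, no protocol processing occurs until the application calls receive, which corresponds to the synchronous option.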
In order to support firewalling, NAT, and other gateway protocol functionality, the invention organizes protocol processing in stages. In the illustrative embodiment, each stage invokes a next stage submit (NSS) function to pass a packet to the respective next protocol processing stage. NSS uses a multi-stage early demultiplexing (MED) function. In the illustrative embodiment, the only stage that runs at interrupt level is the one that inputs packets from the network interface hardware. An end-application stage processes IP, TCP, and UDP protocols for packets destined to the host, and runs in the context of the receiving application. Other stages (e.g., firewall, NAT, IP forwarding) run in the context of system processes with configurable minimum proportional CPU shares (in operating systems that can guarantee such CPU shares).
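The staged organization described above may be sketched as follows. The sketch is illustrative only. The Stage class, the stage names, and the routing rule inside med are hypothetical, and the assumption that NSS signals the process owning the next stage is an inference from the description of packet arrival, not a statement of how the embodiment implements NSS or MED.

```python
from collections import deque


class Stage:
    """Hypothetical protocol processing stage with its own packet queue."""

    def __init__(self, name):
        self.name = name
        self.queue = deque()
        self.signaled = False     # set when the owning process is signaled


def med(packet, current, host_addrs):
    """Hypothetical MED rule: return the name of the next stage, if any."""
    if current == "input":
        return "firewall"
    if current == "firewall":
        if packet["dst"] in host_addrs:
            return "end_application"
        return "ip_forward"
    return None


def nss(packet, current, stages, host_addrs):
    """Next stage submit: demultiplex, enqueue, and signal the next stage."""
    name = med(packet, current, host_addrs)
    if name is None:
        return None
    stage = stages[name]
    stage.queue.append(packet)
    stage.signaled = True         # assumption: owner is signaled on enqueue
    return name
```

In this model a packet destined to the host passes from the input stage to the firewall stage and then to the end-application stage, each hop being a separate call to nss made by the stage that has finished its own work.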
Because protocol processing, in accordance with the invention, occurs only when a process is scheduled, the invention prevents BSD's receive livelock problem described above. However, compared to LRP, the present invention has the advantage of being easily portable to systems that do not support kernel threads, such as FreeBSD. Additionally, the invention allows protocol processing to be always correctly charged, and consequently enables the system to enforce and honor proportional-share CPU allocations and other QoS guarantees. Furthermore, the invention does not require modifications to network interface hardware or drivers.