This invention relates to the reception and processing of data packets at a data processing system having a plurality of processor cores.
In conventional networked personal computers and servers having more than one CPU core, the processing of received data packets is usually performed on just one of the CPU cores. When incoming packets are received at the network card of the computer or server (generically, a data processor), they are delivered into host memory via a delivery channel. A delivery channel has an associated notification mechanism, which serves to inform the host software of the arrival of packets. Typically the notification mechanism includes an interrupt, which is raised when received packets are available for processing. In response to the interrupt an interrupt service routine is invoked on one of the CPU cores, and causes that CPU core to perform the work associated with processing the incoming packets. This work can be significant, and at high data rates this CPU core typically becomes a bottleneck in the system.
The above problem is especially severe with high data rate network architectures such as 10 and 100 Gb Ethernet. Designs in which receive processing is performed on a single processor struggle to cope at peak data throughput, and any source of inefficiency in the handling of incoming data packets must be minimised if the promise of such high data transfer speeds is to be realised.
FIG. 1 shows a typical structure of a data processing system having a monolithic operating system architecture. Kernel 103 manages hardware such as a network interface device (NIC) 101 by means of driver 107 and controls the resources of the system. The kernel receives hardware interrupts 115 at interrupt handler 113 and, in response to being notified that there is available incoming packet data, performs receive processing of the data. The processed traffic data is delivered to the appropriate socket 111 and application 109, which executes in user space 105.
With the aim of mitigating some of the problems described above, Microsoft has developed the Receive Side Scaling (RSS) architecture, which improves performance by spreading the processing load across multiple CPU cores. Each received packet is delivered to the host via one of a number of delivery channels, each associated with a notification channel. Each notification channel is associated, typically via an interrupt, with a different CPU core, so that the packets delivered to different delivery channels are processed on different CPU cores. It is arranged that all packets of a given data flow are delivered to a single delivery channel, and so are processed on the same CPU core. This is necessary to ensure that packets of a given data flow are processed in the order in which they arrive.
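By way of illustration only, the mapping of data flows onto delivery channels can be sketched as follows. The sketch is written in Python and is not taken from any RSS implementation: the names `flow_hash` and `select_channel`, the 128-entry indirection table and the use of CRC-32 as the hash function are assumptions made for brevity (RSS as specified by Microsoft uses a Toeplitz hash over the same address and port fields).

```python
import ipaddress
import struct
import zlib

NUM_CHANNELS = 4  # assumed: one delivery channel per CPU core

# Assumed indirection table: hash bucket -> delivery channel index.
INDIRECTION_TABLE = [bucket % NUM_CHANNELS for bucket in range(128)]


def flow_hash(src_ip, dst_ip, src_port, dst_port):
    """Hash the fields that identify a TCP/IPv4 data flow.

    CRC-32 is a stand-in chosen for illustration; it is not the RSS hash.
    """
    key = (
        ipaddress.IPv4Address(src_ip).packed
        + ipaddress.IPv4Address(dst_ip).packed
        + struct.pack("!HH", src_port, dst_port)
    )
    return zlib.crc32(key)


def select_channel(src_ip, dst_ip, src_port, dst_port):
    """Return the delivery channel for a packet of the given flow."""
    bucket = flow_hash(src_ip, dst_ip, src_port, dst_port) % len(INDIRECTION_TABLE)
    return INDIRECTION_TABLE[bucket]


# Every packet of one flow carries the same four fields, so it always
# selects the same channel and is therefore handled on the same core.
first = select_channel("10.0.0.1", "10.0.0.2", 40000, 80)
second = select_channel("10.0.0.1", "10.0.0.2", 40000, 80)
print(first == second)
```

Because the channel depends only on fields that are constant for the lifetime of a flow, in-order processing within a flow is preserved while distinct flows may be spread over different cores.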
RSS aims to provide a solution to the above problems for the monolithic Microsoft Windows operating system, and is also used on other monolithic operating systems. However, there are other multi-CPU system architectures, such as virtualised systems supporting multiple operating systems or systems supporting untrusted packet processing entities, in which RSS is not helpful because received packets are forwarded to other software domains that may not run on the same CPU core.
The architecture of a typical virtualised system is illustrated in FIG. 2A. Virtualised operating system instances 205 are generally untrusted and control of the hardware and resource allocation falls to a hypervisor or trusted domain 203. The hypervisor manages the hardware and the virtualised operating system instances. Each virtualised OS instance can access the network via a software-emulated network interface 207, which is typically implemented as a communication channel between the virtualised OS and the hypervisor. Packets received by the real network interface controller (NIC) 201 are delivered to the hypervisor, which routes them to the appropriate virtualised domain via the software-emulated network interface. A problem with this approach is that it incurs significant additional processing overheads when compared with the native OS receive path, and the forwarding of packets through the hypervisor can become a bottleneck in the system.
Some smart NICs are able to support multiple protected interfaces for sending and receiving packets, known as virtualised network interface controllers (VNICs). Each virtualised OS domain may be given direct access to a VNIC via a memory mapping onto the NIC hardware or via a shared memory segment. The virtualised OS uses a VNIC to receive packets directly from the NIC, bypassing the hypervisor and its associated forwarding overheads. Each VNIC includes a delivery channel for delivering packets and a means of notifying the virtualised OS. Such smart NICs typically have a filter table or forwarding table that maps received packets to the appropriate VNIC and virtualised OS. Received packets that do not map to any VNIC may be delivered via a default delivery channel to the host domain.
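The filter table lookup described above can be sketched, purely as an illustration, in Python. The class name `FilterTable`, the choice of destination MAC address as the lookup key and the channel identifiers are all assumptions; a real NIC may match on other header fields and performs the lookup in hardware.

```python
DEFAULT_CHANNEL = "host-default"  # assumed name for the default delivery channel


class FilterTable:
    """Maps a packet header field to the VNIC that should receive the packet."""

    def __init__(self):
        self._entries = {}

    def add(self, dst_mac, vnic_id):
        # Assumption: filters are keyed on destination MAC address.
        self._entries[dst_mac.lower()] = vnic_id

    def lookup(self, dst_mac):
        # Packets that match no filter fall back to the default channel,
        # which delivers them to the host domain.
        return self._entries.get(dst_mac.lower(), DEFAULT_CHANNEL)


table = FilterTable()
table.add("02:00:00:00:00:01", "vnic-guest-1")
table.add("02:00:00:00:00:02", "vnic-guest-2")

print(table.lookup("02:00:00:00:00:02"))
print(table.lookup("02:00:00:00:00:99"))
```

The first lookup selects the VNIC of the second guest; the second matches no filter and so selects the default delivery channel.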
In some configurations the notification means in a VNIC includes an interrupt, which is able to invoke the virtualised OS directly when packets arrive. Alternatively the virtualised OS may be invoked via a virtual interrupt. In the latter case, instead of raising an interrupt, a virtual interrupt notification is sent to the hypervisor via a default notification channel. The hypervisor receives this virtual interrupt notification and in response invokes the virtualised OS via a virtual interrupt.
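The two notification paths can be contrasted in a short Python sketch. It is a simplified model only: the classes `GuestOS` and `Hypervisor`, the function `notify` and the representation of the default notification channel as a list are hypothetical and do not correspond to any particular hypervisor interface.

```python
class GuestOS:
    """Stands in for a virtualised OS; records how it was invoked."""

    def __init__(self, name):
        self.name = name
        self.invocations = []

    def on_interrupt(self, kind):
        self.invocations.append(kind)


class Hypervisor:
    def __init__(self):
        # Assumed model of the default notification channel.
        self.default_channel = []

    def post_virtual_interrupt_notification(self, guest):
        self.default_channel.append(guest)

    def service_default_channel(self):
        # The hypervisor forwards each pending notification to its guest
        # by raising a virtual interrupt.
        while self.default_channel:
            guest = self.default_channel.pop(0)
            guest.on_interrupt("virtual")


def notify(guest, hypervisor, direct):
    """Model of a VNIC signalling that packets have arrived for `guest`."""
    if direct:
        guest.on_interrupt("physical")  # interrupt invokes the guest directly
    else:
        hypervisor.post_virtual_interrupt_notification(guest)


hv = Hypervisor()
guest_a, guest_b = GuestOS("a"), GuestOS("b")
notify(guest_a, hv, direct=True)
notify(guest_b, hv, direct=False)
hv.service_default_channel()
print(guest_a.invocations, guest_b.invocations)
```

In the second path the guest is invoked only after the hypervisor has serviced its default notification channel, which is the extra step the virtual interrupt mechanism introduces.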
The accelerated network I/O for virtualised OSs described above improves performance considerably. However, processing of all packets received by a guest domain is performed on just one CPU core, which may therefore become a bottleneck in the system.
Conventional methods for distributing packet processing over multiple processors, such as RSS, suffer from two main problems:
(i) Locks
State information relating to a particular data flow may be accessed by code executing on multiple processors and must therefore be protected from concurrent access. Typically this is achieved through the use of state locks. When locks are not contended they incur a relatively small (but still significant) overhead. However, when locks are contended, the loss in efficiency is very high. This can occur when a receive path executes on more than one processor core and each core is required to access the same state information of that receive path. In particular, while a kernel thread running on a processor is blocked waiting for a lock, that processor will probably not be able to perform any useful work. Processors in conventional multi-processor networked systems can spend a significant amount of time waiting for locks.
(ii) Cache Effects
As the network stack executes on a processor, any state in host memory that it touches (reads or writes) will be copied into the cache(s) close to that processor core. When state is written, it is purged from the caches of any other processor cores. Thus, when a network stack executes concurrently on multiple cores and more than one processor writes to cache lines holding the state of the stack, those cache lines will bounce between the cores. This is highly inefficient, since each write to the network stack state by a particular processor causes the other processors handling that stack to purge, and later reload, the affected cache lines.
Where locks are used to protect shared state, the memory that implements those locks is itself shared state, and is also subject to cache-line bouncing.
Lock-free techniques for managing concurrent access to shared state may not suffer from the blocking behaviour of locks, but do suffer from cache-bouncing.
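The cache effects described above can be made concrete with a deliberately simplified counting model, sketched here in Python. The model is an assumption made for illustration and is not a simulation of any real cache coherency protocol: each data flow is given one cache line for its lock and one for its state, every packet writes both lines, and a "transfer" is counted whenever a line is written by a core other than the one that wrote it last. The names `count_line_transfers` and `events_for` are hypothetical.

```python
NUM_CORES = 4
FLOWS = ("A", "B", "C")
# 100 packets per flow, interleaved in arrival order A, B, C, A, B, C, ...
PACKETS = [flow for _ in range(100) for flow in FLOWS]


def count_line_transfers(events):
    """Count writes that force a cache line to move from one core to another.

    `events` is a time-ordered sequence of (core, line) write events.
    """
    last_writer = {}
    transfers = 0
    for core, line in events:
        previous = last_writer.get(line)
        if previous is not None and previous != core:
            transfers += 1
        last_writer[line] = core
    return transfers


def events_for(packets, core_of):
    """Each packet writes its flow's lock line and then its state line."""
    events = []
    for index, flow in enumerate(packets):
        core = core_of(index, flow)
        events.append((core, (flow, "lock")))
        events.append((core, (flow, "state")))
    return events


# Packets handed to cores in rotation, regardless of flow.
spread = count_line_transfers(events_for(PACKETS, lambda i, f: i % NUM_CORES))

# All packets of a flow handled on one fixed core.
affine_core = {"A": 0, "B": 1, "C": 2}
affine = count_line_transfers(events_for(PACKETS, lambda i, f: affine_core[f]))

print(spread, affine)
```

With these assumed parameters the rotating schedule moves every line on every packet after the first (594 transfers in total, lock lines included), whereas the flow-affine schedule causes none. The lock lines bounce exactly as often as the state they protect, which is the point made above about locks being shared state themselves.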
There is therefore a need for an improved method of distributing the processing load associated with handling network packets in data processing systems having multiple CPU cores.