This invention relates to the handling of connections over a network at a data processing system having a plurality of processor cores.
In conventional networked personal computers and servers having more than one CPU core, the processing of received data packets is usually performed on just one of the CPU cores. When incoming packets are received at the network controller of the computer or server (generically, a data processor), they are delivered into host memory via a receive channel. A receive channel has an associated notification mechanism, which serves to inform the host software of the arrival of packets. Typically the notification mechanism includes an interrupt, which is raised when received packets are available for processing. In response to the interrupt an interrupt service routine is invoked on one of the CPU cores, and causes that CPU core to perform the work associated with processing the incoming packets. This work can be significant, and at high data rates this CPU core typically becomes a bottleneck in the system.
The above problem is especially severe with high data rate network architectures such as 10 Gb and 100 Gb Ethernet. Current single-processor designs struggle to cope with peak data throughput, and any source of inefficiency in the handling of incoming data packets must be minimised if the promise of such high data transfer speeds is to be realised.
FIG. 1 shows a typical structure of a data processing system having a monolithic operating system architecture. Kernel 103 manages hardware such as a network interface controller (NIC) 101 by means of driver 107 and controls the resources of the system. The kernel receives hardware interrupts 115 at interrupt handler 113 and, in response to being notified that there is available incoming packet data, performs receive processing of the data. The processed traffic data is delivered to the appropriate socket 111 and application 109, which executes in user space 105.
With the aim of mitigating some of the problems described above, Microsoft has developed the Receive Side Scaling (RSS) architecture, which improves performance by spreading the processing load across multiple CPU cores. Each received packet is delivered to the host via one of a number of receive channels, each associated with a notification mechanism. Each receive channel and notification mechanism pair is associated, typically via an interrupt, with a different CPU core, so that the packets delivered to different receive channels are processed on different CPU cores. It is arranged that all packets of a given data flow are delivered to a single receive channel, and so are processed at the same CPU core. This is necessary to ensure that packets of a given data flow are processed in the order that they arrive.
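By way of illustration only, the steering of data flows to receive channels described above can be sketched as follows. The sketch assumes that a flow is identified by its source and destination addresses and ports, and it uses CRC32 purely as a stand-in hash; RSS implementations commonly use a Toeplitz hash together with an indirection table. The names `NUM_CHANNELS`, `flow_key` and `select_channel` are hypothetical and do not appear in any RSS interface.

```python
import zlib

# Hypothetical configuration: one receive channel per CPU core.
NUM_CHANNELS = 4


def flow_key(src_ip, src_port, dst_ip, dst_port):
    """Build a byte string identifying a data flow (assumed 4-tuple)."""
    return f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()


def select_channel(src_ip, src_port, dst_ip, dst_port,
                   num_channels=NUM_CHANNELS):
    """Map a flow to a receive channel.

    CRC32 is a stand-in hash chosen for brevity; it is an assumption of
    this sketch, not the hash used by any particular network controller.
    """
    digest = zlib.crc32(flow_key(src_ip, src_port, dst_ip, dst_port))
    return digest % num_channels


# Three packets; the first and third belong to the same flow.
packets = [
    ("10.0.0.1", 40000, "10.0.0.9", 80),
    ("10.0.0.2", 40001, "10.0.0.9", 80),
    ("10.0.0.1", 40000, "10.0.0.9", 80),
]
channels = [select_channel(*p) for p in packets]

for pkt, ch in zip(packets, channels):
    print(pkt, "-> receive channel", ch)
```

Because the channel is a pure function of the flow identifier, every packet of a given flow reaches the same channel and hence the same CPU core, which preserves per-flow ordering. Distinct flows may or may not share a channel.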
Conventional operating systems typically maintain state associated with each data flow, and RSS ensures that when processing received data, the state for a given flow is always accessed by the same CPU core. However, the state associated with data flows may also be accessed in response to requests from applications or other software entities, which could potentially execute on any of the CPU cores in the system. Thus, in certain situations, any given state might be accessed by means of a different CPU core to the one handling received data of the data flow to which the state relates. This causes the state to be moved between the caches of the respective CPU cores, resulting in inefficient handling of the request.
Conventional methods for distributing packet processing over multiple processors, such as RSS, suffer from two main problems:
(i) Locks
State information relating to a particular data flow may be accessed by code executing on multiple processors and must therefore be protected from concurrent access. Typically this is achieved through the use of state locks. When locks are not contended, they incur a relatively small (but still significant) overhead. However, when locks are contended, the loss in efficiency is very high. This can occur when a receive path executes on more than one processor core and each core is required to access the same state information of that receive path. In particular, while a kernel thread running on a processor is blocked waiting for a lock, that processor will probably not be able to perform any useful work. Processors in conventional multi-processor networked systems can spend a significant amount of time waiting for locks.
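The following is a minimal sketch of lock-protected flow state, offered only to make the distinction between uncontended and contended acquisition concrete. It uses user-space Python threads rather than kernel threads, so it does not reproduce the costs described above; the class `FlowState` and the functions `worker` and `run` are hypothetical names introduced for this example.

```python
import threading


class FlowState:
    """Hypothetical per-flow state protected by a single lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.bytes_received = 0
        self.contended = 0  # times a caller found the lock already held

    def update(self, nbytes):
        # Try the uncontended fast path first.
        if not self.lock.acquire(blocking=False):
            # Contended path: the caller blocks until the holder releases.
            self.lock.acquire()
            self.contended += 1  # safe: the lock is now held
        try:
            self.bytes_received += nbytes
        finally:
            self.lock.release()


def worker(state, iterations, nbytes):
    for _ in range(iterations):
        state.update(nbytes)


def run(num_threads, iterations=1000, nbytes=1):
    """Run num_threads workers against one shared FlowState."""
    state = FlowState()
    threads = [
        threading.Thread(target=worker, args=(state, iterations, nbytes))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state


single = run(1)
shared = run(2)
print("one thread:  bytes =", single.bytes_received,
      "contended =", single.contended)
print("two threads: bytes =", shared.bytes_received,
      "contended =", shared.contended)
```

With a single thread the lock is never contended. With two threads the byte count remains correct because of the lock, but the number of contended acquisitions depends on scheduling and varies from run to run.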
(ii) Cache Effects
As the network stack executes on a processor, any state in host memory that it touches (reads or writes) will be copied into the cache(s) close to that processor core. When state is written, it is purged from the caches of any other processor cores. Thus, in the case when a network stack executes concurrently on multiple cores, if more than one processor writes to shared cache lines in the state of the stack, the cache lines will bounce between the cores. This is highly inefficient since each cache write operation to the network stack state by a particular processor causes the other processors handling that stack to purge, and later reload, those cache lines.
Where locks are used to protect shared state, the memory that implements those locks is itself shared state, and is also subject to cache-line bouncing.
Lock-free techniques for managing concurrent access to shared state may not suffer from the blocking behaviour of locks, but do suffer from cache-bouncing.
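The cache-line bouncing described above can be illustrated with a deliberately simplified model. The sketch below assumes a single cache line that is held by at most one core at a time for writing, and counts how often the line must move between cores for a given sequence of writes. It ignores reads, multiple cache levels and the details of any real coherence protocol; the function name `count_line_transfers` is hypothetical.

```python
def count_line_transfers(writer_cores):
    """Count transfers of one cache line between cores.

    writer_cores is the sequence of core identifiers performing
    successive writes to the same cache line. A transfer is counted
    whenever a write comes from a core other than the current owner,
    since the line is then purged from the previous owner's cache.
    """
    owner = None
    transfers = 0
    for core in writer_cores:
        if core != owner:
            if owner is not None:
                transfers += 1  # line purged from old owner, loaded by new
            owner = core
    return transfers


same_core = [0] * 8        # all eight writes from core 0
alternating = [0, 1] * 4   # writes alternate between cores 0 and 1

print(count_line_transfers(same_core))    # prints 0
print(count_line_transfers(alternating))  # prints 7
```

In this model, eight writes from one core cause no transfers, whereas eight writes alternating between two cores move the line seven times. The same applies to the memory implementing a lock, since each acquisition writes to the lock's cache line.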
There is therefore a need for an improved method of distributing the processing load associated with handling network packets in data processing systems having multiple CPU cores.