1. Technical Field
This invention relates to input processing for computer networks and, more particularly, to technology for improving throughput in such systems having multiprocessor implementations.
2. Description of Related Art
In computer network systems, notably those which are UNIX operating system based, network traffic incoming on the network from a plurality of clients and servers must be processed. This network input processing for a given network input/output (I/O) device has, in the past, always been single threaded processing at the interrupt level, whereby all inbound packets from a network device are processed sequentially. In the early stages of computer networks, this was not necessarily a significant problem.
However, with the maturation of computer technology, it is becoming more commonplace to encounter computer network systems having a number of CPUs present on the system (referred to in the art as multiprocessor or MP systems). With single CPU systems, the bottleneck caused by single threaded processing was not as visible. With the advent of MP systems, however, internet servers have grown bigger and faster, and 8-way and 12-way systems (i.e., systems including 8, 12 or more CPUs) are becoming more and more commonplace, so that this weakness has now become apparent. The inherent weakness of this single thread processing mode is that the aforementioned network input, in accordance with prior art technology, is processed only by a single CPU at any given time, regardless of the number of CPUs on the system available for such processing.
Therefore, a system and method is highly desired, given this maturation of computer network technology into MP systems, whereby such network input processing can take better advantage of the MP scalability, so as to improve network throughput on larger network servers.
In an effort to address this problem, various systems have been developed, employing differing techniques for queuing inbound packets. However, such systems suffer from very serious drawbacks. First, they are not tied to MP scalability. Furthermore, they do not address the problem of out-of-order packets caused by distributing the incoming packets to multiple parallel processing nodes.
Queuing is a method that has long been known for seeking to parallelize processing in order to increase throughput and distribute workloads. However, a serious problem with conventional queuing, in the context of the present invention, is that although any one of multiple CPUs can obtain a packet from a queue for processing, there is no assurance, with multiple processors obtaining packets in this manner, that the packet order will be maintained. It is extremely important that this order be maintained for upper level network protocols.
Once systems expanded to more than one CPU processing packets for throughput and concurrency, previous systems lost control over the order in which packets were scheduled. While this, in and of itself, was not fatal to operation of multiprocessor systems employing queues, once packets are out of order and flowing up to an endpoint of the system, additional resources must be expended in order to process and correctly re-sequence these packets in the protocol stack. This additional processing to ensure correct packet sequencing is itself time consuming, and so results in little net gain from employing queues, multiple CPUs, and parallelization in the first place.
One practical problem resulting from the inability to distribute network input processing is that the throughput of an individual CPU, on the order of 100 megabits per second, is less than that of network adapter cards, which may nominally have throughputs of one gigabit per second, i.e., operating at a factor of 10× faster than the CPU. In accordance with conventional prior practice, wherein no more than one CPU at a time could be processing packets associated with one of the network I/O devices, the net result was that network throughput was CPU bound, even in MP systems. That is to say, throughput could not exceed the capacity of a single CPU running interrupts and processing incoming packets from a single given physical interface. It became increasingly difficult to justify to potential customers of MP systems why they should invest significant amounts of money in these systems without enjoying a concomitant gain in network I/O performance. Similarly, it became increasingly difficult to justify a customer's investment in faster and more expensive network adapter cards (which, as noted, may, in some instances, have a capability 10 times that of the CPUs themselves) when, upon installing such an adapter, the customer still does not see a 10× performance increase (because only a single CPU services incoming traffic at any given time, notwithstanding the presence of other processors with concurrent processing capability). Therefore, there was a need to demonstrate to the customer improvements in system performance that justify the cost of adding CPUs in MP systems, as well as more expensive adapter cards.
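The CPU-bound ceiling described above can be illustrated with a short calculation. This is only a sketch under an idealized assumption that is not stated in the text, namely that input capacity scales linearly with the number of CPUs servicing a device; the function name and parameters are hypothetical.

```python
def input_throughput_mbps(adapter_mbps, per_cpu_mbps, cpus_servicing):
    """Idealized ceiling on inbound throughput for one network I/O device.

    Assumption (not from the text): CPU capacity adds up linearly across
    the CPUs that are allowed to service the device concurrently.
    """
    cpu_capacity = per_cpu_mbps * cpus_servicing
    # Throughput is limited by whichever is slower: the adapter or the CPUs.
    return min(adapter_mbps, cpu_capacity)


# Figures from the text: ~100 Mb/s per CPU, ~1 Gb/s adapter.
single_threaded = input_throughput_mbps(1000, 100, cpus_servicing=1)
eight_way = input_throughput_mbps(1000, 100, cpus_servicing=8)
print(single_threaded, eight_way)
```

With only one CPU servicing the device, the faster adapter yields no gain; under the linear-scaling assumption, the adapter becomes the limit only once enough CPUs share the input processing.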
Although queuing inbound packets was known, these prior efforts were not tied to MP scalability, and such efforts did not address the aforementioned problem of out-of-order packets. It will be appreciated that this is a key shortcoming of prior art attempts to solve the problem of distributing random input to multiple engines while, nevertheless, maintaining the important input sequence for the upper layer protocol (mainly TCP/IP) to work properly. As previously noted, these out-of-order packets cause severe performance problems for such protocols as TCP or UDP due, in part, to the overhead associated with sorting out the proper packet sequences.
Other efforts for accessing fuller processing capacity available from MP systems include performing a hash function for providing a sequencing to the packets received by the device driver. The hash function is performed subsequent to the packets being received from the network adapter by the host memory. While this scheme provides queuing and parallelization, it relies heavily on a scheduler for scheduling CPUs. Schedulers that place a low priority on input processing of packets significantly reduce the efficiency of the MP system. Furthermore, the connection between individual network adapters and device drivers remains single threaded, thereby reducing the transfer rate between the two.
Network input processing is distributed to multiple CPUs on multiprocessor systems to improve network throughput and to take advantage of MP scalability. Incoming packets are received by the network adapter and are distributed to N CPUs for high priority processing, wherein N is the number of receive buffer pools set up by the device driver, based on N CPUs being available for input processing of packets. Each receive buffer pool has an associated CPU. Packets are direct memory accessed (DMAed) to one of the N receive buffer pools by using a hashing function based on the source MAC address (source hardware address), the source IP address, the packet's source and destination TCP port numbers, or any combination of the foregoing. The hashing mechanism ensures that the sequence of packets within a given communication session will be preserved. Distribution is effected by the network adapter, which sends an interrupt to the CPU corresponding to the receive buffer pool after the packet has been DMAed into that buffer pool. This optimizes the efficiency of the MP system by eliminating any reliance on the scheduler and increases the bandwidth between the device driver and the network adapter, while maintaining proper packet sequences.
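The hash-based selection of a receive buffer pool can be sketched in software as follows. This is a minimal illustration, not the adapter's actual mechanism: the text does not specify a hash function, so CRC32 is used here purely as a stand-in, and the names `select_pool` and `distribute` and the packet dictionary layout are hypothetical.

```python
import zlib
from collections import defaultdict


def select_pool(src_mac, src_ip, src_port, dst_port, n_pools):
    """Map a packet's session identifiers to one of n_pools buffer pools.

    CRC32 is an assumed stand-in; any deterministic hash of these fields
    sends every packet of a given session to the same pool.
    """
    key = f"{src_mac}|{src_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % n_pools


def distribute(packets, n_pools):
    """Append each packet, in arrival order, to the pool chosen by the hash."""
    pools = defaultdict(list)
    for pkt in packets:
        idx = select_pool(pkt["src_mac"], pkt["src_ip"],
                          pkt["src_port"], pkt["dst_port"], n_pools)
        pools[idx].append(pkt)  # one pool per CPU; arrival order is kept
    return pools


# Two interleaved sessions, each carrying sequence numbers 0..4.
flow_a = {"src_mac": "00:11:22:33:44:55", "src_ip": "10.0.0.1",
          "src_port": 40000, "dst_port": 80}
flow_b = {"src_mac": "66:77:88:99:aa:bb", "src_ip": "10.0.0.2",
          "src_port": 40001, "dst_port": 80}
packets = []
for seq in range(5):
    packets.append({**flow_a, "seq": seq})
    packets.append({**flow_b, "seq": seq})

pools = distribute(packets, n_pools=4)
```

Because the pool index depends only on fields that are constant within a session, all packets of that session reach the same pool, and therefore the same CPU, in arrival order. Different sessions may land in different pools and be processed in parallel.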