1. Field of the Invention
The present invention relates to networking and more particularly to maximizing page locality.
2. Description of the Related Art
In known networked computer systems, the network interface functionality is treated and supported as an undifferentiated instance of a general purpose Input Output (I/O) interface. This treatment is because computer systems are optimized for computational functions, and thus networking specific optimizations might not apply to generic I/O scenarios. A generic I/O treatment results in no special provisions being made to favor network workload idiosyncrasies. Known networked computer systems include platform servers, server based appliances and desktop computer systems.
Known specialized networking systems, such as switches, routers, remote access network interface units and perimeter security network interface units include internal architectures to support their respective fixed function metrics. In the known architectures, low level packet processing is segregated to separate hardware entities residing outside the general purpose processing system components.
The system design tradeoffs associated with networked computer systems, just like many other disciplines, include balancing functional efficiency against generality and modularity. Generality refers to the ability of a system to perform a large number of functional variants, possibly through deployment of different software components into the system or by exposing the system to different external workloads. Modularity refers to the ability to use the system as a subsystem within a wide array of configurations by selectively replacing the type and number of subsystems interfaced.
It is desirable to develop networked systems that can provide high functional efficiencies while retaining the attributes of generality and modularity. Networked systems are generally judged by a number of efficiencies relating to network throughput (i.e., the aggregate network data movement ability for a given traffic profile), network latency (i.e., the system contribution to network message latency), packet rate (i.e., the system's upper limit on the number of packets processed per time unit), session rate (i.e., the system's upper limit on creation and removal of network connections or sessions), and networking processing overhead (i.e., the processing cost associated with a given network workload). Different uses of networked systems are more or less sensitive to each of these efficiency aspects. For example, bulk data movement workloads such as disk backup, media streaming and file transfers tend to be sensitive to network throughput, transactional uses, such as web servers, tend to also be sensitive to session rates, and distributed application workloads, such as clustering, tend to be sensitive to latency.
Scalability is the ability of a system to increase its performance in proportion to the amount of resources provided to the system, within a certain range. Scalability is another important attribute of networked systems. Scalability underlies many of the limitations of known I/O architectures. On one hand, there is the desirability of being able to augment the capabilities of an existing system over time by adding additional computational resources so that systems always have reasonable room to grow. In this context, it is desirable to architect a system whose network efficiencies improve as processors are added to the system. On the other hand, scalability is also important to improve system performance over time, as subsequent generations of systems deliver more processing resources per unit of cost or unit of size.
The networking function, like other I/O functions, resides outside the memory coherency domain of multiprocessor systems. Networking data and control structures are memory based and access memory through host bridges using direct memory access (DMA) semantics. The basic unit of network protocol processing in known networks is a packet. Packets have well defined representations when traversing a wire or network interface, but can have arbitrary representations when they are stored in system memory. Network interfaces, in their simplest forms, are essentially queuing mechanisms between the memory representation and the wire representation of packets.
There are a plurality of limitations that affect network efficiencies. For example, the number of queues between a network interface and its system is constrained by a need to preserve packet arrival ordering. Also for example, the number of processors servicing a network interface is constrained by the processors having to coordinate service of shared queues, when using multiple processors; it is difficult to achieve a desired affinity between stateful sessions and processors over time. Also for example, a packet arrival notification is asynchronous (e.g., interrupt driven) and is associated with one processor per network interface. Also for example, the I/O path includes at least one host bridge and generally one or more fanout switches or bridges, thus degrading DMA to longer latency and lower bandwidth than processor memory accesses. Also for example, multiple packet memory representations are simultaneously used at different levels of a packet processing sequence with consequent overhead of transforming representations. Also for example, asynchronous interrupt notifications incur a processing penalty of taking an interrupt. The processing penalty can be disproportionately large considering a worst case interrupt rate.
One challenge in network systems relates to hiding system latencies. Application data that is sent over the network typically originates in the main memory of one system and is eventually delivered to the main memory of another system. Network performance of a computer system can significantly degrade if the memory access latency becomes too large. Some operations in a typical network interface implementation are serialized. Examples of these operations include access to a control data structure such as a descriptor ring that is stored in main memory, access to packet data and access to a control data structure such as a completion ring that is stored in main memory. Known I/O architectures and protocols enforce strict ordering of application data.
In known computer systems there may be one or more contributors to the system latency. These contributors include memory technologies that do not correspond to processor and networking speeds. Also, known computer systems may be based on a non-uniform memory access (NUMA) architecture which increases latency if the data cannot be held in the memory of the local processor. In known network systems it is often difficult to control where data is stored.
Some known high end networking systems which include many processors can make the system latency issue worse. Often, an increase in computational scalability also increases the system memory access latency to unacceptable levels from a network perspective.
Many known system include at least one bridge or switch. This bridge or switch adds hardware latency due to protocol conversion or buffering. Additionally, some bridges or switches require software intervention to function properly.
Input output memory management units (IOMMUs) can also generate system latencies. For example, systems that use a virtual memory (VM) model often require virtual address to physical address translation in hardware. The translation tables are hardware limited. If an entry is evicted from the translation table, the latency penalty can be significant. This issue is typical for networking systems because it is often difficult to control to where information is stored.