Network devices, such as switches and routers, are designed to forward network traffic, in the form of packets, at high line rates. One of the most important considerations for handling network traffic is packet throughput. To achieve high throughput, special-purpose processors known as network processors have been developed to efficiently process very large numbers of packets per second. To process a packet, the network processor (and/or network equipment employing the network processor) extracts data from the packet header indicating the destination of the packet, class of service, etc., stores the payload data in memory, performs packet classification and queuing operations, determines the next hop for the packet, selects an appropriate network port via which to forward the packet, etc. These operations are generally referred to as “packet-processing” or “packet-forwarding” operations.
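The packet-forwarding steps listed above (header extraction, classification and queuing, next-hop determination, port selection) can be illustrated with a minimal Python sketch. It is not how a network processor is programmed; the `ROUTES` table, the queue names, and the function names are hypothetical and chosen only for illustration, and the packet is assumed to begin with a plain IPv4 header.

```python
import ipaddress
import struct

# Hypothetical forwarding table: destination prefix -> (next hop, output port).
ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): ("10.255.0.1", 1),
    ipaddress.ip_network("10.1.0.0/16"): ("10.1.255.1", 2),
    ipaddress.ip_network("0.0.0.0/0"): ("192.0.2.1", 0),
}


def parse_ipv4_header(packet: bytes) -> dict:
    """Extract the fields the forwarding path needs from an IPv4 header."""
    version_ihl, tos = struct.unpack_from("!BB", packet, 0)
    header_len = (version_ihl & 0x0F) * 4        # IHL is counted in 32-bit words
    dst = ipaddress.ip_address(packet[16:20])    # destination address field
    return {"dscp": tos >> 2, "dst": dst, "payload": packet[header_len:]}


def lookup_next_hop(dst):
    """Longest-prefix match over the illustrative ROUTES table."""
    matches = [net for net in ROUTES if dst in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]


def forward(packet: bytes) -> dict:
    """Classify the packet, pick a queue, and select the next hop and port."""
    fields = parse_ipv4_header(packet)
    # Assumed class-of-service rule: DSCP 46 (expedited forwarding) gets its own queue.
    queue = "expedited" if fields["dscp"] == 46 else "best_effort"
    next_hop, port = lookup_next_hop(fields["dst"])
    return {
        "queue": queue,
        "next_hop": next_hop,
        "port": port,
        "payload_len": len(fields["payload"]),
    }
```

In an actual network processor these stages would typically be spread across threads and processing elements rather than run as a single function call.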
Modern network processors (also referred to as network processor units or NPUs) perform packet processing using multiple multi-threaded processing elements (e.g., processing cores), referred to as microengines or compute engines in network processors manufactured by Intel® Corporation, Santa Clara, Calif., wherein each thread performs a specific task or set of tasks in a pipelined architecture. During packet processing, numerous accesses are performed to move data between various shared resources coupled to and/or provided by a network processor. For example, network processors commonly store packet metadata and the like in static random access memory (SRAM) stores, while storing packets (or packet payload data) in dynamic random access memory (DRAM)-based stores. In addition, a network processor may be coupled to cryptographic processors, hash units, general-purpose processors, and expansion buses, such as the PCI (peripheral component interconnect) and PCI Express buses.
In general, the various packet-processing compute engines of a network processor, as well as other optional processing elements, will function as embedded special-purpose processors. In contrast to conventional general-purpose processors employed on personal computers and servers, the compute engines do not employ an operating system to host applications, but rather directly execute “application” code using a reduced instruction set tailored for packet-processing tasks. For example, the microengines in the Intel® IXP2xxx family of network processors are 32-bit RISC processing cores that employ an instruction set including conventional RISC (reduced instruction set computer) instructions with additional instructions specifically tailored for network packet-processing.
The services supported by a typical network device or system may be numerous. For example, typical services include packet-forwarding with and without Quality of Service (QoS) levels, security, Voice over IP (VoIP), streaming video, subscriber differential services, etc. To effect each particular service, a specific set of code or code modules is developed that is tailored for that service. Additionally, sets of services are typically grouped into an application that is run on the network device. Moreover, a given network device may run one or more applications.
Typically, the application code is generated in the following manner. First, the developers write source code targeted for a particular application and a particular execution environment (e.g., a particular NPU or processing architecture employing multiple NPUs, or multiple single- and/or multi-core processors). The source code is fed into a compiler that generates an intermediate representation comprising original binary code with added instrumentation code. The intermediate representation code is executed in the targeted environment (actual hardware or a virtual model) with what is deemed a representative input (e.g., training data), and profiling statistics are gathered via hooks in the instrumentation code. The statistics, along with the original binary code, are then fed to the compiler, which generates a binary executable that is optimized based on the profiling statistics.
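The profile-then-optimize flow just described can be mimicked in a few lines of Python. This is only a toy analogy under stated assumptions: the "instrumentation" is a hit counter, the "optimization" merely reorders rule checks by observed frequency, and `RULES`, `make_classifier`, and `optimize_order` are hypothetical names rather than part of any real compiler toolchain.

```python
from collections import Counter

# Hypothetical service classifiers, tested in order until one matches.
RULES = {
    "voip": lambda pkt: pkt["proto"] == "udp" and pkt["port"] == 5060,
    "video": lambda pkt: pkt["proto"] == "udp" and pkt["port"] == 554,
    "web": lambda pkt: pkt["proto"] == "tcp" and pkt["port"] == 80,
}


def make_classifier(order, hits=None):
    """Build a classifier that tests rules in `order`.

    When `hits` is supplied, the classifier is "instrumented": it records
    which rule matched, standing in for compiler-inserted profiling hooks.
    """
    def classify(pkt):
        for name in order:
            if RULES[name](pkt):
                if hits is not None:
                    hits[name] += 1
                return name
        return "default"
    return classify


def optimize_order(order, hits):
    """Stand-in for the second compiler pass: most frequently hit rules first."""
    return sorted(order, key=lambda name: -hits[name])


# Step 1: run the instrumented build on training data (assumed to be representative).
training = [{"proto": "tcp", "port": 80}] * 8 + [{"proto": "udp", "port": 5060}] * 2
hits = Counter()
instrumented = make_classifier(list(RULES), hits)
for pkt in training:
    instrumented(pkt)

# Step 2: feed the profile back and produce the "optimized executable".
optimized_order = optimize_order(list(RULES), hits)
optimized = make_classifier(optimized_order)
```

The optimized classifier returns the same results as the instrumented one; only the order in which rules are tested changes, and that order is fixed once the optimized build is produced.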
This approach has several problems. First, the optimized code is only as good as the provided training data. If the real-world data encountered diverges greatly from the training data, the application may perform sub-optimally. Second, if the real-world workload (i.e., traffic conditions) varies over time, as is very common in many network systems, the single, static executable is unable to adapt and optimize itself for the change in workload. Third, for a system that will encounter varying workloads over time, attempting to structure the training data so that it represents all or most of the typical workload scenarios that might be encountered during actual operations leads to a situation where it is very unlikely that the executable is optimized for any individual workload, resulting in a “jack-of-all-trades-master-of-none” situation.
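A small worked example may make these three problems concrete. It assumes a deliberately simple cost model, in which the "executable" is an ordered list of service checks and the cost is the average number of checks per packet. The traffic mixes and all names below are invented for illustration, not measured data.

```python
def avg_tests(order, mix):
    """Average number of rule checks per packet when rules are tested in
    `order` and `mix` gives the fraction of traffic matching each rule."""
    return sum(share * (order.index(name) + 1) for name, share in mix.items())


def tuned_order(mix):
    """Order the checks so the most common traffic class is tested first."""
    return sorted(mix, key=lambda name: -mix[name])


# Hypothetical traffic mixes (fractions of packets per service).
training_mix = {"web": 0.7, "voip": 0.2, "video": 0.1}
evening_mix = {"web": 0.1, "voip": 0.2, "video": 0.7}
blended_mix = {"web": 0.4, "voip": 0.2, "video": 0.4}  # average of the two above

# Problems 1 and 2: a static build tuned on the training mix costs more once traffic shifts.
static_order = tuned_order(training_mix)
cost_on_training = avg_tests(static_order, training_mix)   # about 1.4 checks per packet
cost_on_evening = avg_tests(static_order, evening_mix)     # about 2.6 checks per packet
cost_if_retuned = avg_tests(tuned_order(evening_mix), evening_mix)  # about 1.4

# Problem 3: tuning on a blend of workloads is not the best ordering for either one.
blended_order = tuned_order(blended_mix)
blend_on_training = avg_tests(blended_order, training_mix)  # about 1.5
blend_on_evening = avg_tests(blended_order, evening_mix)    # about 2.1
```

Under this toy model the statically tuned ordering is best only for the mix it was trained on, and the blended ordering improves on the worst case without matching the per-workload optimum for either mix.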
In view of the foregoing, program developers must make a tradeoff between (1) optimizing their code to handle one case very well and hoping for acceptable performance for any traffic condition that doesn't match the optimized case; and (2) attempting to get the best average performance, knowing that their system will never have the best performance for any individual type of network-packet traffic. This is especially true for network systems that support a large number of services, which often place very different kinds of demands on system resources.