Network devices, such as switches and routers, are designed to forward network traffic, in the form of packets, at high line rates. One of the most important considerations for handling network traffic is packet throughput. To achieve high throughput, special-purpose processors known as network processors have been developed to efficiently process very large numbers of packets per second. In order to process a packet, the network processor (and/or network equipment employing the network processor) needs to extract data from the packet header indicating the destination of the packet, class of service, etc., store the payload data in memory, perform packet classification and queuing operations, determine the next hop for the packet, select an appropriate network port via which to forward the packet, etc. These operations are generally referred to as “packet processing” operations.
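By way of illustration only, the header-extraction and next-hop steps described above can be sketched in Python as follows. This is a minimal sketch assuming IPv4 packets and a software longest-prefix-match lookup; the `ROUTES` table, its addresses, and the function names `parse_ipv4_header` and `lookup_next_hop` are hypothetical and are not taken from any particular network processor.

```python
import ipaddress
import struct

# Hypothetical forwarding table: (destination prefix, next hop, output port).
ROUTES = [
    (ipaddress.ip_network("10.0.0.0/8"), "192.0.2.1", 1),
    (ipaddress.ip_network("10.1.0.0/16"), "192.0.2.2", 2),
    (ipaddress.ip_network("0.0.0.0/0"), "192.0.2.254", 0),
]


def parse_ipv4_header(packet: bytes) -> dict:
    """Extract the fields used for forwarding from a raw IPv4 packet."""
    if len(packet) < 20:
        raise ValueError("packet shorter than the minimum IPv4 header")
    version_ihl, tos, total_length = struct.unpack_from("!BBH", packet, 0)
    if version_ihl >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    header_length = (version_ihl & 0x0F) * 4  # IHL is in 32-bit words
    return {
        "dscp": tos >> 2,  # class-of-service bits
        "dst": ipaddress.IPv4Address(packet[16:20]),
        "payload": packet[header_length:total_length],
    }


def lookup_next_hop(dst: ipaddress.IPv4Address):
    """Return the (prefix, next hop, port) entry with the longest match."""
    matches = [route for route in ROUTES if dst in route[0]]
    if not matches:
        return None
    return max(matches, key=lambda route: route[0].prefixlen)
```

With the hypothetical table above, a packet addressed to 10.1.2.3 matches all three prefixes, and the /16 entry is selected because it is the most specific.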
Modern network processors (also commonly referred to as network processor units (NPUs)) perform packet processing using multiple multi-threaded processing elements, e.g., processing cores (referred to as microengines or compute engines in network processors manufactured by Intel® Corporation, Santa Clara, Calif.), wherein each thread performs a specific task or set of tasks in a pipelined architecture. During packet processing, numerous accesses are performed to move data between various shared resources coupled to and/or provided by a network processor. For example, network processors commonly store packet metadata and the like in static random access memory (SRAM) stores, while storing packets (or packet payload data) in dynamic random access memory (DRAM)-based stores. In addition, a network processor may be coupled to switch fabrics, cryptographic processors, hash units, general-purpose processors, and expansion buses, such as the PCI (peripheral component interconnect) and PCI Express buses.
Network processors are often configured to perform processing in a collaborative manner, such as via a pipelined processing scheme. Typically, different threads perform different portions of the same task or related tasks, with the output of one thread being employed as an input to the next thread. The threads are specifically tailored for a particular task or set of tasks, such as packet forwarding, packet classification, etc. This type of scheme enables packet-processing operations to be carried out at line rates for most packets; such operations are also referred to as “fast path” operations.
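The pipelined scheme described above, in which the output of one thread is the input to the next, can be sketched in Python as follows. This is an illustrative software analogy only, assuming one thread per stage connected by FIFO queues; the names `run_pipeline`, `classify`, and `select_queue`, and the packet fields used, are hypothetical and do not represent how microengine threads are actually programmed.

```python
import queue
import threading

STOP = object()  # sentinel marking the end of the packet stream


def stage(work, inbox, outbox):
    """Run one pipeline stage: apply work() to each item and pass it on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)  # propagate shutdown to the next stage
            return
        outbox.put(work(item))


def run_pipeline(packets, stages):
    """Push packets through the stages, one thread per stage, in order."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for packet in packets:
        queues[0].put(packet)
    queues[0].put(STOP)
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results


def classify(packet):
    # Hypothetical rule: DSCP 46 is treated as voice traffic.
    cls = "voice" if packet["dscp"] == 46 else "best_effort"
    return {**packet, "cls": cls}


def select_queue(packet):
    # Hypothetical mapping from traffic class to an output queue number.
    return {**packet, "queue": 0 if packet["cls"] == "voice" else 1}
```

Calling `run_pipeline(packets, [classify, select_queue])` returns the packets in arrival order, since each stage is a single thread reading from a FIFO queue.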
In general, the foregoing packet processing operations require multiple memory accesses to one or more memory units. As a result, packet throughput is inherently related to memory (access) latencies. Ideally, all memory accesses would be via the fastest scheme possible. For example, modern on-chip (i.e., on the processor die) static random access memory (SRAM) provides access times of 10 nanoseconds or less. However, this type of memory is very expensive (in terms of chip real estate and chip yield), so the amount of on-chip SRAM on an NPU (e.g., shared scratch memory, and the memory and caches local to each compute engine) is typically very small.
The next fastest type of memory is off-chip SRAM. Since this memory is off-chip, it requires a special interface (e.g., bus) to access it, adding a level of latency to the memory access. However, it still has relatively low latency.
Typically, various types of off-chip dynamic RAM (DRAM) are employed for use as “bulk” memory units. Dynamic RAM is slower than static RAM (due to physical differences in the design and operation of DRAM and SRAM cells), and must be refreshed periodically, incurring additional overhead. Like off-chip SRAM, it also requires a special bus to access it. In most of today's network processor designs, DRAM stores with enhanced performance, such as RDRAM (Rambus DRAM), DDR DRAM (double data rate DRAM), and RLDRAM (reduced latency DRAM), are employed and accessed via dedicated signals. As used herein, a memory unit comprises one or more memory storage devices having associated memory spaces.
An application designer faces the challenging task of utilizing the memory units available to an NPU in such a fashion as to ensure that a minimum amount of latency is incurred during packet processing operations, in order to maximize the packet throughput. Currently, memory unit utilization is determined on a trial-and-error or educated-guess basis in consideration of projected traffic patterns and service levels to be provided by the network element in which one or more NPUs are installed. This produces inefficient memory utilization, reducing packet throughput. Also, when designers use faster cache memories such as CAMs (Content Addressable Memories) and TCAMs (Ternary Content Addressable Memories) to enhance packet processing, they cannot be sure of, or cannot quantify, the lookup hit/miss rate in the CAM/TCAM or any other type of faster cache memory for a given set of packet flows arriving at the NPU.
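To make the hit/miss quantification problem concrete, the following Python sketch replays a sequence of flow keys through a small cache and derives the resulting average lookup latency. It is a simplified model only: the least-recently-used replacement policy, the function names `simulate_flow_cache` and `average_lookup_latency`, and any latency figures supplied by the caller are assumptions made for illustration, not properties of any particular CAM, TCAM, or NPU.

```python
from collections import OrderedDict


def simulate_flow_cache(flow_keys, capacity):
    """Replay flow keys through an LRU cache and count hits and misses."""
    cache = OrderedDict()
    hits = misses = 0
    for key in flow_keys:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            misses += 1
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return hits, misses


def average_lookup_latency(hits, misses, hit_ns, miss_ns):
    """Average latency per lookup, given the full cost of a hit and a miss."""
    total = hits + misses
    if total == 0:
        raise ValueError("no lookups recorded")
    return (hits * hit_ns + misses * miss_ns) / total
```

For example, replaying the flow sequence a, b, a, c, b, a through a two-entry cache yields one hit and five misses; with assumed costs of 10 ns per hit and 100 ns per miss, the average lookup latency is 85 ns. The same flows arriving in a different order can produce a very different hit rate, which is why the figure is hard to estimate without knowledge of the traffic.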