A network processing unit (NPU), also known as a packet forwarding engine (PFE), is an integrated circuit (IC) designed and optimized for processing network packets. A packet, the data unit at layer 3 of the Open Systems Interconnection (OSI) model, contains header information composed of network address and protocol fields, together with a user data payload. The PFE performs functions on the header such as computation, pattern matching, manipulation of certain bits within the protocol fields, and key lookup in a table (e.g., for an internet protocol (IP) address). These functions support applications such as quality of service (QoS) enforcement, access control monitoring, and packet forwarding in products such as routers, switches, and firewalls, found on a private network, e.g., a local area network (LAN), or on a public network, e.g., the Internet.
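By way of illustration only, the header functions described above (extracting bits from protocol fields and looking up an IP address key in a table) can be sketched in software as follows. The sketch assumes an IPv4 header without options and a forwarding table held as a simple dictionary; the function names `parse_ipv4_header` and `longest_prefix_match`, the table contents, and the port labels are hypothetical and do not represent how any particular PFE implements these functions in hardware.

```python
import ipaddress
import struct


def parse_ipv4_header(packet: bytes) -> dict:
    """Extract a few fields from the fixed 20-byte part of an IPv4 header."""
    if len(packet) < 20:
        raise ValueError("packet shorter than a minimal IPv4 header")
    # Network byte order: version/IHL, TOS, total length, identification,
    # flags/fragment offset, TTL, protocol, checksum, source, destination.
    (ver_ihl, tos, _total_len, _ident, _flags_frag,
     ttl, proto, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    return {
        "version": ver_ihl >> 4,   # upper 4 bits of the first byte
        "dscp": tos >> 2,          # upper 6 bits of the TOS byte (used for QoS)
        "ttl": ttl,
        "protocol": proto,
        "src": ipaddress.IPv4Address(src),
        "dst": ipaddress.IPv4Address(dst),
    }


def longest_prefix_match(table: dict, addr: ipaddress.IPv4Address):
    """Return the next hop of the most specific prefix containing addr.

    A linear scan is used for clarity only; it is an assumption of this
    sketch, not a description of a real lookup engine.
    """
    best = None
    for prefix, next_hop in table.items():
        if addr in prefix and (best is None or prefix.prefixlen > best[0].prefixlen):
            best = (prefix, next_hop)
    return None if best is None else best[1]


# Hypothetical forwarding table and packet, for illustration only.
EXAMPLE_TABLE = {
    ipaddress.IPv4Network("0.0.0.0/0"): "default",
    ipaddress.IPv4Network("10.0.0.0/8"): "port1",
    ipaddress.IPv4Network("10.1.0.0/16"): "port2",
}
EXAMPLE_PACKET = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0xB8, 20, 0, 0, 64, 17, 0,
    bytes([192, 0, 2, 1]), bytes([10, 1, 2, 3]),
)
```

Calling `longest_prefix_match(EXAMPLE_TABLE, parse_ipv4_header(EXAMPLE_PACKET)["dst"])` selects the `/16` entry rather than the shorter `/8` or default entries, which is the key-lookup behavior the paragraph refers to.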
PFE packet processing rates currently exceed tens of millions of packets per second (Mpps). Thus, a substantial amount of data has to be processed by the PFE. To cope with this high bandwidth requirement, PFEs utilize multiple processing cores and multi-threading. The PFE stores data in, and fetches data from, off-chip memory such as dynamic random access memory (DRAM) chips. This off-chip memory is used to store data such as IP addresses for forward hops, traffic management statistics, QoS data, etc. The off-chip memory typically has a memory access controller (MAC) that performs simple operations such as reading data from memory and writing data to memory. More sophisticated operations are typically performed by the PFE. Latency is incurred in any transfer of data to and from the PFE because of the processing time required to frame and transmit the data in a packet to and from the multiple chip interfaces. Pipelining helps to fill empty cycles, but latency still occurs.
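The relationship between packet rate, memory latency, and the need for multiple cores and threads can be made concrete with simple arithmetic. The following sketch is illustrative only: the rate of 50 Mpps and the round-trip latency of 200 ns are assumed figures chosen for the example, not measurements of any device, and the function names are hypothetical. The in-flight estimate is the usual throughput-times-latency product.

```python
def per_packet_budget_ns(rate_pps: int) -> float:
    """Average time available per packet, in nanoseconds, at a given rate."""
    return 1e9 / rate_pps


def packets_in_flight(rate_pps: int, latency_ns: int) -> int:
    """Packets that must be processed concurrently to sustain rate_pps when
    each packet waits latency_ns on one off-chip access (rounded up)."""
    return (rate_pps * latency_ns + 10**9 - 1) // 10**9


# Assumed figures, for illustration only.
ASSUMED_RATE_PPS = 50_000_000   # 50 Mpps
ASSUMED_LATENCY_NS = 200        # one off-chip round trip

if __name__ == "__main__":
    print(per_packet_budget_ns(ASSUMED_RATE_PPS))
    print(packets_in_flight(ASSUMED_RATE_PPS, ASSUMED_LATENCY_NS))
```

Under these assumptions the per-packet budget is shorter than a single off-chip access, so several packets have to be in progress at once, which is what multiple cores, multi-threading, and pipelining provide. The latency of each individual access is not reduced by doing so.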
Using a data cache and/or instruction cache on the PFE chip can help reduce latency in retrieving data or instructions from off-chip memory, by temporarily storing frequently used and prefetched data and instructions on-chip. A high-level cache, i.e., L1, is slaved to the on-die processor for the PFE. An on-die cache is not used as a main memory, that is, as a primary source of data from which resources other than the processor associated with the cache would read. Latency is still incurred in sending data back and forth between the on-die cache and off-chip memory. Because the data stored in the cache is a copy of the data stored in the off-chip memory, administrative overhead may be required to maintain coherency by synchronizing the copy in the cache with the original data stored in one or more external memory devices, such as external buffer memory or external main memory. Sometimes an algorithm running on a PFE will repeatedly fetch data stored in main memory for repetitive operations or frequent updates. If the cache has to be updated for each of these repetitive operations, then the fetch from external memory and the write back to external memory both incur latency.
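The cost of such repetitive updates can be illustrated with a minimal model that simply counts how many times data crosses the chip boundary. The sketch below assumes the worst case described above, in which every update of a statistics counter requires both a fetch from and a write back to external memory; the class `OffChipMemory`, the function `count_packets`, and the counter address are hypothetical names introduced for the example.

```python
class OffChipMemory:
    """Toy model of external memory that counts chip-to-chip transfers."""

    def __init__(self):
        self._cells = {}
        self.transfers = 0  # each read or write crosses the chip boundary once

    def read(self, address: int) -> int:
        self.transfers += 1
        return self._cells.get(address, 0)

    def write(self, address: int, value: int) -> None:
        self.transfers += 1
        self._cells[address] = value


def count_packets(memory: OffChipMemory, counter_address: int, packets: int) -> None:
    """Increment a per-flow counter once per packet using read-modify-write.

    Assumes no on-chip copy survives between updates, so every increment
    pays for one fetch and one write back.
    """
    for _ in range(packets):
        value = memory.read(counter_address)       # fetch from external memory
        memory.write(counter_address, value + 1)   # write back to external memory
```

In this model, updating one counter for 1,000 packets produces 2,000 transfers, each of which incurs the transfer latency discussed above.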
Access throughput for many large data structures, such as network address tables, does not improve with data caches. The random nature of packets arriving from all points of the network, the fine-grained nature of the actual data structure, and its sparse, diffuse structure can make it difficult to hold enough of the data structure in the data cache over any one time span to yield a statistical improvement in performance. This is known as poor temporal locality of the data structure. Therefore, it is often better to reduce the latency to memory by reducing the physical and electrical distance between the processor and the actual copy of the data structure. Often it is infeasible to put the whole data structure in on-chip memory of the PFE. However, moving the data off chip brings back the latency problem.
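The effect of poor temporal locality on a data cache can be demonstrated with a small simulation. The sketch below assumes a least-recently-used (LRU) replacement policy, a cache of 1,000 entries, and uniformly random lookups; these parameters, the table sizes, and the function name `lru_hit_rate` are assumptions made for illustration and are not drawn from any particular device or traffic trace.

```python
import random
from collections import OrderedDict


def lru_hit_rate(keys, capacity: int) -> float:
    """Fraction of accesses in keys that hit an LRU cache of the given capacity."""
    cache = OrderedDict()
    hits = 0
    total = 0
    for key in keys:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / total if total else 0.0


# Assumed workloads, for illustration only.
rng = random.Random(0)
# Lookups scattered over a table far larger than the cache.
scattered = [rng.randrange(1_000_000) for _ in range(50_000)]
# Lookups confined to a working set smaller than the cache.
clustered = [rng.randrange(500) for _ in range(50_000)]

if __name__ == "__main__":
    print(lru_hit_rate(scattered, 1000))
    print(lru_hit_rate(clustered, 1000))
```

When the lookups are spread over a table much larger than the cache, almost every access misses, whereas the same cache serves nearly every access when the working set fits within it. The first case corresponds to the network address table described above.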
If a chip has an onboard microprocessor or microcontroller, then many memory accesses to an on-chip memory are typically processed by the microprocessor or microcontroller first. Otherwise, a direct access to the on-chip memory by an external host might alter data in the on-chip memory on which the microprocessor or microcontroller relies. Additionally, if the microprocessor or microcontroller is configured primarily as a special function microprocessor or microcontroller that does not normally access data in the on-chip memory, then an override function may be necessary to enable that microprocessor or microcontroller to make a special memory access to the on-chip memory. This may require an interrupt to the memory controller in order to drop current and newly arriving external accesses during the time required for the special memory access to complete its operation.
A PFE can include a complex on-die processor capable of sophisticated functions. The operations required for packet processing can range from simple to complex. If a separate coprocessor chip is utilized on a line card to offload less sophisticated operations from the PFE, then the coprocessor incurs the same latency while fetching and storing data to and from an off-chip memory. If the coprocessor has cache memory on die, then the same coherency overhead arises for synchronizing data between the on-die cache and off-chip memory. Moreover, if data from an external memory is shared between two or more other devices, e.g., a coprocessor cache and an NPU cache, then the complexity of maintaining coherency can increase. Complex process signaling, mutual exclusion protocols, and multi-processor modified-exclusive-shared-invalid (MESI) protocols have been developed to facilitate data sharing. Even with these solutions, deadlock conditions can still occur.
A typical coprocessor is slaved to only one host so that accesses and commands arrive from only one source, which simplifies their handling. If more than one host were coupled to and communicating with a single coprocessor resource, then tracking and tracing of the source of a command would be required in order to return the data to the correct requestor. If the shared coprocessor resource has multi-threading capability for one or all of the multiple hosts coupled to it, then the overhead in managing the threads can be substantial.
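The tracking of command sources described above can be illustrated with a minimal sketch in which every command carries an identifier of the requesting host and a per-host tag, and every response echoes both. All names here (`Command`, `Response`, `SharedCoprocessor`, `host_id`, `tag`) are hypothetical and chosen for the example; the sketch models only the bookkeeping, not the timing or threading of a real shared coprocessor.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    host_id: int   # which host issued the command
    tag: int       # per-host sequence tag, echoed back in the response
    address: int   # memory location to read


@dataclass(frozen=True)
class Response:
    host_id: int
    tag: int
    data: int


class SharedCoprocessor:
    """Toy coprocessor serving read commands from more than one host."""

    def __init__(self, memory: dict):
        self._memory = memory
        self._pending = deque()

    def submit(self, command: Command) -> None:
        self._pending.append(command)

    def drain(self) -> dict:
        """Process all pending commands and group responses by requesting host."""
        responses = {}
        while self._pending:
            cmd = self._pending.popleft()
            data = self._memory.get(cmd.address, 0)
            # The source identifier travels with the data so that the
            # result can be returned to the correct requestor.
            responses.setdefault(cmd.host_id, []).append(
                Response(cmd.host_id, cmd.tag, data)
            )
        return responses
```

Even in this simplified form, each command and response must carry additional fields, and the coprocessor must keep per-host state, which is the overhead that a single-host arrangement avoids.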
Creating a memory coprocessor with fixed specialized abstract operations for a specific application can make the market too narrow, thus making the product less economically feasible.
The same design and application concerns mentioned herein also arise for processors other than network processors. For example, general-purpose graphics processing units (GPGPUs), multi-core workstation processors, video game consoles, and workstations for computational fluid dynamics, finite element modeling, weather modeling, etc. would involve similar concerns.