In high-performance packet processing, such as network load balancing and deep packet inspection, it is common to run multiple packet engines on different cores or even on different microprocessors. This configuration allows multiple packets to be processed in parallel, since each packet engine handles a different packet at the same time. After a packet engine finishes processing a packet, the packet can be prioritized and/or regulated to a certain rate by a central Quality-of-Service (QoS) device before being sent through a bottleneck link (i.e., a physical device that transmits all packets processed by the multiple packet engines).
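The arrangement above can be sketched as follows. This is a minimal, hypothetical illustration, not the described system: the packet engines are modeled sequentially rather than on separate cores, and all names (`Packet`, `process`, the priority values) are assumptions introduced for the example. It shows the key point that the central QoS stage reorders the engines' combined output before the single bottleneck link transmits it.

```python
import heapq

class Packet:
    def __init__(self, pkt_id, priority, payload):
        self.pkt_id = pkt_id
        self.priority = priority  # lower value = higher priority
        self.payload = payload

def process(packet):
    # Stand-in for per-engine work (e.g., inspection, header rewrite).
    packet.payload = packet.payload.upper()
    return packet

# Packets are split across two engines; each engine handles its own share.
engine_inputs = [
    [Packet(0, 2, "data-a"), Packet(1, 0, "data-b")],  # engine 0
    [Packet(2, 1, "data-c"), Packet(3, 3, "data-d")],  # engine 1
]

# Central QoS stage: a priority queue fed by every engine's output.
qos_queue = []
for packets in engine_inputs:
    for pkt in packets:
        processed = process(pkt)
        heapq.heappush(qos_queue, (processed.priority, processed.pkt_id, processed))

# The bottleneck link transmits packets in QoS order, not arrival order.
link_order = [heapq.heappop(qos_queue)[2].pkt_id for _ in range(len(qos_queue))]
print(link_order)  # highest-priority packet (id 1) leaves first: [1, 2, 0, 3]
```

In a real deployment each engine would run on its own core and feed the QoS device concurrently; the sketch keeps only the ordering behavior.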
In order for the QoS device to send out all packets processed by the multiple packet engines, the QoS device can either receive a copy of each packet from the packet engines or share a packet memory with them. Performance penalties exist under both approaches. The first approach involves an inefficient copy operation for each packet provided to the QoS device, which must then process the received packets before providing them to the link. The second approach involves a central storage (such as a memory of a parallel processing system or a cache of a microprocessor) shared by the QoS device and the multiple packet engines; the central storage holds all packets and gives the QoS device access to them. In this latter approach, however, because the packet engines can reside on different cores or different processors, sharing the storage can cause a cache coherency problem in which cache contents are undesirably invalidated.
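The two hand-off approaches can be contrasted with a small sketch. This is a hypothetical illustration under assumed names and sizes (a four-entry store of 1500-byte packets): approach one duplicates every payload into the QoS device's own queue, while approach two queues only small references into the shared store. The byte counts make the per-packet copy overhead concrete; the sketch cannot model the cache coherency traffic that the shared-store approach incurs across cores, which is noted only in a comment.

```python
# Hypothetical shared packet store: four 1500-byte packets, keyed by id.
shared_store = {i: bytes(1500) for i in range(4)}

# Approach 1: the QoS device receives a full copy of every packet.
copy_queue = []
copied_bytes = 0
for pkt_id, payload in shared_store.items():
    dup = bytearray(payload)   # per-packet copy operation
    copy_queue.append(dup)
    copied_bytes += len(dup)

# Approach 2: the QoS device queues only references into the shared store.
# No payload bytes move, but on real hardware the engines and the QoS
# device touching the same cache lines can invalidate each other's caches.
ref_queue = list(shared_store.keys())  # four small integers, no payload copies
ref_bytes_moved = 0

print(copied_bytes, ref_bytes_moved)  # 6000 vs 0 payload bytes moved
```

The trade-off shown is the one the text describes: the copy approach pays a per-packet data-movement cost, while the shared-storage approach avoids copies at the price of cross-core coherency effects.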