Network processors (NPs) are emerging as a core element of high-speed communication routers and are designed specifically for packet processing applications. Such applications usually have stringent performance requirements. For instance, OC-192 (10 Gigabits/sec) POS (Packet over SONET) packet processing requires a throughput of 28 million packets per second or a service time of 4.57 microseconds per packet for transmission and receipt in the worst case.
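The throughput and service-time figures above describe different quantities, and a short calculation may help relate them. The sketch below is illustrative arithmetic only: the function names are hypothetical, and the resulting concurrency figure is derived here from the two quoted numbers via Little's law, not stated in the text.

```python
# Figures quoted in the text for worst-case OC-192 POS traffic.
PACKETS_PER_SECOND = 28_000_000   # required throughput
SERVICE_TIME_S = 4.57e-6          # time budget for one packet


def inter_arrival_ns(packets_per_second: float) -> float:
    """Average time between packet arrivals, in nanoseconds."""
    return 1e9 / packets_per_second


def packets_in_flight(packets_per_second: float, service_time_s: float) -> float:
    """Little's law: average number of packets being serviced concurrently
    when each packet takes service_time_s and arrivals occur at the given rate."""
    return packets_per_second * service_time_s


if __name__ == "__main__":
    print(f"inter-arrival time: {inter_arrival_ns(PACKETS_PER_SECOND):.1f} ns")
    print(f"packets in flight:  {packets_in_flight(PACKETS_PER_SECOND, SERVICE_TIME_S):.0f}")
```

Under these assumptions a packet arrives about every 35.7 ns, so a 4.57 microsecond per-packet budget is sustainable only if on the order of 128 packets are processed concurrently.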
On the other hand, the latency of an external memory access in an NP is usually larger than the worst-case service time. To address the unique challenges of packet processing (e.g., maintaining stability while maximizing throughput and minimizing latency for worst-case traffic), modern network processors usually have a highly parallel architecture. For instance, some network processors, such as the Intel IXA NPU family of network processors (IXP), include multiple microengines (e.g., programmable processors with packet processing capability) running in parallel, and each microengine supports multiple hardware threads.
Consequently, the associated network applications are also highly parallel and usually multi-threaded to compensate for the long memory access latency. Whenever a new packet arrives, a series of tasks (e.g., receipt of the packet, routing table look-up, and enqueueing) is performed on that packet by a new thread. In such a parallel programming paradigm, modifications to global resources, such as a location in the shared memory, are protected by critical sections to ensure mutual exclusion and synchronization between threads.
Each critical section typically reads a resource, modifies it, and writes it back (read-modify-write, RMW). FIG. 1 is a block diagram illustrating conventional external memory accesses by multiple threads. As shown in FIG. 1, if more than one thread is required to modify the same critical data, a latency penalty is incurred for each thread if each accesses the external memory. Referring to FIG. 1, the threads 101-104 have to be executed in sequence. For example, thread 102 has to wait for thread 101 to finish reading, modifying, and writing back to the external memory before thread 102 can access the same location of the external memory.
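The serialization described above can be sketched in a few lines. This is a minimal illustration, not code for any actual network processor: a Python dictionary stands in for the external memory, a `threading.Lock` stands in for the critical section, and the latency values passed to `serialized_latency_ns` are hypothetical placeholders.

```python
import threading


def run_rmw_threads(n_threads: int) -> int:
    """Run n_threads, each performing one read-modify-write on the same
    shared location, and return the final stored value."""
    external_memory = {"shared": 0}   # stand-in for one external memory location
    lock = threading.Lock()           # stand-in for the critical section

    def critical_section() -> None:
        with lock:                             # only one thread may enter at a time
            value = external_memory["shared"]  # read from external memory
            value += 1                         # modify
            external_memory["shared"] = value  # write back to external memory

    threads = [threading.Thread(target=critical_section) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return external_memory["shared"]


def serialized_latency_ns(n_threads: int, read_ns: int, modify_ns: int, write_ns: int) -> int:
    """Total time when every thread pays the full RMW cost in sequence."""
    return n_threads * (read_ns + modify_ns + write_ns)


if __name__ == "__main__":
    print(run_rmw_threads(4))
    # Hypothetical latencies: 100 ns read, 10 ns modify, 100 ns write.
    print(serialized_latency_ns(4, 100, 10, 100))
```

Because the lock admits one thread at a time, the total cost grows linearly with the number of threads contending for the same location, which corresponds to the per-thread latency penalty illustrated for threads 101-104.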