Network processors are generally used for analyzing and processing packet data for routing and switching packets in a variety of applications, such as network surveillance, video transmission, protocol conversion, voice processing, and internet traffic routing. Early network processors relied on software running on general-purpose processors, either singly or in a multi-core implementation, but such software-based approaches are slow. Further, increasing the number of general-purpose processors yielded diminishing performance improvements, or might actually reduce overall network processor throughput. Newer designs add hardware accelerators in a system-on-chip (SoC) architecture to offload certain tasks, such as encryption/decryption, packet data inspection, and the like, from the general-purpose processors.
Network processors implemented as an SoC having multiple processing modules might typically employ one or more general-purpose processors and one or more hardware accelerators, the hardware accelerators implementing well-defined procedures to improve the efficiency and performance of the SoC. However, the general-purpose processors might be required for certain packet processing functions, such as deep-packet inspection, that might not be efficiently implemented using the hardware accelerators alone. Further, when memory, particularly memory external to the SoC, is used to communicate between the accelerators and the processors, overall throughput of the SoC might be limited where the processors “stall” waiting for packet data to become available for processing. For example, if a processor core tries to access memory addresses that are not in its cache and the memory system has to go to other memory (e.g., dynamic random access memory or “DRAM”) to retrieve the data, the processor core might stall for hundreds of processor clock cycles per address while waiting for the memory system to deliver the requested data to the processor core. In another example, an external memory might include two or more substructures (e.g., multiple banks of DRAM). In such a system, a latency penalty might be incurred for multiple access requests to the same memory substructure. Additionally, a given set of operations for a data flow might be required to be completed in a given order, further adding to latency. Thus, a technique for reducing latency when accessing memory is desirable.
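The latency penalty for repeated requests to the same memory substructure can be illustrated with a toy model. The following is a minimal sketch only, not a description of any particular memory controller: the function names `bank_of` and `total_cycles`, the 64-byte interleave granularity, the four-bank layout, and the cycle counts are all assumptions chosen for illustration.

```python
def bank_of(address, num_banks, line_bytes=64):
    """Map an address to a bank, assuming simple line-interleaved banking."""
    return (address // line_bytes) % num_banks


def total_cycles(addresses, num_banks=4, access_cycles=100, same_bank_penalty=50):
    """Sum the cycles for a sequence of accesses served strictly in order.

    Every access costs `access_cycles` (an assumed figure). An access that
    targets the same bank as the immediately preceding access pays an extra
    `same_bank_penalty` (also an assumed figure).
    """
    cycles = 0
    previous_bank = None
    for address in addresses:
        bank = bank_of(address, num_banks)
        cycles += access_cycles
        if bank == previous_bank:
            cycles += same_bank_penalty
        previous_bank = bank
    return cycles


# Four accesses that all map to bank 0 versus four spread across banks 0..3.
same_bank = [0, 256, 512, 768]
spread = [0, 64, 128, 192]
print(total_cycles(same_bank))  # 550
print(total_cycles(spread))     # 400
```

Under these assumed numbers, the same four requests take 550 cycles when they collide on one bank and 400 cycles when they are spread across banks. Because the model serves requests strictly in the given order, it also reflects how an ordering requirement might prevent the requests from being rearranged to avoid the penalty.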