1. Field of the Invention
The present invention relates to communication systems and, in particular, to an accelerated processor architecture for network communications.
2. Description of the Related Art
Network processors are generally used for analyzing and processing packet data for routing and switching packets in a variety of applications, such as network surveillance, video transmission, protocol conversion, voice processing, and internet traffic routing. Early types of network processors were based on software-based approaches with general-purpose processors, either singly or in a multi-core implementation, but such software-based approaches are slow. Further, increasing the number of general-purpose processors yielded diminishing performance improvements, or might actually slow down overall network processor throughput. Newer designs add hardware accelerators to offload certain tasks from the general-purpose processors, such as encryption/decryption, packet data inspection, and the like. These newer network processor designs are traditionally implemented with either i) a non-pipelined architecture or ii) a fixed-pipeline architecture.
In a typical non-pipelined architecture, general-purpose processors are responsible for each action taken by acceleration functions. A non-pipelined architecture provides great flexibility in that the general-purpose processors can make decisions on a dynamic, packet-by-packet basis, thus providing data packets only to the accelerators or other processors that are required to process each packet. However, significant software overhead is involved in those cases where multiple accelerator actions might occur in sequence.
In a typical fixed-pipeline architecture, packet data flows through the general-purpose processors and/or accelerators in a fixed sequence regardless of whether a particular processor or accelerator is required to process a given packet. This fixed sequence might add significant overhead to packet processing and has limited flexibility to handle new protocols, limiting the advantage provided by using the accelerators.
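The trade-off between the two traditional architectures can be illustrated with a minimal sketch. The stage names, the function names, and the cost model (one software intervention per accelerator action) are assumptions chosen for illustration only and are not taken from any particular network processor.

```python
# Hypothetical accelerator stages, listed in fixed-pipeline order.
STAGES = ["classify", "decrypt", "inspect", "modify"]

def non_pipelined(needed):
    """General-purpose processor dispatches each required accelerator action."""
    visits = [s for s in STAGES if s in needed]
    # Assumed cost model: one software decision per accelerator action.
    cpu_interventions = len(visits)
    return visits, cpu_interventions

def fixed_pipeline(needed):
    """Packet traverses every stage in order, whether needed or not."""
    visits = list(STAGES)
    wasted = [s for s in STAGES if s not in needed]
    return visits, wasted

# A packet that only needs decryption:
print(non_pipelined({"decrypt"}))   # one visit, one software intervention
print(fixed_pipeline({"decrypt"}))  # four visits, three of them unnecessary
```

Under these assumptions, the non-pipelined model visits only the required stage but pays a software cost for each action, while the fixed pipeline needs no per-action software decision but sends the packet through stages it does not need.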
Read latency and overall read throughput to storage devices with sequential access penalties, particularly memories external to a system on chip (SoC), can be performance bottlenecks for the SoC. For example, an external memory might include two or more substructures (e.g., multiple banks of DRAM). In such a system, a latency penalty might be incurred for sequential read requests to the same memory substructure. Several mechanisms have been developed for addressing this bottleneck. One mechanism queues read operations or requests (“read requests”) destined for each individual memory substructure and then selects read requests for non-busy substructures from one or more queues. Queuing works well when these read requests are spread evenly among the memory substructures, but fails if all the read requests target a particular substructure. Another mechanism duplicates the entire data structure, storing one copy in each of several substructures, and then selects a non-busy substructure as the target of the read request. This mechanism works well and overcomes some of the shortcomings of the queuing mechanism, but either i) the amount of data that the memory can store is reduced to a fraction equal to the inverse of the number of copies, regardless of whether all of the data benefits from the duplication, or ii) the memory required increases as a multiple of the number of copies.
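The behavior of the queuing mechanism can be illustrated with a minimal simulation. The class name, the fixed same-bank penalty expressed in cycles, and the lowest-numbered-bank-first selection policy are assumptions made for illustration; an actual memory controller would differ in all of these details.

```python
from collections import deque

class BankedReadScheduler:
    """Hypothetical model: one FIFO of pending read requests per memory bank."""

    def __init__(self, num_banks, busy_cycles):
        self.queues = [deque() for _ in range(num_banks)]
        self.busy_until = [0] * num_banks  # cycle at which each bank is free again
        self.busy_cycles = busy_cycles     # assumed same-bank penalty, in cycles

    def submit(self, bank, request_id):
        self.queues[bank].append(request_id)

    def issue(self, now):
        """Issue at most one read per cycle, from a non-busy bank with pending work."""
        for bank, queue in enumerate(self.queues):
            if queue and self.busy_until[bank] <= now:
                self.busy_until[bank] = now + self.busy_cycles
                return bank, queue.popleft()
        return None  # every bank with pending work is still busy

def drain(scheduler, total):
    """Return the number of cycles needed to issue `total` queued requests."""
    cycle = issued = 0
    while issued < total:
        if scheduler.issue(cycle) is not None:
            issued += 1
        cycle += 1
    return cycle

# Eight reads spread evenly over four banks.
spread = BankedReadScheduler(num_banks=4, busy_cycles=4)
for i in range(8):
    spread.submit(i % 4, i)

# Eight reads that all target bank 0.
hot = BankedReadScheduler(num_banks=4, busy_cycles=4)
for i in range(8):
    hot.submit(0, i)

print(drain(spread, 8), drain(hot, 8))  # prints "8 29"
```

With the assumed four-cycle penalty, the evenly spread workload issues one read per cycle, whereas the workload that targets a single bank is limited to one read every four cycles, which is the failure case noted above.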
In multi-threaded systems, multiple threads might have active functions to access data in one or more fragments of data in a single line entry of a data cache. It is desirable that these threads be able to access each fragment of data concurrently, so long as no two threads access the same fragment of data in the cache line entry. Typical approaches might ensure data coherency only on a cache line boundary, which might slow down function processing, due to head-of-line blocking for functions wanting to operate on non-overlapping fragments of data within the same cache line.
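The difference between coherency on a cache line boundary and coherency on a fragment boundary can be illustrated with a minimal sketch. The class name, the eight-fragment line, and the bitmask bookkeeping are assumptions made for illustration and do not describe any particular cache implementation.

```python
class CacheLineLocks:
    """Hypothetical tracker of which fragments of one cache line are claimed."""

    def __init__(self, fragments_per_line=8, per_fragment=True):
        self.full_mask = (1 << fragments_per_line) - 1
        self.per_fragment = per_fragment
        self.held = 0      # bitmask of fragments currently claimed
        self.owners = {}   # thread id -> bitmask of fragments that thread holds

    def try_acquire(self, thread_id, fragments):
        """Claim the given fragment indices; return False on any overlap."""
        mask = 0
        for f in fragments:
            mask |= 1 << f
        if not self.per_fragment:
            mask = self.full_mask  # line-granular: any access claims the whole line
        if self.held & mask:
            return False           # another claim overlaps; the caller must wait
        self.held |= mask
        self.owners[thread_id] = self.owners.get(thread_id, 0) | mask
        return True

    def release(self, thread_id):
        """Release every fragment held by the given thread."""
        self.held &= ~self.owners.pop(thread_id)

# Fragment-granular tracking: non-overlapping accesses proceed concurrently.
fine = CacheLineLocks(per_fragment=True)
print(fine.try_acquire("A", [0, 1]), fine.try_acquire("B", [2, 3]))      # True True

# Line-granular tracking: the second thread is blocked despite no overlap.
coarse = CacheLineLocks(per_fragment=False)
print(coarse.try_acquire("A", [0, 1]), coarse.try_acquire("B", [2, 3]))  # True False
```

In the line-granular case, thread B is blocked even though it touches different fragments than thread A, which corresponds to the head-of-line blocking described above.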