Parallel processing is a fundamental technique for improving hardware performance. However, once a sequence of jobs is split up for parallel processing, the original ordering of the sequence may be lost. If the ordering is important, software or hardware must synchronize each job with the other jobs before releasing it to the next processing stage.
In a network processor, when sequences of packets are dispatched to multiple processors for parallel processing, the processing time for each packet can vary with the packet type. To reduce the time processors sit idle waiting to synchronize with other packets in the same sequence, a reordering hardware block is employed to maintain packet ordering and to provide a synchronization point in the system. Once a processor has processed a packet, it flushes the packet's identifier to the reordering block. The reordering block may hold the identifier until all older packets have been processed, thereby reestablishing the original ordering.
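The hold-and-release behavior of the reordering block can be illustrated with a small software model. This is a sketch, not the hardware design: it assumes (beyond what the text states) that packets carry consecutive sequence numbers starting at 0, and it uses a min-heap in place of the hardware linked list. The class name `ReorderBlock` and its `flush` method are hypothetical.

```python
import heapq


class ReorderBlock:
    """Software model of a reordering block for a single packet sequence.

    Assumption (not from the source): packets are tagged with consecutive
    sequence numbers 0, 1, 2, ... before they are dispatched to processors.
    """

    def __init__(self) -> None:
        self.next_seq = 0  # oldest sequence number not yet released
        self.held = []     # min-heap of flushed identifiers still on hold

    def flush(self, seq: int) -> list:
        """A processor finished packet `seq`; return identifiers now released in order."""
        heapq.heappush(self.held, seq)
        released = []
        # Release only while the oldest outstanding packet has been flushed.
        while self.held and self.held[0] == self.next_seq:
            released.append(heapq.heappop(self.held))
            self.next_seq += 1
        return released


# Processors finish packets out of order: 2, 0, 3, 1.
rb = ReorderBlock()
for seq in (2, 0, 3, 1):
    print(seq, "->", rb.flush(seq))
```

Here packet 2 is held until packet 0 is flushed, and packets 2 and 3 are held until packet 1 is flushed, at which point 1, 2, and 3 are released together in their original order.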
The reordering block is usually implemented as a hardware linked list. Before packets enter the multi-processor domain, a separate linked list is established for each packet sequence, also referred to as a context. Typical linked list operations include: 1) enqueue, which allocates a node; 2) dequeue, which deallocates a node; 3) walk the chain (WTC), which seeks the next node of a linked list; and 4) flush, which indicates that processing of a node is complete and the node can be dequeued.
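The four linked list operations can be sketched in software as per-context lists that share one node pool. The layout below (a `next_ptr` array, a `done` flag per node, a free list, and head/tail pointers per context) is a hypothetical illustration of how such a structure might be organized, not a description of any particular hardware; the names `LinkedListReorder`, `release_ready`, and `NIL` are assumptions.

```python
from dataclasses import dataclass

NIL = -1  # hypothetical "null pointer" value for the node pool


@dataclass
class Context:
    head: int = NIL
    tail: int = NIL


class LinkedListReorder:
    """Per-context linked lists sharing one node pool (hypothetical layout)."""

    def __init__(self, num_nodes: int, num_contexts: int) -> None:
        self.next_ptr = [NIL] * num_nodes   # per-node "next" field
        self.done = [False] * num_nodes     # per-node "flushed" flag
        self.free = list(range(num_nodes))  # free list of unallocated nodes
        self.ctx = [Context() for _ in range(num_contexts)]

    def enqueue(self, c: int) -> int:
        """Allocate a node at the tail of context `c` (packet arrival order)."""
        node = self.free.pop(0)
        self.next_ptr[node] = NIL
        self.done[node] = False
        ctx = self.ctx[c]
        if ctx.tail == NIL:
            ctx.head = node
        else:
            self.next_ptr[ctx.tail] = node
        ctx.tail = node
        return node

    def flush(self, node: int) -> None:
        """Mark processing of `node` as complete so it may be dequeued."""
        self.done[node] = True

    def wtc(self, node: int) -> int:
        """Walk the chain: return the node after `node`, or NIL at the end."""
        return self.next_ptr[node]

    def dequeue(self, c: int) -> int:
        """Deallocate the head node of context `c` if it has been flushed."""
        ctx = self.ctx[c]
        node = ctx.head
        if node == NIL or not self.done[node]:
            return NIL  # head still in progress: everything behind it is held
        ctx.head = self.wtc(node)
        if ctx.head == NIL:
            ctx.tail = NIL
        self.free.append(node)
        return node

    def release_ready(self, c: int) -> list:
        """Dequeue every flushed node at the head of context `c`, in order."""
        released = []
        while (node := self.dequeue(c)) != NIL:
            released.append(node)
        return released


ll = LinkedListReorder(num_nodes=8, num_contexts=2)
a, b, c = ll.enqueue(0), ll.enqueue(0), ll.enqueue(0)
ll.flush(b)
print(ll.release_ready(0))  # b is flushed but held behind a
ll.flush(a)
print(ll.release_ready(0))  # now a and b leave in order
```

In this sketch, dequeue only ever removes the head of a context, and it uses WTC to find the new head, so a node flushed early simply waits until every older node in its context has also been flushed.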
In many cases, simple packet ordering is necessary but not sufficient, because critical code segments across multiple packets must be executed serially to perform the required function. This calls for a mechanism that not only maintains packet ordering but also: 1) partitions the input packet stream into substreams, so that subsequent packet ordering only needs to be maintained within each substream; and 2) serializes critical code segments within a substream without hindering parallelism in non-critical code segments.
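The two requirements can be illustrated with a software analogue using threads. This is only an illustration of the requirements, not the reordering-block mechanism itself: it assumes each packet is already tagged with a substream and a per-substream sequence number, and the names `SubstreamSerializer`, `critical`, and `run_demo` are hypothetical. Non-critical work runs fully in parallel; a packet waits only for older packets of its own substream before entering its critical segment.

```python
import threading


class SubstreamSerializer:
    """Run critical segments serially, in arrival order, within each substream.

    Hypothetical sketch: each packet is assumed to carry (substream, seq),
    where seq counts 0, 1, 2, ... within its substream.
    """

    def __init__(self) -> None:
        self.cv = threading.Condition()
        self.turn = {}  # substream -> seq of the packet allowed to enter next

    def critical(self, substream, seq, action) -> None:
        with self.cv:
            # Wait only for older packets of the same substream.
            self.cv.wait_for(lambda: self.turn.get(substream, 0) == seq)
        action()  # runs outside the lock, so other substreams are not blocked
        with self.cv:
            self.turn[substream] = seq + 1
            self.cv.notify_all()


def run_demo():
    packets = [("A", 0), ("B", 0), ("A", 1), ("B", 1), ("A", 2)]
    ser = SubstreamSerializer()
    log, log_lock = [], threading.Lock()

    def worker(substream, seq):
        _ = sum(range(10_000))  # non-critical segment: no ordering constraint

        def record():
            with log_lock:
                log.append((substream, seq))

        ser.critical(substream, seq, record)

    # Start threads in reverse arrival order to provoke reordering.
    threads = [threading.Thread(target=worker, args=p) for p in reversed(packets)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


print(run_demo())
```

The interleaving between substreams A and B may differ from run to run, but within each substream the critical segments always execute in sequence-number order.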
Addressing these two requirements calls for a “reassign” function. During packet processing, a program can decide to move a packet from its original context to a new one; this is referred to as a “reassign.” The function must run at packet rate and must be able to execute multiple times per packet. A reassign command can be broken down into a dequeue operation from the original context followed by an enqueue operation to the new context.
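The decomposition of a reassign into a dequeue and an enqueue can be shown with a toy model in which each context is an ordered queue of packet identifiers. The class `ContextTable`, its `ops` log, and the context names are hypothetical; for simplicity the model assumes the reassigned packet joins the tail of the new context and allows removal from anywhere in the original context.

```python
from collections import deque


class ContextTable:
    """Toy model: each context is an ordered queue of packet ids.

    Hypothetical names; `ops` records the primitive list operations issued,
    to show that one reassign costs two of them.
    """

    def __init__(self) -> None:
        self.contexts = {}
        self.ops = []

    def enqueue(self, ctx, pkt) -> None:
        self.ops.append(("enqueue", ctx, pkt))
        self.contexts.setdefault(ctx, deque()).append(pkt)  # join at the tail

    def dequeue(self, ctx, pkt) -> None:
        self.ops.append(("dequeue", ctx, pkt))
        self.contexts[ctx].remove(pkt)  # leave the original context

    def reassign(self, pkt, old_ctx, new_ctx) -> None:
        # reassign = dequeue from the original context + enqueue to the new one
        self.dequeue(old_ctx, pkt)
        self.enqueue(new_ctx, pkt)

    def order(self, ctx) -> list:
        return list(self.contexts.get(ctx, ()))


t = ContextTable()
for p in ("p0", "p1", "p2"):
    t.enqueue("ingress", p)
t.ops.clear()
t.reassign("p0", "ingress", "flow-7")
t.reassign("p1", "ingress", "flow-7")
print(t.order("ingress"), t.order("flow-7"), len(t.ops))
```

Two reassigns leave `p2` alone in the original context, place `p0` and `p1` in the new context in the order they were reassigned, and issue four primitive operations, two per reassign.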
There are several techniques for implementing a reordering block that manages typical linked list operations and performs the reassign function. The first technique uses registers to implement the linked list structures, which allows multiple linked list operations to be performed in parallel. In addition, each linked list may need separate state control logic. The second technique uses a single-port memory to implement the linked list structures. A single-port memory can access only one entry per cycle, so the processing pipeline can, at best, perform a single operation per cycle.
The register technique provides a high-performance solution, but as the number of contexts and nodes increases, it becomes cost-ineffective and does not scale. The single-port memory technique cannot exceed one operation per cycle, even when fully pipelined. To support reassign commands, the pipeline would need to stall to absorb the two commands that a reassign generates: the enqueue to the new context and the WTC command for the original context. If every packet must be reassigned once, performance is cut in half. Therefore, the single-port memory technique cannot support even one reassign per packet at packet rate, let alone multiple reassigns.
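The halving can be checked with back-of-the-envelope arithmetic. The model below is an assumption chosen to reproduce the statement above: a fully pipelined single-port design spends one cycle per packet, and each reassign stalls the pipeline for one additional cycle. The function name and its parameters are hypothetical.

```python
from fractions import Fraction


def packets_per_cycle(reassigns_per_packet: int,
                      base_cycles: int = 1,
                      stall_cycles_per_reassign: int = 1) -> Fraction:
    """Throughput of a fully pipelined single-port design, in packets per cycle.

    Assumed model: each packet needs `base_cycles` pipeline slots, and each
    reassign stalls the pipeline for `stall_cycles_per_reassign` extra cycles.
    The defaults reproduce the halving for one reassign per packet.
    """
    cycles = base_cycles + reassigns_per_packet * stall_cycles_per_reassign
    return Fraction(1, cycles)


for r in range(3):
    print(r, packets_per_cycle(r))
```

Under this model, throughput falls from 1 packet per cycle with no reassigns to 1/2 with one reassign per packet and 1/3 with two, which is why a single-port design cannot sustain reassigns at packet rate.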