1. Field
This application relates to communication networks and, more particularly, to a method for thread reduction in a multi-thread packet processor.
2. Description of the Related Art
Data communication networks may include various switches, nodes, routers, and other devices coupled to and configured to pass data to one another. These devices will be referred to herein as “network elements”. Data is communicated through the data communication network by passing protocol data units, such as frames, packets, cells, or segments, between the network elements by utilizing one or more communication links. A particular protocol data unit may be handled by multiple network elements and cross multiple communication links as it travels between its source and its destination over the network.
When a packet is received by a network element, the network element will process the packet and forward the packet on to its destination. To accelerate packet processing, a multi-thread packet processor may be used in which an execution pipeline is used to process packets and each packet is assigned to a thread. Each thread processes a packet and has its own dedicated context, such as a program counter, link registers, address registers, data registers, local memories, etc. To increase performance, two or more execution pipelines may be used to process packets in parallel.
A programmable/microcodeable fine-grained multi-threaded packet processor may be viewed as a single physical execution pipeline shared among multiple threads. Packet processors of this class do not implement any bypass pipeline stages, thus eliminating pipeline hazards such as resource conflicts, branch delays, pipeline stalls, etc. This means, however, that when a thread dispatches an instruction into the execution pipeline, it must retire that instruction before it dispatches the next instruction into the pipeline. If an instruction requires multiple cycles, the thread will go idle until the instruction has been retired. To accelerate particular operations and minimize the amount of idle time each thread spends waiting for particular instructions to be retired, the multi-threaded packet processor may incorporate dedicated hardware accelerators. For example, the multi-threaded packet processor may include hardware accelerators for key lookup operations, such as to perform MAC address lookup operations, perform IP address lookup operations, and implement n-tuple filters.
In a pipeline processor of this nature, not all stages of the pipeline take the same amount of time. For example, a memory lookup operation (e.g. a key lookup operation) may take many cycles to return a value. Indeed, one challenge with key lookup operations is the numerous memory accesses that are needed to perform the operation, which lead to large latencies. One way to hide the coprocessor latency associated with implementing key lookup operations is to increase the number of threads per execution pipeline. Unfortunately, since each thread is heavy on context, increasing the number of threads leads to more logic and larger design implementations.
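The relationship between thread count and hidden latency can be illustrated with a minimal simulation sketch. It assumes a deliberately simplified model that is not taken from this application: every instruction takes the same fixed number of cycles from dispatch to retirement, at most one instruction is dispatched per cycle, and a thread may dispatch again only once its previous instruction has retired. The function name and parameters are hypothetical.

```python
def dispatch_utilization(num_threads, latency, cycles):
    """Fraction of cycles in which an instruction is dispatched.

    Simplified model (an assumption for illustration only): every
    instruction retires `latency` cycles after it is dispatched, and a
    thread cannot dispatch again until its previous instruction retires.
    """
    # ready_at[t] is the first cycle at which thread t may dispatch again.
    ready_at = [0] * num_threads
    dispatched = 0
    for cycle in range(cycles):
        # Pick the first ready thread, if any; otherwise the slot is a bubble.
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + latency
                dispatched += 1
                break
    return dispatched / cycles


# With too few threads, half of the dispatch slots are bubbles.
print(dispatch_utilization(num_threads=4, latency=8, cycles=800))  # 0.5
# With as many threads as cycles of latency, every slot is filled.
print(dispatch_utilization(num_threads=8, latency=8, cycles=800))  # 1.0
```

Under these assumptions, utilization rises with the number of threads until the thread count equals the instruction latency in cycles, which is why longer lookups call for more threads.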
The minimum number of threads needed to completely hide coprocessor latency and still maintain the full instruction dispatch rate into the execution pipeline without any empty pipe stages (no bubbles) may be calculated as T ≥ n + m*Σc_i, where T is the number of threads required to fill the pipeline, n is the number of pipe stages in the main execution pipeline, m is the number of key lookup coprocessors, and c_i is the number of pipe stages, including the total memory accesses, per coprocessor. For example, assume that a pipeline has 20 stages and that one of the stages takes a coprocessor 10 cycles to complete (n = 20, m = 1, c_1 = 10). To fully fill the pipeline, such that one packet is being output from the pipeline during every cycle, 30 threads would be required.
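The thread-count bound above can be evaluated with a short sketch. The function name and argument names are hypothetical, and the code simply evaluates the expression n + m*Σc_i as written, with the per-coprocessor stage counts supplied as a list.

```python
def min_threads(main_pipe_stages, num_coprocessors, coprocessor_stages):
    """Evaluate the bound T >= n + m * sum(c_i) as stated in the text.

    main_pipe_stages   -- n, pipe stages in the main execution pipeline
    num_coprocessors   -- m, number of key lookup coprocessors
    coprocessor_stages -- list of c_i values, pipe stages (including
                          memory accesses) per coprocessor
    """
    return main_pipe_stages + num_coprocessors * sum(coprocessor_stages)


# Worked example from the text: 20 main stages, one coprocessor
# that takes 10 cycles.
print(min_threads(20, 1, [10]))  # 30
```

With no coprocessors (m = 0), the bound reduces to the number of main pipeline stages.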
Increasing the number of threads has a further impact on the amount of time it takes to process any given packet (latency). Specifically, if a thread goes idle for a number of cycles while a coprocessor is executing a lookup, this idle time increases the amount of time it takes to process the packet and, accordingly, increases the latency experienced by the packet in the network element. Accordingly, it would be advantageous to reduce the number of threads required to fill a packet processing pipeline of a multi-thread packet processor.