Computer networks, as well as other communication networks, routinely exchange information in units, commonly referred to as packets, that conform to a known format. For example, a network may exchange packets associated with the TCP/IP protocols. See, e.g., Internet Engineering Task Force Request for Comments (IETF RFC) 791, Internet Protocol, and IETF RFC 793, Transmission Control Protocol. In addition to packets, other network data may also be exchanged over a computer network, including host requests/descriptors, timeout requests, and transmission requests, as well as others.
Generally, operations performed on network data include receive processing, transmit processing, and timer processing. For example, when a packet is received at an end node of a network, such as a server or server cluster (or an intermediate device such as a router), one or more operations are usually performed with respect to the received packet. This packet processing may include, by way of example, accessing source and destination addresses and/or port numbers in the header of a received packet to make classification and/or routing decisions, performing flow control operations (e.g., sending a TCP ACK packet or performing error detection), queuing, and carrying out procedures associated with establishing or tearing down a connection.
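By way of illustration only, the header access and classification step described above may be sketched as follows. The sketch assumes an IPv4 packet carrying TCP, with the fixed header layout of IETF RFC 791 and the port fields of IETF RFC 793. The function names parse_ipv4_tcp and classify, the flow table, and the sample addresses and ports are hypothetical and do not come from the text above.

```python
import ipaddress
import struct

TCP_PROTOCOL = 6  # IPv4 protocol number assigned to TCP


def parse_ipv4_tcp(packet: bytes):
    """Return (src_ip, dst_ip, src_port, dst_port) for an IPv4/TCP packet."""
    if len(packet) < 20:
        raise ValueError("too short for an IPv4 header")
    # Fixed 20-byte IPv4 header, network byte order (RFC 791).
    (ver_ihl, _tos, _total_len, _ident, _frag,
     _ttl, proto, _checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    version = ver_ihl >> 4
    header_len = (ver_ihl & 0x0F) * 4  # IHL counts 32-bit words
    if version != 4 or header_len < 20:
        raise ValueError("not a valid IPv4 header")
    if proto != TCP_PROTOCOL:
        raise ValueError("payload is not TCP")
    if len(packet) < header_len + 4:
        raise ValueError("truncated TCP header")
    # The first four bytes of the TCP header hold the two ports (RFC 793).
    src_port, dst_port = struct.unpack("!HH", packet[header_len:header_len + 4])
    return (str(ipaddress.IPv4Address(src)),
            str(ipaddress.IPv4Address(dst)),
            src_port, dst_port)


def classify(packet: bytes, flow_table: dict) -> str:
    """Look up the packet's address/port tuple in a (hypothetical) flow table."""
    return flow_table.get(parse_ipv4_tcp(packet), "default")


# Hypothetical sample: a 20-byte IPv4 header followed by a 20-byte TCP header.
ip_header = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0, 0, 64, TCP_PROTOCOL, 0,
                        bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
tcp_header = struct.pack("!HH", 12345, 80) + bytes(16)
sample = ip_header + tcp_header
flow_table = {("10.0.0.1", "10.0.0.2", 12345, 80): "web"}
print(classify(sample, flow_table))
```

Note that the flow table lookup in such a sketch is one of the memory accesses that, for a large table held in off-chip memory, may be slow relative to the surrounding computation.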
Traditionally, processing of packets and other network data was performed on a general purpose processor supporting a single execution thread. Often, the processing of a packet requires a large number of accesses to system memory or other off-chip memory. The single threaded processor performs operations sequentially and, therefore, it can stall during memory accesses and other slow operations while waiting for such operations to complete. Each stall in the processing of a packet due to a memory access (or other relatively slow process) wastes a significant number of clock cycles. The combination of a large number of unutilized clock cycles with the sequential nature of a single threaded processor creates an inefficient scheme for handling packets and other network data.
Processor clock cycles are shrinking at a much greater rate than memory access latencies. Thus, the number of clock cycles that may be wasted during a stall (e.g., for a memory access) is rapidly increasing. As a result, the efficiency of packet processing, and of the processing of other network data, on general purpose CPUs (central processing units) is rapidly decreasing, and the computational potential offered by high frequency processing devices goes unutilized. As equipment vendors strive to increase the speed and performance of general purpose computers (e.g., a client or server), the effects of the above-described failure to harness the abilities of high speed processors are becoming more profound.
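The relationship between clock rate and wasted cycles may be illustrated with a short calculation. The figures used here (a 100 ns memory latency, clock rates of 1 GHz and 3 GHz, 500 compute cycles and 10 memory accesses per packet) are assumptions chosen only for illustration and are not taken from any measurement.

```python
def stall_cycles(latency_ns: float, clock_ghz: float) -> float:
    """Clock cycles that elapse during one stall of the given latency."""
    # 1 GHz corresponds to one cycle per nanosecond.
    return latency_ns * clock_ghz


def utilization(compute_cycles: float, accesses: int, cycles_per_stall: float) -> float:
    """Fraction of cycles doing useful work on a single threaded processor."""
    total = compute_cycles + accesses * cycles_per_stall
    return compute_cycles / total


# Assumed figures: the memory latency stays fixed while the clock rate triples.
for clock_ghz in (1.0, 3.0):
    per_stall = stall_cycles(100.0, clock_ghz)
    share = utilization(500, 10, per_stall)
    print(f"{clock_ghz} GHz: {per_stall:.0f} cycles per stall, {share:.0%} utilization")
```

Under these assumptions, tripling the clock rate triples the cycles lost per stall (from 100 to 300) and lowers the useful share of cycles from about one third to about one seventh.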
The traditional method of using caches to reduce the frequency of memory accesses in application code is not very effective for processing packets and other network data due to a very low re-use of cached parameters and data. Also, conventional software multi-threading schemes do not provide a viable solution for the processing of network data. Today's multitasking operating systems (OS) may utilize methods of software multi-threading to share one or more processor execution threads between the many programs that may be executing on a computer simultaneously. However, OS multi-threading exhibits a very high overhead (e.g., thousands of clock cycles) because the OS implements a software scheduler common to all programs running on a system and, therefore, the OS has to deal not only with switching between threads but also with swapping program operating environments and contexts in and out of the CPU hardware in order to support the threads. Thus, due to these high thread switching latencies that typically consume thousands of clock cycles on conventional CPUs, such software multi-threading schemes cannot be utilized to hide memory accesses and other stalls that typically consume a few hundred clock cycles.
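The comparison between switching overhead and stall length may be sketched with a simplified model. The model assumes, optimistically, that every stall can be fully overlapped with work on another packet, so that each stall costs only one thread switch. The specific values (300 cycles per stall, 2000 cycles per OS thread switch, 500 compute cycles and 10 stalls per packet) are hypothetical, chosen to match the orders of magnitude stated above.

```python
def cycles_single_thread(packets: int, compute: int, stalls: int, stall_cycles: int) -> int:
    """Total cycles when the processor simply waits out every stall."""
    return packets * (compute + stalls * stall_cycles)


def cycles_with_switching(packets: int, compute: int, stalls: int, switch_cycles: int) -> int:
    """Total cycles when each stall is hidden by switching to another thread.

    Idealized assumption: another thread is always ready, so a stall
    costs only the switch overhead rather than the stall itself.
    """
    return packets * (compute + stalls * switch_cycles)


# Hypothetical workload: 1000 packets, 500 compute cycles and 10 stalls each.
waiting = cycles_single_thread(1000, 500, 10, stall_cycles=300)
os_switching = cycles_with_switching(1000, 500, 10, switch_cycles=2000)
print(waiting, os_switching)
```

In this model, switching threads is worthwhile only when the switch overhead is smaller than the stall it hides. With a switch cost of thousands of cycles against a stall of a few hundred, the switching case consumes several times more cycles than simply waiting.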
The masking of memory access latencies (and other stalls) experienced while processing packets and other network data may be achieved using multi-threaded hardware. Generally, hardware multi-threading schemes replicate certain hardware resources to facilitate parallel streams of execution. The use of multi-threaded processing hardware has been shown to be an effective method of hiding stalls for memory accesses and other slow operations. However, use of multi-threaded processors adds significant hardware complexity, while also increasing cost, real estate, and power consumption of the processing system. At the same time, these multi-threaded processors provide a significant performance advantage only for the few applications and operations that do not achieve effective use of cache. In addition, today's high volume (both in terms of manufacture and use) general purpose processors (e.g., those used for desktop and laptop computers) obtain less advantage from multi-threading than lower volume processors (e.g., those used for servers and workstations). Furthermore, it should be noted that, from a cost and power consumption standpoint, the use of lower complexity, higher volume, single threaded general purpose processing devices for desktop, laptop, server, workstation, and packet processing alike is desirable.
In the arena of network data processing, the use of a specialized packet processor supporting multiple threads of execution has been proposed. A multi-threaded packet processor can process multiple packets in parallel and very effectively reduce the performance cost of memory access stalls. However, these specialized packet processors suffer from many of the above-described disadvantages. Costs are increased due to added hardware complexity and lower volume markets. Further, in comparison to single threaded high volume processors (such as those used in desktop and laptop computers), the multi-threaded packet processor will have higher power requirements and increased cooling loads, and may be more difficult to program.