One of the biggest limiting factors preventing a network processor from scaling to meet internet bandwidth demand is Moore's law. Under Moore's law, advances in semiconductor process technology take about 18 months to deliver a 100% performance improvement. FIG. 1 shows Moore's law versus the internet bandwidth demand curve.
As shown in FIG. 1, doubling performance every 18 months falls far short of internet bandwidth demand, which doubles every four to six months. The current generation of network processors cannot scale by 4 times or 16 times within a two-to-three-year window to meet the demand in internet bandwidth. The lifetime of today's network processors is short because of their dependency on Moore's law. Breaking the Moore's law barrier is a non-trivial process.
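The gap between the two doubling periods can be made concrete with a short calculation. The following is a minimal sketch that assumes idealized exponential growth with the doubling periods stated above; the function name growth_factor is hypothetical and chosen only for illustration.

```python
def growth_factor(months_elapsed: float, doubling_period_months: float) -> float:
    """Return the multiplicative growth after months_elapsed for a quantity
    that doubles every doubling_period_months (idealized exponential model)."""
    return 2.0 ** (months_elapsed / doubling_period_months)


for years in (2, 3):
    months = 12 * years
    moore = growth_factor(months, 18)        # performance doubling every 18 months
    demand_slow = growth_factor(months, 6)   # demand doubling every 6 months
    demand_fast = growth_factor(months, 4)   # demand doubling every 4 months
    print(f"{years} years: Moore's law x{moore:.1f}, "
          f"demand x{demand_slow:.0f} to x{demand_fast:.0f}")
```

Under these assumptions, three years of Moore's law yields a 4 times improvement, while demand grows by 64 to 512 times over the same period.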
The current techniques in network processor architectures are bounded by Moore's law. In general there are three approaches to the network processor architecture: (1) using multiple reduced instruction set computing (“RISC”) processors, (2) using configurable hardware, and (3) using a mix of RISC and configurable hardware.
With regard to the first approach of using multiple RISC processors, the RISC processor architecture focuses on rapid and efficient processing of a relatively small set of simple instructions that includes most of the instructions a processor decodes and executes. The RISC processor architecture and instruction set are optimized for human-to-machine interaction. They are, however, not optimized for the high-bandwidth machine-to-machine interaction occurring in network equipment. Using multiple RISC processors within the network equipment will not deliver the processing power needed to meet the internet bandwidth demand. In this approach, another severe limiting factor is the complexity of the software compiler, scheduler, and kernel needed to efficiently control the processor's operation. Creating a new customized network processor operating system (“NPOS”) is not the solution to the explosive demand in bandwidth, especially when Moore's law (hardware) cannot even meet this demand. The NPOS requires significant software resources to architect, create, implement, test, support, and maintain. Use of the NPOS also results in significant performance degradation coupled with a non-deterministic architecture.
Use of configurable hardware results in the highest performance processor. In addition, the simple software interface usually used in configurable hardware minimizes performance degradation. Eliminating any software within the information path and replacing it with configurable gates and transistors significantly boosts the performance of the network processor. This approach, without any creativity within the architecture, is still bound by Moore's law.
Using a mix of RISC processors and configurable hardware has two different variations. The first variation uses the RISC processor in a portion of the data path and the other variation uses the RISC processor in the control path only.
Given the ever-increasing bandwidth demand, RISC processors should be removed from the data path because they are not designed to optimally process the high-bandwidth data traffic coming from network equipment. Currently, RISC processors are being used as graphics processors and digital signal processors (“DSPs”) and have been tailored to meet the demands of these applications. Unfortunately, the general nature of network traffic processing is completely different from graphics processing or digital signal processing, and the RISC processor architecture, which is based on techniques created decades ago, becomes a big burden for network traffic processing. For example, in a DSP, the execution unit processes at a rate that is orders of magnitude faster than the rate at which the data it operates on arrives (i.e., the execution unit can easily process the incoming data). In other words, the data is relatively static in comparison to the execution unit. This is the case in both graphics and digital signal processing. In contrast, the information, data, voice, and video entering at the ingress of a network processor travels at a very high speed, and the line rate grows in step with the bandwidth demand curve.
In addition, RISC processor operands are typically either 32 or 64 bits wide, but these sizes are not suitable for network traffic processing, where the information (operand) is much larger than 64 bits. In the prior art RISC processor architecture, the execution unit not only operates on short, fixed-size operands but also performs very simple and primitive functions such as load and store.
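The operand-size mismatch can be illustrated by counting how many register-sized operands are needed to cover one unit of network information. This is a rough sketch only: the helper name words_needed is hypothetical, and the 40-byte and 1500-byte sizes are illustrative values that do not come from the text above.

```python
def words_needed(element_bytes: int, register_bits: int) -> int:
    """Number of register-sized operands required to cover an information
    element of element_bytes bytes (rounded up to whole registers)."""
    element_bits = element_bytes * 8
    # Ceiling division using integer arithmetic only.
    return -(-element_bits // register_bits)


# Illustrative sizes (assumptions): a 40-byte header and a 1500-byte packet.
for size in (40, 1500):
    print(f"{size} bytes: {words_needed(size, 32)} x 32-bit operands, "
          f"{words_needed(size, 64)} x 64-bit operands")
```

Each of those operands would have to be moved with primitive load and store instructions before any networking function could act on the element as a whole.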
The typical RISC instruction set is designed to process algorithms. Many critical networking functions cannot efficiently utilize the arithmetic logic unit found in RISC processors. As a result, in addition to providing low performance when performing networking functions, these arithmetic logic units waste silicon space. Moreover, the RISC instruction set is optimized for register-to-register operations. The performance of memory and input and output (“I/O”) operations is orders of magnitude behind the performance of register-to-register operations. When processing network traffic, the performance of memory and I/O operations is as important as, or more important than, the performance of register-to-register operations.
When RISC processors are used in networking applications, they do not take advantage of the memory hierarchy of the RISC processor (e.g., in a RISC processor, the memory hierarchy may include a cache memory, main memory, etc.) that is optimized for memory locality. In networking applications, the traffic flows through the RISC processor without any locality. Placing a RISC processor in the data path causes only a small number of registers within the processor to be used by the traffic in the data path. In this case, the memory performance is almost as bad as the I/O performance.
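The effect of missing locality can be sketched with a toy cache model. The following is an illustration under stated assumptions, not a model of any particular processor: the cache is a small least-recently-used (LRU) cache, the function name hit_rate is hypothetical, and the two access patterns are invented to contrast an algorithmic loop with traffic that streams through once.

```python
from collections import OrderedDict


def hit_rate(addresses, cache_lines: int) -> float:
    """Fraction of accesses served by a toy LRU cache holding cache_lines entries."""
    cache = OrderedDict()
    hits = 0
    for address in addresses:
        if address in cache:
            hits += 1
            cache.move_to_end(address)      # mark as most recently used
        else:
            cache[address] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)   # evict the least recently used entry
    return hits / len(addresses)


# Assumed workloads: a loop that reuses 8 addresses, and a stream that never reuses any.
looping = [i % 8 for i in range(1000)]
streaming = list(range(1000))
print("loop with locality:", hit_rate(looping, cache_lines=16))
print("streaming traffic: ", hit_rate(streaming, cache_lines=16))
```

In this toy model the looping workload is served almost entirely from the cache, while the streaming workload never hits, which mirrors the point that traffic flowing through the processor gains nothing from a hierarchy built for locality.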
Minimizing or eliminating context switching is important when processing dynamic traffic patterns of multiple streams and multiple services. Context switching is the act of turning the processor's resources from one task to another. An additional problem of using RISC processors in the data path is the context-switching penalty. When multiple processes share the same processor, the small register set and register window of the processor cause frequent context switching. The frequent context switching takes usable bandwidth away from the processor. In networking functions, thousands of unpredictable traffic streams enter the processor and utilize different services, so different processing units are invoked, which, when using the RISC processor, results in a large number of context switches.
In addition to taking up otherwise useful processing bandwidth, context switching introduces a non-deterministic nature when processing networking functions. The non-deterministic nature includes, for example, not being able to predict or know when a packet will be output from the egress point. It is desirable that the processing of real time networking functions be deterministic.
FIG. 2 shows the processing and context switching occurring in a prior art RISC processor 200 performing networking functions. Here, an information element 204 (the information element is described below) belonging to a first flow is processed by a process 205. The process 205 executes a primitive instruction set 202 such as “load”, “store”, “add”, and “sub” instructions to accomplish complex networking functions such as policing, encapsulation, forwarding, and switching. An information element 208 belonging to a second flow is processed by a process 207. Similar to the process 205, the process 207 also executes a primitive instruction set 210 such as “load”, “store”, “add”, and “sub” instructions.
Processes 205 and 207 use a common set of registers 211 to store information specific to that process. When the prior art processor changes from servicing process 205 to servicing process 207, a context switch occurs in which the information pertaining to process 205 is removed from the registers 211 and stored in a stack and the information pertaining to process 207 is moved into the registers 211. The context switch 213 results in a register swap 214. The register swap 214 is the act of replacing, in the registers 211, the data of the old process with the data of the new process (i.e., the data in the registers for the old process is saved and the data for the new process is loaded into the registers). Because an indeterminate number of context switches occur before either the process 205 or the process 207 completes, these processes are non-deterministic as their time for completion is unknown. In addition to this non-deterministic nature, the context switching of processes that is inherent within the prior art RISC processor adds a substantial number of non-productive clock cycles (i.e., clock cycles are wasted storing the register data of the old process and loading the data of the new process into the registers).
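The register swap described above can be sketched as a small simulation. This is a simplified model under stated assumptions, not the prior art processor itself: the register count of 8, the cost of one clock cycle per register saved or loaded, the 4 productive cycles per information element, and the names Process and simulate are all hypothetical.

```python
from dataclasses import dataclass, field

NUM_REGISTERS = 8  # assumed register file size, for illustration only


@dataclass
class Process:
    name: str
    # Stands in for the stack area where this process's registers are saved.
    saved_registers: list = field(default_factory=lambda: [0] * NUM_REGISTERS)


def simulate(arrivals, work_cycles: int = 4):
    """Run one process per arriving information element on shared registers.

    Returns (context_switches, overhead_cycles, useful_cycles), assuming one
    clock cycle per register saved or loaded.
    """
    processes = {name: Process(name) for name in set(arrivals)}
    registers = [0] * NUM_REGISTERS
    current = None
    switches = overhead = useful = 0
    for name in arrivals:
        nxt = processes[name]
        if nxt is not current:
            if current is not None:
                # Context switch: save the old process's registers to its stack.
                current.saved_registers = list(registers)
                overhead += NUM_REGISTERS
                switches += 1
            # Load the new process's data into the shared registers.
            registers = list(nxt.saved_registers)
            overhead += NUM_REGISTERS
            current = nxt
        registers[0] += work_cycles  # stand-in for productive work
        useful += work_cycles
    return switches, overhead, useful


interleaved = ["205", "207"] * 3           # elements of the two flows alternate
batched = ["205"] * 3 + ["207"] * 3        # same elements, grouped by flow
print("interleaved:", simulate(interleaved))
print("batched:    ", simulate(batched))
```

With the same six information elements, the interleaved arrival order spends far more cycles on register swaps than on productive work in this model, and the count depends entirely on the arrival pattern, which is not known in advance.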
As the number of flows supported increases, the number of different processes that the RISC processor supports also increases (each flow usually executes a different process since each flow uses a different service), resulting in the RISC processor performing more context switches. A flow is a connection between two end nodes in a connectionless protocol. The end nodes can be two computers or the software running on those computers. As more context switches occur, the performance of the RISC processor degrades due in part to the overhead involved with increased context switching. This overhead includes the time used for scheduling and the time used to perform the register swaps.
Currently, some network processor implementations employ the multiple RISC processor approach. In this approach, it is not clear whether parallel processing yields an actual increase in performance. Multiple RISC processors do not increase performance linearly because efficiency is lost to the bookkeeping and coordination that a multiple processor implementation requires. The multiple processor approach may serve aggregated traffic by intelligently distributing threads of traffic to different processors. Balancing each processor's load, however, is itself an expensive task for the processor to perform. Load balancing uses otherwise productive bandwidth, and it still will not provide enough processing power for a single heavy traffic stream, because the parallelism needed to spread such a stream across processors may not exist.
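The limitation for a single heavy traffic stream can be sketched as follows. This is an illustration under assumptions that are not in the text above: each flow is pinned to one processor by hashing its identifier, the function name distribute is hypothetical, and the flow names and packet counts are invented.

```python
import zlib


def distribute(flows: dict, num_processors: int) -> list:
    """Assign each flow to one processor by hashing its identifier and
    return the resulting packet load per processor."""
    loads = [0] * num_processors
    for flow_id, packets in flows.items():
        # Assumption: a flow stays on one processor, so its packets cannot be split.
        index = zlib.crc32(flow_id.encode()) % num_processors
        loads[index] += packets
    return loads


# Invented example: many light flows plus one heavy stream.
flows = {f"flow-{i}": 10 for i in range(100)}
flows["heavy-stream"] = 5000

for n in (2, 4, 8):
    loads = distribute(flows, n)
    print(f"{n} processors: busiest processor handles {max(loads)} packets")
```

However many processors are added, the busiest one still carries at least the entire heavy stream in this model, so the extra processors help only the aggregated light traffic.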
The increasing volume and evolving types of Internet applications have been demanding enhanced services, both in terms of performance and quality of service (“QoS”), from the Internet infrastructure. Best-effort service is the service currently used on the Internet. In best-effort service, everybody gets whatever service the network is able to provide. Best-effort service is not suitable for fast-growing applications such as continuous media, e-commerce, and several other business services. To provide better services to these important and expanding classes of applications, the Internet infrastructure should provide service differentiation.
The present invention pertains to a processor that overcomes the problems described earlier for processing network traffic. In addition, the processor provides deterministic behavior in processing real time network traffic.