1. Field of the Invention
This invention relates generally to a computer system and more particularly to a processor that operates on network traffic.
2. Description of the Related Art
FIG. 1 illustrates a prior art line card 100 and its components. In the line card 100, the fiber-optic line 118 is coupled to the optical module 103. The other end of the fiber-optic line 118 typically connects to an external router or another communications device. Among other functions, the optical module 103 converts the optical signal into an electrical signal and presents the electrical signal to the framer 106. The framer 106 performs functions such as framing, error checking, and statistics gathering. The framer 106 provides the framed information to a classifier 109 if the classifier 109 is present. The classifier 109 performs deeper and more complex classification than that provided by a network processor 112. For example, the classifier 109 may perform layer 5 through layer 7 classification. The network processor 112 processes the incoming information element and forwards it to the appropriate line card 100 within the system's backplane 121 using a switch fabric 115. Logically, the optical module 103 and the framer 106 perform the functions of layer 1 of the seven-layer Open Systems Interconnection (“OSI”) Reference Model, whereas the network processor 112 and the classifier 109 handle layers 2 through 7. Processing intelligence, power, and bandwidth capacity are the main differentiating factors among network processors.
One of the biggest limiting factors preventing the network processor 112 from meeting the internet bandwidth demand is Moore's law. Under Moore's law, advances in semiconductor process technology take roughly 18 months to deliver a 100% performance improvement. FIG. 2 shows Moore's law versus the internet bandwidth demand curve. As shown in FIG. 2, doubling every 18 months falls far short of the internet bandwidth demand, which doubles every four to six months. The current generation of network processors cannot scale by 4 times or 16 times within a two to three year window to meet the demand in internet bandwidth. The lifetime of today's network processors is short due to their dependency upon Moore's law. Breaking the Moore's law barrier is a non-trivial process.
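The gap between the two doubling rates can be checked with simple arithmetic. The following is an illustrative sketch only: the function name growth_factor and the two and three year windows are assumptions chosen for the example, while the doubling periods (18 months, and four to six months) are those stated above.

```python
def growth_factor(months_elapsed: float, doubling_period_months: float) -> float:
    """Return the multiplicative growth after months_elapsed for a fixed doubling period."""
    return 2.0 ** (months_elapsed / doubling_period_months)

# Doubling periods taken from the text, in months.
MOORE_DOUBLING = 18
DEMAND_DOUBLING_SLOW = 6
DEMAND_DOUBLING_FAST = 4

for years in (2, 3):
    months = 12 * years
    moore = growth_factor(months, MOORE_DOUBLING)
    demand_low = growth_factor(months, DEMAND_DOUBLING_SLOW)
    demand_high = growth_factor(months, DEMAND_DOUBLING_FAST)
    print(f"{years} years: Moore's law x{moore:.1f}, "
          f"demand x{demand_low:.0f} to x{demand_high:.0f}")
```

Under these assumptions, 18-month doubling yields a factor of 4 only after three years, whereas demand that doubles every six months already reaches a factor of 16 after two years.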
The current techniques in network processor architectures are bounded by Moore's law. In general there are three approaches to the network processor architecture: using multiple reduced instruction set computing (“RISC”) processors, using configurable hardware, and using a mix of RISC and configurable hardware.
For the first approach of using multiple RISC processors, the RISC processor architecture focuses on rapid and efficient processing of a relatively small set of simple instructions that includes most of the instructions a processor decodes and executes. The RISC processor architecture and instruction set are optimized for human-to-machine interaction. They are, however, not optimized for the high-bandwidth machine-to-machine interaction occurring in network equipment. With multiple RISC processors, it is not clear whether there is an actual increase in performance due to the parallel processing. The multiple RISC processors do not increase the performance in a linear fashion due to a decrease in efficiency incurred with the bookkeeping and coordination resulting from the multiple processor implementation. The multiple processor approach may serve aggregated traffic through intelligently distributing threads of traffic to different processors. The balancing of each processor's load itself is an expensive task for the processor to perform. The process of balancing the load uses otherwise productive bandwidth and will not provide enough horsepower for a single heavy traffic stream. The parallelism in such traffic may not exist.
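The limitation for a single heavy traffic stream can be illustrated with a toy load model. This sketch assumes, purely for illustration, that each flow is pinned to one processor by taking its flow identifier modulo the number of processors; the names assign_processor and per_processor_load, the flow identifiers, and the rates are hypothetical and are not taken from any particular prior art device.

```python
from typing import Dict, List

def assign_processor(flow_id: int, num_processors: int) -> int:
    """Pin a flow to one processor (assumed policy: flow id modulo processor count)."""
    return flow_id % num_processors

def per_processor_load(flow_rates: Dict[int, float], num_processors: int) -> List[float]:
    """Sum the offered rate of every flow onto the processor that the flow is pinned to."""
    load = [0.0] * num_processors
    for flow_id, rate in flow_rates.items():
        load[assign_processor(flow_id, num_processors)] += rate
    return load

# Aggregated traffic: eight light flows spread evenly over four processors.
light_flows = {flow_id: 1.0 for flow_id in range(8)}
# The same total rate carried by one heavy flow lands on a single processor.
heavy_flow = {0: 8.0}

print(per_processor_load(light_flows, 4))  # [2.0, 2.0, 2.0, 2.0]
print(per_processor_load(heavy_flow, 4))   # [8.0, 0.0, 0.0, 0.0]
```

In this model, adding processors helps the aggregated case but does nothing for the single heavy stream, because that stream offers no parallelism to distribute.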
When using multiple RISC processors, another severe limiting factor is the complexity of the software compiler, scheduler, and kernel needed to efficiently control the processors' operation. Creating a new customized network processor operating system (“NPOS”) is not the solution to the explosive demand in bandwidth, especially when Moore's law (hardware) cannot even meet this demand. Use of the NPOS requires significant software resources to architect, create, implement, test, support, and maintain it. Use of the NPOS also results in significant performance degradation coupled with a non-deterministic architecture.
For the second approach, use of configurable hardware results in the highest performance processor. In addition, the simple software interface usually used in configurable hardware minimizes performance degradation. Eliminating any software within the information path and replacing it with configurable gates and transistors significantly boosts the performance of the network processor. This approach, without any creativity within the architecture, is still bounded by Moore's law.
For the third approach, use of a mix of RISC processors and configurable hardware has two variations. The first variation uses the RISC processor in a portion of the data path, and the second variation uses the RISC processor in the control path only. In the first variation, the RISC processor placed in the data path does not optimally process the high-bandwidth data traffic coming from network equipment because the RISC processor is not designed for this purpose. Currently, RISC processors are being used as graphics processors and digital signal processors (“DSPs”) and have been tailored to meet the demands of these applications. Unfortunately, the general nature of network traffic processing is completely different from graphics processing or digital signal processing, and the RISC processor architecture, which is based on techniques created decades ago, becomes a big burden for network traffic processing. For example, in a DSP, the execution unit operates at a rate that is orders of magnitude faster than the rate at which the data it processes arrives (i.e., the execution unit can easily keep up with the incoming data). In other words, the data is relatively static in comparison to the execution unit. This is the case in both graphics and digital signal processing. In contrast, the information, data, voice, and video entering at the ingress of a network processor travel at a very high speed, and the growth rate of the line rate correlates with the bandwidth demand curve.
In addition, RISC processor operands are typically either 32 or 64 bits wide, but these sizes are not suitable for network traffic processing, where the information (operand) is much larger than 64 bits. In the prior art RISC processor architecture, the execution unit not only operates on short, fixed-size operands but also has a simple and primitive instruction set that performs functions such as load and store. The typical RISC instruction set is designed to process algorithms. Many critical networking functions cannot efficiently utilize the arithmetic logic unit found in RISC processors. As a result, in addition to providing low performance when performing networking functions, these arithmetic logic units waste silicon space. Moreover, the RISC instruction set is optimized for register-to-register operations. The performance of memory and input and output (“I/O”) operations is orders of magnitude behind the performance of register-to-register operations. When processing network traffic, the performance of memory and I/O operations is as important as, or more important than, that of register-to-register operations.
When RISC processors are used in the data path, they do not take advantage of the memory hierarchy of the RISC processor (e.g., in a RISC processor, the memory hierarchy may include a cache memory, main memory, etc.) that is optimized for memory locality. In networking applications, the traffic flows through the RISC processor without any locality. Placing a RISC processor in the data path causes only a small number of registers within the processor to be used by the traffic in the data path. In this case, the memory performance is almost as bad as the I/O performance.
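The effect of missing locality on a memory hierarchy can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the cache is modeled as a least-recently-used cache of eight lines, and the function name hit_rate, the cache size, and the two address traces are hypothetical values chosen for illustration rather than measurements of any processor.

```python
from collections import OrderedDict

def hit_rate(addresses, cache_lines):
    """Return the fraction of accesses that hit in a toy LRU cache with cache_lines entries."""
    cache = OrderedDict()
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark the line as most recently used
        else:
            cache[addr] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses)

# An algorithm-style trace that keeps revisiting the same four addresses.
local_trace = [i % 4 for i in range(1000)]
# A traffic-style trace in which every address is touched once and never again.
streaming_trace = list(range(1000))

print(hit_rate(local_trace, 8))      # 0.996
print(hit_rate(streaming_trace, 8))  # 0.0
```

In this model the cache is effective only for the trace that revisits data; the streaming trace, which stands in for traffic passing through the data path, never hits.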
An additional problem of using RISC processors in the data path is the context-switching penalty. Context switching is the act of switching the processor's resources from one task to another. Minimizing or eliminating context switching is important when processing dynamic traffic patterns of multiple streams and multiple services. When multiple processes share the same processor, the small register set and window of the processor cause frequent context switching. The frequent context switching takes away usable bandwidth from the processor. In networking functions, thousands of unpredictable traffic streams enter the processor and utilize different services; thus different processing units are invoked, which, when using the RISC processor, results in a large number of context switches.
In addition to taking up otherwise useful processing bandwidth, context switching makes the processing of networking functions non-deterministic. This non-determinism includes, for example, not being able to predict or know when a packet will be output from the egress point. It is desirable that the processing of real time networking functions be deterministic. FIG. 3 shows the processing and context switching occurring in a prior art RISC processor 201 performing networking functions. Here, an incoming information element 204 (the information element is described below) belonging to a first flow is processed by a process 205. The process 205 executes a primitive instruction set 206 such as “load”, “store”, “add”, and “sub” instructions to accomplish complex networking functions such as policing, encapsulation, forwarding, and switching. Another incoming information element 208 belonging to a second flow is processed by a process 209. Similar to the process 205, the process 209 also executes a primitive instruction set 210 such as “load”, “store”, “add”, and “sub” instructions.
Processes 205 and 209 use a common set of registers 211 to store information specific to that process. When the prior art processor changes from servicing process 205 to servicing process 209, a context switch 212 occurs in which the information pertaining to process 205 is removed from the registers 211 and stored in a stack and the information pertaining to process 209 is moved into the registers 211. The context switch 212 results in a register swap 214. The register swap 214 is the act of replacing, in the registers 211, the data of the old process with the data of the new process (i.e., the data in the registers for the old process is saved and the data for the new process is loaded into the registers). Because an indeterminate number of context switches occur before either the process 205 or the process 209 completes, these processes are non-deterministic as their time for completion is unknown. In addition to this non-deterministic nature, the context switching of processes that is inherent within the prior art RISC processor adds a substantial number of non-productive clock cycles (i.e., clock cycles are wasted storing the register data of the old process and loading the data of the new process into the registers).
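The register swap and its non-productive cycles can be sketched with a toy model. The following is illustrative only: the class ToyProcessor, the register count of eight, and the cost of one cycle per register saved or loaded are assumptions made for the example; only the process numbers 205 and 209 and the save-then-load behavior come from the description above.

```python
REGISTER_COUNT = 8        # assumed size of the shared register set
CYCLES_PER_REGISTER = 1   # assumed cost to save or to load one register

class ToyProcessor:
    """Toy model of a processor whose processes share one register set."""

    def __init__(self):
        self.registers = [0] * REGISTER_COUNT
        self.stack = {}             # process id -> saved register contents
        self.current = None
        self.context_switches = 0
        self.wasted_cycles = 0

    def run(self, pid):
        """Make pid the running process, swapping registers if another process holds them."""
        if pid == self.current:
            return
        if self.current is not None:
            # Register swap: save the old process's data, then load the new one's.
            self.stack[self.current] = list(self.registers)
            self.context_switches += 1
            self.wasted_cycles += 2 * REGISTER_COUNT * CYCLES_PER_REGISTER
        self.registers = self.stack.pop(pid, [0] * REGISTER_COUNT)
        self.current = pid

cpu = ToyProcessor()
cpu.run(205)
cpu.registers[0] = 42   # state belonging to process 205
cpu.run(209)
cpu.registers[0] = 7    # state belonging to process 209
cpu.run(205)            # the state of process 205 is restored from the stack
print(cpu.registers[0], cpu.context_switches, cpu.wasted_cycles)  # 42 2 32
```

In this model every switch between the two processes costs sixteen cycles in which no networking function is performed, and the total grows with the number of switches rather than with the amount of useful work.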
As the number of flows supported increases, the number of different processes that the RISC processor supports also increases (each flow usually executes a different process since each flow uses a different service), resulting in the RISC processor performing more context switches. A flow is a connection between two end nodes in a connectionless protocol. The end nodes can be two computers or the software running on the computers. As more context switches occur, the performance of the RISC processor degrades due in part to the overhead involved with increased context switching. This overhead includes the time used for scheduling and the time used to perform the register swaps.
For the second variation, using a RISC processor in only the control path does not produce improved processor performance or overcome Moore's law without creativity in the architecture that processes the incoming network traffic.
The present invention pertains to a processor that overcomes the problems described earlier for processing network traffic. In addition, the processor provides deterministic behavior in processing real time network traffic.