Network processors are special-purpose devices designed to process packets and/or streaming data. The basic functionality of network processors is to classify packets, that is, to determine the type of each packet and where that packet should go. Network processors may have other functionality such as collecting statistics and performing security operations. Some provide additional functionality not normally associated with network processing such as traffic management (queuing) and packet memory.
Ideally, network processors can be used in a variety of applications ranging from core Internet routers to metro/aggregation routers to enterprise routers found within large corporations to firewall routers. Since network processors process packets that are, in essence, streaming data, network processors are likely to be useful for other sorts of streaming computation, from MPEG encoding/decoding to perhaps even database transaction processing.
Network processors can be implemented in a variety of ways. The original network processors were general-purpose processors running a networking application. General-purpose processors, however, are far too slow for many networking applications today. Current network processor architectures range from hardwired special-purpose hardware (Sandburst), to configurable special-purpose hardware (AMCC), to programmable systolic arrays (Xelerated), to one or more RISC cores supported by highly specialized co-processors or co-processor interfaces (Intel, Agere, Motorola, Avici, EZChip, Cisco). One could argue whether or not a hardwired solution that cannot be changed is a network processor at all, since it is not programmable. Regardless, more hardwired solutions are generally more power-efficient and silicon-efficient than more programmable solutions, since they reduce or eliminate the interpretive cost of instruction execution and can place computation close to the data rather than always bringing the data to the computation. More programmable solutions, however, are more flexible and less prone to performance cliffs, where performance drops off rapidly beyond a certain load.
Programmable network processors have the distinct advantage of being able to support new protocols by simply reloading new microcode. (Network processor code is traditionally called microcode due to the fact that most network processor code is low-level code such as assembly code.) Network processors also tend to allow one packet to consume cycles not used by another packet, replacing performance cliffs with a performance slope. It is sometimes the case, however, that the power cost of processing instructions rather than using hardwired functionality is prohibitively expensive.
Network processor microcode depends on the network processor it runs on. Many network processors offer a variant or restricted form of the C or C++ programming language for writing microcode. Almost all network processors also allow users to write assembly code directly, which is translated one-to-one into machine instructions that the network processor can interpret directly.
The number of instructions executed by a network processor to process a single packet varies widely between network processors and can also vary depending on the packets being processed. The Intel IXP2800, for example, has 16 micro-engines (each a small microprocessor with its own instruction store, registers and ability to access shared memory resources) running at up to 1.4 GHz. Since each micro-engine is theoretically capable of one instruction per cycle, the theoretical peak performance of such a processor is 22.4 G operations per second. (The theoretical peak is never reached in practice, since memory latencies reduce instructions per cycle to well below 1.) Since the Intel IXP2800 is a 10 Gb/sec capable processor, it is supposed to be able to process and queue 25M packets per second (minimum-sized packets are 40 B). Thus, each packet has a budget of almost 900 instructions.
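The budget arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the division described in the text; the function name `instruction_budget` and the `ipc` parameter are illustrative assumptions, and the input figures are the ones quoted above.

```python
def instruction_budget(engines, clock_hz, packets_per_sec, ipc=1.0):
    """Peak instructions available per packet.

    engines         -- number of micro-engines
    clock_hz        -- clock rate of each micro-engine
    packets_per_sec -- packet rate the processor must sustain
    ipc             -- assumed instructions per cycle per engine (1.0 = theoretical peak)
    """
    peak_ops_per_sec = engines * clock_hz * ipc
    return peak_ops_per_sec / packets_per_sec


# Figures quoted in the text for the Intel IXP2800.
ixp2800_peak = instruction_budget(engines=16, clock_hz=1.4e9, packets_per_sec=25e6)
print(round(ixp2800_peak))  # 896, i.e. "almost 900"

# Hypothetical: if memory latency halved the achieved instructions per cycle.
ixp2800_half_ipc = instruction_budget(16, 1.4e9, 25e6, ipc=0.5)
print(round(ixp2800_half_ipc))  # 448
```

The second call is purely illustrative; the text states only that achieved instructions per cycle fall well below 1, not by how much.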
The Avici Snare processor, on the other hand, runs at 100 MHz as a single micro-engine and is capable of processing packets at 2.4 Gb/sec, or about 6.25M packets per second. Thus, for Snare the instruction budget per packet is only about 14, substantially lower than that of Intel's processor. The IXP2800 is theoretically capable of packet queuing and buffering as well. But even after removing the instructions for queuing and buffering, the Intel network processor must execute substantially more instructions to process each packet.
The reason for the large difference in the number of instructions is the power of each instruction. In order for the Intel IXP2800 to implement a tree traversal, where a tree structure is stored in memory with each node of the tree either pointing to another node in the tree or to NULL, it must issue a load for the pointer in the first node in the tree, wait for that pointer to return, then use that pointer to issue the next read and so on. The Avici Snare, on the other hand, issues a single tree traversal command that returns only after a co-processor has traversed the tree. The Intel IXP2800 provides, for the most part, RISC-like instructions with RISC instruction power. The Avici Snare, on the other hand, has very powerful instructions customized for the tasks found in network processing applications.
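The contrast between the two styles can be illustrated with a toy model. The sketch below is not either vendor's instruction set; the tree layout, the function names, and the idea of counting "commands issued by the micro-engine" are all assumptions made for illustration. It shows only that a pointer-chasing walk costs one dependent load per node, while a traversal command costs the micro-engine a single issue regardless of tree depth.

```python
NULL = None

# Hypothetical tree in memory: address -> (value, child_if_bit_0, child_if_bit_1)
TREE = {
    0: ("default", 1, 2),
    1: ("port-A", NULL, 3),
    2: ("port-B", NULL, NULL),
    3: ("port-C", NULL, NULL),
}


def risc_style_lookup(tree, root, key_bits):
    """Micro-engine walks the tree itself: one dependent load per node."""
    loads = 0
    addr, value = root, None
    bits = iter(key_bits)
    while addr is not NULL:
        value, child0, child1 = tree[addr]  # issue a load, then wait for it
        loads += 1
        bit = next(bits, None)
        if bit is None:                     # key exhausted: stop at this node
            break
        addr = child1 if bit else child0    # next address depends on this load
    return value, loads


def coprocessor_lookup(tree, root, key_bits):
    """Micro-engine issues one command; the walk happens inside the co-processor."""
    value, _loads_done_by_coprocessor = risc_style_lookup(tree, root, key_bits)
    return value, 1


print(risc_style_lookup(TREE, 0, [0, 1]))   # ('port-C', 3)
print(coprocessor_lookup(TREE, 0, [0, 1]))  # ('port-C', 1)
```

Both functions return the same result; only the number of commands the micro-engine itself must issue (and wait on) differs.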
Thus, the microcode to implement the same functionality varies substantially between network processors. Because of the very small number of instructions that the Avici Snare executes, writing microcode for it tends to be fairly straightforward. Writing efficient microcode for the Intel processor, on the other hand, is generally considered a very difficult task. Thus, a customized instruction set also helps the programmers writing code for the network processor.
In either case, however, there are limits to what the network processors are capable of doing. Snare is capable of processing packets using the instructions it has. If another instruction becomes necessary for a future packet processing requirement, that instruction cannot be added since Snare is an ASIC and its underlying structures cannot be changed.
Traditional microprocessors are designed to give the appearance of executing one instruction at a time, which is sometimes called in-order instruction execution. For example, consider the following code.
A: R0 = R1 + R2
B: R2 = R0 + R3
C: R6 = R4 + R5
D: R2 = R2 + R1
Instruction B should see the architectural machine state, including the registers, condition codes, and so on, consistent with instruction A already having been fully executed. Likewise, instruction C should see machine state consistent with instruction B having been fully executed (and, by transitivity, with instruction A having been executed before instruction B). Likewise, instruction D should see machine state consistent with instruction A executing to completion, then instruction B, then instruction C.
Such a machine has several advantages. The instruction-completes-before-the-next-instruction-starts model is very easy to understand. It is easy for a compiler to generate such code. Techniques for improving the performance of processors that support the single-instruction model are well known and have been implemented in many processors. For example, instruction C is independent of instructions A, B and D and thus can correctly execute before or after any of them. By executing independent instructions at the same time as other independent instructions, performance can be improved while still maintaining the illusion of a single instruction executing to completion before the next starts. (In general, executing independent instructions simultaneously can destroy the illusion of in-order instruction execution. Exceptions and reordered memory operations when there are multiple writers are two examples of when additional support must be provided to allow out-of-order execution to appear to be in-order.) Machines that dynamically determine which instructions are independent and can execute in parallel, and that actually execute instructions out of program order, are called out-of-order processors. Such techniques do not require machine-executable code to match the processor in order to run efficiently. For example, imagine a processor that can execute two independent instructions at a time compared with a processor that can execute four independent instructions at a time. Since the processor itself determines which instructions can be executed in parallel, rather than that information being encoded into the instructions, both processors can potentially extract the available parallelism in any program.
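The independence claim for instruction C can be checked mechanically. The sketch below is an illustration, not a description of any particular processor's issue logic: it encodes the four-instruction example as (name, destination, sources) tuples and reports the register hazards between every earlier/later pair, using the conventional names read-after-write (RAW), write-after-read (WAR) and write-after-write (WAW). The tuple encoding and function names are assumptions made for this example.

```python
from itertools import combinations

# The four-instruction example: (name, destination register, source registers)
PROGRAM = [
    ("A", "R0", ("R1", "R2")),
    ("B", "R2", ("R0", "R3")),
    ("C", "R6", ("R4", "R5")),
    ("D", "R2", ("R2", "R1")),
]


def hazards(earlier, later):
    """Return the set of register hazards that order `later` after `earlier`."""
    _, e_dst, e_src = earlier
    _, l_dst, l_src = later
    found = set()
    if e_dst in l_src:
        found.add("RAW")  # later reads what earlier writes
    if l_dst in e_src:
        found.add("WAR")  # later overwrites what earlier reads
    if e_dst == l_dst:
        found.add("WAW")  # both write the same register
    return found


# combinations() preserves program order, so the first element is always earlier.
deps = {(e[0], l[0]): hazards(e, l) for e, l in combinations(PROGRAM, 2)}

# An instruction is independent if it has no hazard with any other instruction.
independent = [
    name for name, _, _ in PROGRAM
    if all(not h for pair, h in deps.items() if name in pair)
]

for pair, h in deps.items():
    print(pair, sorted(h) or "independent")
print("independent of all others:", independent)  # ['C']
```

Only the RAW hazards are true data dependences; WAR and WAW hazards come from register reuse, and out-of-order processors commonly remove them with register renaming.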
Determining which instructions can be executed concurrently is not trivial and requires a significant amount of hardware resources. It is possible to define an instruction set architecture (ISA) that specifies multiple instructions that can be executed concurrently in a single block of instructions. Generally, the number of instructions in a block is fixed, and often there is a fixed mix of instructions within a block. For example, an instruction block might contain 2 integer instructions, 1 load/store instruction, 1 floating point instruction and 1 branch instruction. The reason for a fixed mix of instructions is straightforward: there is a one-to-one correspondence between the functional units within the processor and the allowed instructions per block. Such ISAs are called Very-Long-Instruction-Word (VLIW) ISAs. VLIW processors can issue instructions to every functional unit simultaneously (though they are not required to issue to every functional unit for every instruction), thus maximizing the parallelism that can be exploited and leveraging the available functional units.
Machines that implement VLIW ISAs tend to be far simpler than standard machines since they do not need to dynamically determine which instructions can execute concurrently. The compiler has done that statically by specifying bundling of single instructions into VLIW instructions. To further simplify the architecture and implementation, most VLIW machines execute each VLIW instruction to completion within a pipeline stage before advancing to the next pipeline stage. Doing so dramatically simplifies the hardware at the cost of performance. One slow instruction within a VLIW instruction will stall all of the other instructions in the same VLIW instruction and all other VLIW instructions behind it.
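The static bundling step can be sketched as a small greedy packer. This is a simplified illustration rather than a real compiler scheduler: the slot mix is the example mix given earlier (2 integer, 1 load/store, 1 floating point, 1 branch), the operation names are invented, and each operation's `deps` list is assumed to already contain every earlier operation it must follow.

```python
# Slots per VLIW instruction, matching the example mix in the text.
SLOTS = {"int": 2, "mem": 1, "fp": 1, "br": 1}


def bundle(ops):
    """Greedily pack ops into VLIW instructions.

    ops -- list of (name, unit, deps) in program order, where deps names the
           earlier ops that must complete first (assumed complete and correct).
    Returns a list of bundles, each a dict mapping unit -> list of op names.
    """
    placed = {}   # op name -> index of the bundle it landed in
    bundles = []
    for name, unit, deps in ops:
        # An op must go in a later bundle than everything it depends on.
        i = max((placed[d] + 1 for d in deps), default=0)
        while True:
            while i >= len(bundles):
                bundles.append({u: [] for u in SLOTS})
            if len(bundles[i][unit]) < SLOTS[unit]:
                bundles[i][unit].append(name)
                placed[name] = i
                break
            i += 1  # that unit is full in this bundle; try the next one
    return bundles


# Hypothetical operation stream.
OPS = [
    ("i1", "int", []),
    ("i2", "int", []),
    ("i3", "int", ["i1"]),
    ("ld1", "mem", []),
    ("f1", "fp", ["ld1"]),
    ("i4", "int", []),
    ("br1", "br", ["i3"]),
]

bundles = bundle(OPS)
total_slots = len(bundles) * sum(SLOTS.values())
nops = total_slots - len(OPS)
for n, b in enumerate(bundles):
    print(n, b)
print("bundles:", len(bundles), "no-op slots:", nops)  # bundles: 3 no-op slots: 8
```

Seven operations occupy three five-slot VLIW instructions, so more than half of the slots are no-ops even in this small example.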
In order to further improve performance, some VLIW machines are also multithreaded. One such machine is the Tera/Cray MTA. Rather than let a slow VLIW instruction block the machine, the machine can switch to another thread whose previous instruction has completely finished and whose next VLIW instruction is ready to execute. Such a machine enjoys the simplicity of in-order execution while paying relatively little to support multithreading, and thus avoids the penalties of in-order execution when multiple threads are available to execute.
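The benefit of switching threads on a slow instruction can be seen in a toy cycle-count model. This is a sketch under simplifying assumptions, not a model of the Tera/Cray MTA's actual scheduling: the core issues at most one instruction per cycle, each thread is strictly in-order, and a thread may issue its next instruction only once its previous one has finished. The latencies are invented.

```python
def run(threads):
    """Return the cycle at which all threads have finished.

    threads -- list of threads, each a list of instruction latencies in cycles.
    Each cycle, the core issues from the first thread whose previous
    instruction has completed; if none is ready, the cycle is idle.
    """
    ready_at = [0] * len(threads)  # cycle at which each thread may issue again
    pc = [0] * len(threads)        # next instruction index per thread
    cycle = 0
    finish = 0
    while any(pc[t] < len(threads[t]) for t in range(len(threads))):
        for t in range(len(threads)):
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + threads[t][pc[t]]
                finish = max(finish, ready_at[t])
                pc[t] += 1
                break              # one issue per cycle
        cycle += 1
    return finish


work = [1, 4, 1]                   # hypothetical: the middle instruction is slow

one_thread = run([work])           # 6 cycles: the slow instruction stalls the core
back_to_back = 2 * one_thread      # 12 cycles to run the work twice, serially
interleaved = run([work, work])    # 8 cycles: the second thread fills the stall
print(one_thread, back_to_back, interleaved)
```

With a single thread the model degenerates to the sum of the latencies, which is the in-order stall behavior described for plain VLIW machines.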
Once defined, a VLIW ISA can limit machines that implement that ISA. For example, if a VLIW ISA specifies a certain mix of component instructions, going to a machine that has more functional units does not improve performance. One could specify a VLIW ISA that is much larger than any current machine, thus giving the machine room to grow, but then code will often wind up with many no-op instructions, since there are not always enough instructions that can be executed concurrently, and program size expands as a result. Also, executing such a super-VLIW ISA on a machine with fewer functional units would require hardware support to break down those super-VLIW instructions. Specifying a variable number of instructions within a VLIW instruction is another solution, but it also requires more complex hardware to deal with the variable number.
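The code-size cost of an oversized VLIW word can be estimated with simple arithmetic. The sketch below assumes a hypothetical program described only by how many instructions are available to run concurrently at each step, ignores the per-unit instruction mix, and counts the no-op slots needed to pad each step out to whole VLIW words of a given width.

```python
def padded_size(ready_per_step, width):
    """Return (vliw_words, nop_slots) for a given VLIW width.

    ready_per_step -- number of concurrently executable instructions at each step
    width          -- instruction slots per VLIW word
    """
    # Ceiling division: a step with more ready instructions than slots spills
    # into additional VLIW words.
    words = sum(-(-n // width) for n in ready_per_step)
    nops = words * width - sum(ready_per_step)
    return words, nops


# Hypothetical parallelism profile: 11 real instructions over 5 steps.
PROFILE = [3, 1, 2, 1, 4]

for width in (2, 4, 8):
    words, nops = padded_size(PROFILE, width)
    print(f"width {width}: {words} words, {nops} no-op slots")
# width 2: 7 words, 3 no-op slots
# width 4: 5 words, 9 no-op slots
# width 8: 5 words, 29 no-op slots
```

Going from width 4 to width 8 does not reduce the number of VLIW words for this profile; it only adds padding, which is the program-size expansion described above.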
The simplest solution to the issue of a VLIW ISA limiting the implementation is to recompile the VLIW code for a specific target machine. Though undesirable from a code compatibility standpoint, recompiling ensures that the VLIW word is correctly sized for the machine that will run the code and thus keeps the hardware simple.