1. Technical Field
Embodiments of the present invention generally relate to computer processors. More particularly, embodiments relate to packet scheduling in network processors.
2. Discussion
In the highly competitive computer industry, the trend toward faster processing speeds and increased functionality is well documented. While this trend is desirable to the consumer, it presents significant challenges to processor designers as well as manufacturers. A particular challenge relates to the processing of packets by network processors. For example, a wide variety of applications such as multi-layer local area network (LAN) switches, multi-protocol telecommunications products, broadband cable products, remote access devices and intelligent peripheral component interconnect (PCI version 2.2, PCI Special Interest Group) adapters use one or more network processors to receive and transmit packets/cells/frames. Network processors typically have one or more microengine processors optimized for high-speed packet processing. Each microengine has multiple hardware threads. A network processor also typically has a general purpose processor on chip. Thus, in a network processor, a receive thread on a microengine will often transfer each packet from a receive buffer of the network processor to one of a plurality of queues contained in an off-chip memory such as a synchronous dynamic random access memory (SDRAM). Queue descriptor data is stored in a somewhat faster off-chip memory such as a static RAM (SRAM).
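The split described above, with packet bodies in larger, slower SDRAM and compact queue descriptors in faster SRAM, can be sketched as follows. This is an illustrative model only, not the implementation described herein; the class and field names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class QueueDescriptor:
    """Compact per-queue bookkeeping record; in a network processor this
    would reside in the faster off-chip SRAM (names are hypothetical)."""
    count: int = 0

class PacketQueue:
    """Packet bodies would reside in the larger off-chip SDRAM."""
    def __init__(self):
        self.desc = QueueDescriptor()
        self.buf = deque()

    def enqueue(self, pkt):
        # Performed by a receive thread on a microengine.
        self.buf.append(pkt)
        self.desc.count += 1

    def dequeue(self):
        # Performed by a transmit thread; consults only the small
        # descriptor to learn the queue state.
        self.desc.count -= 1
        return self.buf.popleft()
```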
Each queue may have an associated type of service (TOS), ranging from network control, which typically has the highest priority, to best-effort, which often has the lowest priority. Information stored in the packet headers can identify the appropriate TOS for the packet in order to provide what is sometimes referred to as a “differentiated services” approach.
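As a minimal sketch of how header information can identify a packet's TOS: in IPv4, the upper three bits of the 8-bit TOS byte carry the IP precedence field, where 7 denotes network control (highest priority) and 0 denotes routine or best-effort traffic (lowest priority). The function name below is hypothetical.

```python
def precedence(tos_byte):
    """Extract the 3-bit IP precedence from an IPv4 TOS byte.
    7 = network control (highest), 0 = routine/best-effort (lowest)."""
    return (tos_byte >> 5) & 0x7
```

A scheduler could use this value to select which of the plurality of queues receives the packet.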
Once the packets are assembled in the SDRAM, either the general purpose on-chip processor or one or more microengines classifies and/or modifies the packets for transmission back out of the network processor. A microengine transmit thread determines the queues from which to consume packets based on queue priority and/or a set of scheduling rules. A number of scheduling techniques have evolved in recent years to determine when the transmit thread is to transition from one queue to another.
One queue transition approach follows a strict priority rule, in which the highest priority queue must be empty before packets will be transmitted from the next highest priority queue. This technique is shown in FIG. 10 at method 23 and can result in insufficient consumption from the lower priority queues, or “starvation”. Such a result can become particularly acute in processing environments having heavy packet traffic. Another technique is to transition between the queues in a “Round Robin” fashion, in which one packet is transmitted from each queue, regardless of priority. FIG. 11 illustrates a conventional Round Robin approach at method 25. While the Round Robin technique can be useful in certain circumstances, the inherent disregard for queue priority can lead to significant unfairness in bandwidth allocation. Yet another technique has been to deplete each queue by a weighted amount depending upon its respective type of service; this technique is described in “Efficient Fair Queuing using Deficit Round Robin”, M. Shreedhar et al., ACM SIGCOMM '95. FIG. 9 shows a conventional Deficit Round Robin (DRR) approach at method 21. While conventional DRR can address some of the shortcomings associated with conventional scheduling techniques, certain implementation difficulties remain.
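The conventional single-threaded DRR technique cited above can be sketched as follows. Each queue is assigned a quantum proportional to its type of service; on each round, the quantum is added to the queue's deficit counter, and packets are consumed while the head packet fits within the accumulated deficit. This is a simplified illustration of the Shreedhar et al. algorithm, not the implementation of any particular network processor; the function and variable names are chosen for illustration.

```python
from collections import deque

def drr_schedule(queues, quanta, rounds):
    """Minimal single-threaded Deficit Round Robin sketch.
    queues: list of deques holding packet sizes in bytes.
    quanta: per-queue byte quantum reflecting its type of service.
    Returns the ordered list of (queue_index, packet_size) transmitted."""
    deficits = [0] * len(queues)
    sent = []
    for _ in range(rounds):
        for i, q in enumerate(queues):
            if not q:
                continue
            deficits[i] += quanta[i]          # add this round's credit
            while q and q[0] <= deficits[i]:  # head packet fits in credit
                pkt = q.popleft()
                deficits[i] -= pkt
                sent.append((i, pkt))
            if not q:
                deficits[i] = 0  # unused credit is discarded when queue empties
    return sent
```

Because the weighted quanta bound how much each queue may consume per round, lower priority queues are not starved as under strict priority, yet higher weighted queues still receive proportionally more bandwidth than under plain Round Robin.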
The conventional DRR implementation shown in FIG. 9 is a single-threaded implementation that is suitable for a general purpose processor. However, general purpose processors are significantly slower than multi-threaded, multi-processor network processors in processing and scheduling packets. For example, one commercially available network processor uses as many as sixteen receive threads and six transmit threads to populate and read from the plurality of queues. Each queue is therefore shared by multiple threads. A DRR implementation that scales to high speeds is therefore required for multi-threaded, multi-processor network processor architectures. Such an implementation would require the sharing of queue descriptors between multiple threads. Furthermore, the priority information is stored along with the queue descriptors in an off-chip location. As a result, a considerable amount of processing time can be expended in determining whether to transition to the next queue. There is therefore a need for a system and method of processing packets in a multi-threaded, multi-processor architecture that accounts for queue priority without sacrificing speed or performance.