A class of data networking equipment, referred to as aggregation or edge routers, has emerged that aggregates thousands of physical or logical links from end users onto one or more higher speed “backbone” links (such as OC-12, Gigabit Ethernet, OC-48 and higher) of a computer network. As the name implies, these routers reside at the edge of the network and are “gatekeepers” for packets transmitted over the high-speed core of the network. As such, they are required to perform a number of advanced “high-touch” features on the packets to protect the network from unauthorized use and to deal with a wide range of unique link interfaces, protocols and encapsulation methods that the end user links require. In order to provide these complex features, and to allow flexibility for newly defined features, these routers are normally implemented using specialized, programmable processors.
The sheer processing power and memory bandwidth required to access data structures (e.g., tables) in order to process packets dictate the use of multiple processors within an edge router. A common approach is to organize these processors into one or more parallel one-dimensional (1-D) systolic arrays wherein each array is assigned an incoming packet to process. Each processor within a single systolic array “pipeline” is assigned a piece of the task of processing the packet and therefore only needs access to a memory associated with that task. However, the corresponding processors in other parallel 1-D arrays that are performing the same tasks may also share that same memory. Furthermore, each processor has a limited amount of time to process its packet without impacting the flow of packets through the system and negatively impacting packet throughput.
One of the tasks performed by the processors is the task of “dequeuing.” Dequeuing is the process of selecting a packet from an output queue that has a packet waiting to be sent and outputting that packet on a link. Output queues are typically associated with physical or logical links and more than one queue may be associated with a given link. That is, a link that supports different service priorities may have a separate queue assigned for each service priority that is supported on that link. Moreover, an edge router may include line cards configured to support interfaces and links of various data rates. For example, a line card may have a plurality of interfaces coupled to DS0 links while another line card may have its interfaces coupled to Gigabit Ethernet links. Notably, depending on the type of interface, a given link may have multiple channels associated therewith, e.g., channelized DS0. Thus, depending upon the configuration of the router, the task of dequeuing may involve searching among thousands of output links to find one that is ready to output a packet (i.e., the link/channel is not exerting “back pressure”), and finding a corresponding output queue with a packet waiting to be sent. This task is further complicated by the fact that there can be wide variation in the speeds of the output links, e.g., from DS0 at 64K bits per second (bps) to OC-48 at 2.4 gigabits per second (Gbps).
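The per-link, per-priority queue organization described above can be sketched as follows. This is a minimal illustration only; the class name, fields, and priority-ordering rule are assumptions, not part of the router design described in this document.

```python
from collections import deque


class OutputLink:
    """Hypothetical model of an output link with one queue per service priority."""

    def __init__(self, link_id, num_priorities=1):
        self.link_id = link_id
        # One FIFO queue per service priority supported on this link.
        self.queues = [deque() for _ in range(num_priorities)]
        # True when the link/channel is exerting "back pressure" and
        # therefore cannot accept another packet.
        self.back_pressure = False

    def enqueue(self, packet, priority=0):
        self.queues[priority].append(packet)

    def has_ready_packet(self):
        # A link is a dequeue candidate only if it is not exerting back
        # pressure and at least one of its queues holds a packet.
        return not self.back_pressure and any(self.queues)

    def dequeue(self):
        # Assumed policy: service the highest priority (lowest index)
        # non-empty queue first.
        for q in self.queues:
            if q:
                return q.popleft()
        return None
```

In this sketch, the dequeuing task amounts to scanning many such `OutputLink` objects for one whose `has_ready_packet` is true, which is exactly the search the text describes as potentially spanning thousands of links.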
In edge routers that employ systolic arrays, time and memory bandwidth are critical resources. A processor, in such an array, has a limited amount of time that it can spend finding a packet to be output (dequeued) on one of many thousands of possible links. Various software-based schemes have been employed to deal with this issue. Often these schemes use tables to track the queues. The size of these tables is often determined as a tradeoff between the amount of memory available to hold the tables and the number of processor cycles necessary to search the tables. However, despite taking this tradeoff into consideration, even the more efficient of these schemes are not deterministic as to the time and number of memory accesses needed to perform a given search. In the case of a parallel systolic array, this non-deterministic behavior may result in the stalling of the pipeline and subsequently a degradation of all processing. To prevent this undesirable behavior, a scheduling algorithm may be defined that performs only specific memory accesses, even though those accesses are not always successful. However, such an algorithm may result in missed opportunities to output a packet and subsequently lead to reduced output packet rates.
For example, a dequeuing mechanism can be implemented in software based on a “timing wheel” arrangement. In such an implementation, microcode executing on the processors constructs a “timing wheel” table that specifies which output channel to consider for dequeuing a processed packet. The timing wheel table represents an implementation of a scheduling algorithm and, as such, has entries representing each of the various channels supported by the router. The range of channels may extend from, e.g., DS0 having a data rate of 64K bps to Gigabit Ethernet (GE) having a rate of one Gbps. Because the GE channel is much faster than the DS0 channel, and is therefore serviced more often, there are more entries in the timing wheel table for the former channel than for the latter.
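The rate-proportional allocation behind such a table can be illustrated with a sketch. Everything here beyond the core idea (faster channels receive proportionally more slots) is an assumption for illustration: the function name, the even-interleaving strategy, and the collision handling are not taken from the source.

```python
def build_timing_wheel(channel_rates_bps):
    """Build an illustrative timing-wheel table.

    Each channel receives a number of slots proportional to its data rate,
    so faster channels are visited proportionally more often; the slowest
    channel receives exactly one slot.
    """
    slowest = min(channel_rates_bps.values())
    slots = {ch: rate // slowest for ch, rate in channel_rates_bps.items()}
    total = sum(slots.values())
    wheel = [None] * total
    # Spread each channel's slots roughly evenly around the wheel,
    # placing the fastest channels first and probing linearly past
    # occupied slots on collision.
    for ch, n in sorted(slots.items(), key=lambda kv: -kv[1]):
        step = total / n
        offset = 0.0
        for _ in range(n):
            i = int(offset) % total
            while wheel[i] is not None:
                i = (i + 1) % total
            wheel[i] = ch
            offset += step
    return wheel
```

For the DS0/GE example in the text, a wheel built this way would contain 15625 GE slots (1 Gbps ÷ 64 Kbps) for every single DS0 slot, which is what makes walking the wheel service the GE channel far more often.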
The microcode “walks” through the timing wheel table, examining each entry to determine whether there is a packet available to be dequeued to the output channel associated with the entry. This is an expensive operation from a processor resource consumption perspective, primarily because of the number of conditions that must be considered before rendering the dequeue decision (e.g., a packet is available in the output queue associated with the selected output channel and the selected output channel can receive the packet). Moreover, each task performed by a processor of the systolic array must generally be performed within a finite period of time (i.e., a phase) and, accordingly, those conditions must be analyzed and a decision rendered prior to the expiration of that phase. Furthermore, within a phase, the processor may only have time to analyze one or two entries in the timing wheel without stalling the pipeline.
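The bounded per-phase walk described above might be sketched as follows, assuming a cursor that persists across phases and a `can_dequeue` predicate that encapsulates the dequeue conditions; all names and the two-entry budget parameter are hypothetical.

```python
def walk_phase(wheel, cursor, can_dequeue, max_entries=2):
    """Examine up to max_entries timing-wheel slots starting at cursor.

    Returns (channel_or_None, new_cursor). A result of None means every
    examined entry failed its dequeue conditions within this phase --
    i.e., a missed opportunity to transmit a packet.
    """
    for _ in range(max_entries):
        channel = wheel[cursor % len(wheel)]
        cursor += 1
        if can_dequeue(channel):
            return channel, cursor
    return None, cursor
```

Bounding the loop at one or two entries models the phase limit: the walk always finishes in deterministic time, but at the cost of sometimes finding nothing to dequeue, which is the tradeoff the surrounding text describes.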
A problem with the software-based timing wheel table arrangement is that a packet may not be able to be dequeued (transmitted) each time the microcode accesses an entry of the table. One reason is that there may be no packet loaded into an output queue for the channel associated with the timing wheel table entry. Each accessed entry that has no packet available to be dequeued represents a missed opportunity to transmit a packet. The microcode thus wastes time visiting such entries, and this consumption of valuable resources can lead to a reduction in throughput.
Another condition that the processor must consider is whether or not there is sufficient capacity to transmit the packet over the internal data path to the selected channel. Using the conventional software approach, it is possible to “overload” the queues located at the “heads” of internal data paths that couple the arrays to the line cards. That is, even though there may be a packet in an output queue that requires servicing over an output channel, that packet may not be transmitted over the internal data path because of the limited bandwidth over that path. Here, an output command queue located at the head of the data path may be full (as indicated by, e.g., queue status information) such that it cannot accommodate the output packet. If the queue status indicates that a selected output channel cannot be serviced because the output command queue cannot accommodate the packet, another opportunity to transmit a packet is lost.
A third condition that needs to be considered by a processor before it can transmit a packet over a selected channel is whether or not the output buffers associated with the channel are full. Even though a packet may be available for transmission over a selected channel and there is capacity over the internal data path to provide that packet to the selected channel, the packet still may not be transmittable over a given output channel because the small output buffer associated with that channel is full. The microcode periodically monitors the status of these buffers to determine their fullness. Thus, even though the two previous conditions are met, the microcode may determine that it cannot transmit a packet over the channel because its output buffer is full.
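Taken together, the three conditions discussed above (a packet waiting in the output queue, command-queue capacity on the internal data path, and a non-full output buffer) might be checked as in the following sketch; the dictionary-based state and the function name are illustrative simplifications, not the router's actual data structures.

```python
def can_dequeue(channel, output_queues, cmd_queue_free_slots, buffer_full):
    """Return True only if all three dequeue conditions hold for channel."""
    # Condition 1: a packet is loaded in the output queue for this channel.
    if not output_queues.get(channel):
        return False
    # Condition 2: the output command queue at the head of the internal
    # data path can accommodate another packet.
    if cmd_queue_free_slots.get(channel, 0) == 0:
        return False
    # Condition 3: the channel's small output buffer is not full.
    if buffer_full.get(channel, False):
        return False
    return True
```

Because all three checks must pass within a single phase, failing any one of them at the examined timing-wheel entry yields the lost transmit opportunity the text describes.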
Failure to satisfy all the conditions necessary to transmit a packet over a selected channel within the allotted time phase results in a lost opportunity to transmit a packet from the router and, accordingly, adversely impacts the throughput of the router. The technique employed by the present invention addresses the problems associated with these conditions.