In computer networks, information is constantly being moved from a source to a destination, typically in the form of packets. In the simplest situations, the source and destination are directly connected and the packet of information passes from the source to the destination without any intermediate stages. However, in most networks, there is at least one intermediate stage, and often several, between the source and the destination. In order for the information to move from the source to the destination, it must be routed through a set of devices that accept the packet and pass it along a predetermined path toward the destination. These devices, referred to generically as switches, are typically configured to accept packets from some number of input ports and transmit that information to an output port selected from a plurality of ports. Often, ports are capable of both receiving and transmitting, such that the input and output ports are the same physical entities.
In an ideal network, traffic arrives at an input port of a switch. The switch determines the appropriate destination for the packet and immediately transmits it to the correct output port. In such a network, there is no need for storing the packet of information inside the switch, since the switch is able to transmit the packet as soon as it receives it.
However, because of a number of factors, this ideal behavior is not realizable. For instance, if the switch receives packets on several of its input ports destined for the same output port, the switch must store the information internally, since it cannot transmit all of these different packets of information simultaneously to the same output port. In this case, the output port is said to be “congested”. This term also describes the situation in which the device to which this output port is connected is unable to receive or process packets at the rate at which they arrive for some reason. In such a case, the switch must store the packet destined for that output port internally until either the offending device is able to receive more information or the packet is discarded.
While it is possible that external factors may cause a switch to store packets rather than transmitting them immediately, it is a design goal of nearly all switches that they are able to process packets at the rate at which they are received. The speed at which packets are received, also known as line rate, is a critical parameter in the design of the switch.
Switches typically have a set of inputs, or input ports, where data enters the device. Similarly, switches also have a set of outputs, or output ports, whereby data exits the device. In many implementations, an input port and an output port will share a common physical connection at the point where the device interfaces with other components. This point can be a lead or pin exiting the device, or an internal interconnect within a larger device, of which this specific switch is only a subset. Thus, in many implementations, the number of output ports and the number of input ports will be identical.
The design goal for a switch is that data can exit the output ports at the same rate as it entered the input ports, although it may be somewhat delayed. Several mechanisms have been developed to meet this requirement.
One such mechanism, known as input queuing, is shown in FIG. 1. Input ports, I0 through In−1, are each associated with a memory element, M0 through Mn−1. Each memory element receives input data only from its associated input port. In addition to the input ports, each switch has a set of output ports. Typically, the number of input ports and the number of output ports are identical, although this is not a requirement. The data received by any input port can be destined for any of the N output ports in the switch; thus, connections between each memory element and each output port are shown.
In the worst case scenario, shown in FIG. 2, each input port receives data, in the form of a packet, destined for port O0 during the first time slot. Each packet is labeled with its destination output port, followed by the time slot during which it is to be transmitted. In the next time slot, each port except I0 receives a packet for port O1. This pattern continues, so that in the Nth time slot, only port In−1 receives a packet destined for port On−1.
In a non-blocking, ideal switch, the switch should be able to deliver the packets to the output ports in the minimum time period. As shown in FIG. 2, in time slot 0, output port O0 transmits its first packet. Since no packets have arrived yet for any other output ports, the other output ports remain idle. During the next time slot, the packets that arrived at input port I1 that are destined for output ports O0 and O1 are both sent. This process continues and, in the general case, during time slot k, all packets that arrived at input port Ik up to that point are transmitted simultaneously on output ports O0 through Ok. Therefore, memory element Mk must be able to supply data at full line rate to k+1 output ports during a single time slot. In order to achieve this result, it follows that memory element Mk must run at a speed of k+1 multiplied by the line rate. Thus, for a switch with N input ports, the memory elements must be able to supply data at N times the line rate of the switch. Since each memory element must also be able to receive a new incoming packet while transmitting to all output ports simultaneously, each memory element must run at a speed of at least (N+1) multiplied by the line rate.
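The worst-case pattern of FIG. 2 can be checked with a short simulation. The following Python sketch is illustrative only; the function name and the bookkeeping are assumptions and do not describe any particular switch. It assumes that each output port drains its waiting packets one per time slot in ascending input-port order, so that input Ik transmits to outputs O0 through Ok (k+1 ports) during time slot k. It reports the largest number of memory operations that any single memory element performs in one time slot, expressed as a multiple of the line rate.

```python
from collections import defaultdict


def input_queued_peak_speedup(n):
    """Peak memory operations per time slot for an n-port input queued switch.

    Traffic follows the worst-case pattern: in time slot t, every input
    port with index >= t receives one packet destined for output port t.
    """
    writes = defaultdict(int)  # (input port, time slot) -> packets written
    reads = defaultdict(int)   # (input port, time slot) -> packets read out

    for slot in range(n):
        for inp in range(slot, n):
            # The packet destined for output `slot` arrives at input `inp`.
            writes[(inp, slot)] += 1
            # Assumption: each output sends one packet per slot, taking the
            # inputs in ascending order, so this packet leaves in slot `inp`.
            reads[(inp, inp)] += 1

    keys = set(writes) | set(reads)
    return max(writes[key] + reads[key] for key in keys)


print(input_queued_peak_speedup(12))
```

For a 12 port switch the sketch prints 13, which agrees with the (N+1) multiple of the line rate derived above.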
A second consideration in the design of a switch is the amount of memory that is consumed. The amount of memory at each input port must be at least equal to the amount of buffering that is communicated to the neighboring switch. In the above example, if each input port had communicated that an amount of memory, M, was available, then the total memory in the switch can be expressed as N multiplied by M, where N is the number of input ports and M is the amount of memory at each input port.
A third consideration in the design of a switch is the complexity of scheduling the transmission of packets. The receipt of packets is achieved by having sufficient memory available at the input port. The transmission of packets to their respective output ports is most typically done through the use of a high speed scheduler, which typically uses a time multiplexing scheme to allocate a slice of each time slot to each output port in sequence. Although running at high speed, the scheduling algorithm is very simple and straightforward. This minimizes the time to design and verify its operation, which is often a key consideration in the design of new devices.
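A time multiplexing scheduler of this kind can be sketched in a few lines. The Python below is a minimal illustration under stated assumptions: the function name, the use of one queue per output port, and the log format are all hypothetical. Each time slot is divided into one slice per output port, and the ports are visited in a fixed sequence; a port with nothing waiting simply leaves its slice unused.

```python
from collections import deque


def run_tdm_scheduler(output_queues, num_slots):
    """Serve each output port once per time slot, in fixed port order.

    output_queues: list of deques, one per output port, holding the
    packets waiting for that port (a hypothetical representation).
    Returns a list of (time slot, output port, packet) tuples.
    """
    log = []
    for slot in range(num_slots):
        # One slice of the time slot is allocated to each port in sequence.
        for port, queue in enumerate(output_queues):
            if queue:
                log.append((slot, port, queue.popleft()))
    return log


queues = [deque(["a", "b"]), deque(), deque(["c"])]
for entry in run_tdm_scheduler(queues, num_slots=2):
    print(entry)
```

The fixed visiting order is what keeps the scheduler simple: no arbitration between ports is needed, at the price of running the loop once per port within every time slot.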
The first line of Table 1 illustrates the memory size and speed characteristics associated with a 12 port switch using an input queued structure.
TABLE 1
Type of Switch      Memory Size     Memory Speed
Input Queued        12*M            13 * line rate
Central Memory      12*M            24 * line rate
CIOQ                18*M            3 * line rate
A second mechanism, using a centralized memory structure, can also be used to implement a switch. This mechanism, known as an output port queued switch with a centralized memory, is shown in FIG. 3. In this implementation, rather than having separate memories as with input port queues, a single large memory is used. All of the input and output ports communicate with this centralized memory.
Referring to FIG. 4, it can be seen that there are scenarios in which each of the N input ports and each of the N output ports must be able to communicate simultaneously with the memory in order for it to operate in its most efficient manner. In the first time slot, each input port receives a packet destined for a different output port, scheduled for delivery in that time slot. To achieve this result, the memory must be able to complete all of these operations in a single time slot. In other words, the memory must operate at the line rate, multiplied by the total number of ports. Thus, the memory must operate at a speed of at least 2*N multiplied by the line rate of the incoming data, assuming that the number of output ports is the same as the number of input ports.
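The operation count behind this figure can be made concrete with a small sketch. The Python below is an illustration only; the function name and the particular traffic pattern (input i sending to output (i + 1) mod N) are assumptions chosen so that every output port is targeted exactly once in the same time slot.

```python
def central_memory_ops_per_slot(n):
    """Memory operations in one fully loaded time slot of an n-port switch."""
    # Hypothetical full-load slot: input i receives a packet for output
    # (i + 1) % n, so each output is the destination of exactly one packet.
    destinations = [(i + 1) % n for i in range(n)]
    assert sorted(destinations) == list(range(n))

    writes = len(destinations)      # every arriving packet is written once
    reads = len(set(destinations))  # every output reads one packet out
    return writes + reads


print(central_memory_ops_per_slot(12))
```

With 12 input ports and 12 output ports, the single memory handles 24 operations in the time slot, which is the 2*N multiple of the line rate stated above.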
Since all N of the input ports must communicate with the single centralized memory, that memory must be large enough to accommodate the sum of the amounts of buffering that each input port has communicated to the neighboring switch. In this example, if each input port had communicated that an amount of memory, M, was available, then the total memory in the switch can be expressed as N multiplied by M, where N is the number of input ports and M is the amount of memory at each input port.
The design of the scheduler is roughly equivalent to that of the input queued switch described above, where the scheduling uses a time multiplexing scheme to allocate a portion of each time slot to each output port.
The second line of Table 1 illustrates the memory size and speed characteristics associated with a 12 port switch using an output queued structure with a centralized memory.
Using current technologies, it is typically more feasible to include additional memory within a semiconductor device than it is to increase the speed of that memory. Consequently, much effort has been expended in both the intellectual and commercial pursuit of switches that can operate at lower memory speeds, even at the expense of adding memory elements.
One such implementation is known as a Combined Input-Output Queued (CIOQ) switch, as shown in FIG. 5. In this structure, a memory element is associated with each input port, as is done in the input queued switch. However, an additional memory element is associated with each output port as well. This additional memory element at each output port allows data to be moved from the input queues to the output queues, not only when it is being transmitted, but also during idle times. This alleviates the very high bandwidth requirements associated with the input queued switch.
Referring back to FIG. 2, the worst case traffic pattern for an input queued switch is also the worst case pattern for a CIOQ switch. Numerous research papers, such as Matching Output Queueing with a Combined Input Output Queued Switch, which was published by Stanford University and presented at Infocom '99, and is hereby incorporated by reference, have shown that a CIOQ switch can properly emulate an output queued switch for a broad class of scheduling algorithms if the transfers between the input queues and the output queues are performed at twice the line rate. Thus, the memory elements within a CIOQ switch need only operate at three times the line rate, to account for the two times line rate internal transmissions plus the external line rate transmission. This structure produces a much lower memory speed requirement than either of the other prior art approaches, especially as the number of ports increases.
Implementing this structure requires memory elements associated with each input port and memory elements associated with each output port. As described earlier, the amount of memory at each input port is related to the available buffering that the port has communicated to the neighboring switch. The memory elements associated with the output ports are used to hold packets before they are transmitted via the output port. These elements typically do not need to be as large as those associated with the input ports, and preferably are roughly half as large. Therefore, the amount of memory needed for the memory elements associated with the input ports, as before, is N multiplied by M, while the amount of memory associated with the output ports is N multiplied by M/2. This results in a total memory size of 1.5*N*M.
The third line of Table 1 illustrates the memory size and speed characteristics associated with a 12 port switch using a combined input output queued structure.
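The three rows of Table 1 follow directly from the expressions derived for each architecture, and can be regenerated for any port count. The Python sketch below is illustrative; the function name and the dictionary layout are assumptions. Memory size is expressed in multiples of M (the buffering advertised per input port) and memory speed in multiples of the line rate.

```python
def switch_memory_requirements(n):
    """Return {architecture: (size in units of M, speed in units of line rate)}."""
    return {
        # N memories of size M; each supplies N outputs and accepts 1 input.
        "Input Queued": (n, n + 1),
        # One memory of size N*M shared by N inputs and N outputs.
        "Central Memory": (n, 2 * n),
        # N input memories of size M plus N output memories of size M/2;
        # internal transfers at twice the line rate plus the external link.
        "CIOQ": (1.5 * n, 3),
    }


for name, (size, speed) in switch_memory_requirements(12).items():
    print(f"{name}: {size:g}*M, {speed} * line rate")
```

For N = 12 the output matches Table 1. Increasing N shows the trend discussed above: the speed requirement grows with the port count for the first two structures, while it stays fixed at three times the line rate for the CIOQ switch.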
The CIOQ significantly reduces the required speed of the memory in exchange for a modest increase in the amount of memory. Based on current semiconductor technologies, this would appear to be the proper tradeoff. However, the CIOQ is not without significant drawbacks.
In order to achieve the benefits highlighted above, a complex scheduling algorithm is required. In fact, the previously cited Stanford paper states that the significant reduction in memory bandwidth comes at the expense of the scheduling algorithm. It further states that the algorithms proposed in the paper are not suitable for high port count switches. Other algorithms are possible; however, it requires significant development and testing time to verify that the scheduling algorithm works correctly under all types of traffic patterns and conditions. Mistakes in the algorithm will cause the switch to not forward packets efficiently, leading to potential network performance degradation. Furthermore, the development and testing of such a complex scheduling algorithm is a time consuming process, which could adversely affect the ability to bring the switch to market in a timely manner. Complex algorithms are also very difficult to implement in silicon. The scheduling algorithm must be designed to operate at a sufficiently high speed so as to keep up with the switching rate of the memories. As the algorithm becomes more complex and more steps are added, it becomes increasingly difficult to meet the required time constraints for the scheduling circuitry. It can then require significant development time to find the proper trade-offs between scheduling complexity, performance, and speed. These issues counteract the benefits in memory bandwidth described earlier, making the CIOQ switch less desirable.
Several trends in integrated semiconductor circuit design and overall system design give rise to the need for a new type of switch architecture. First, the line rate between switches continues to increase at a faster rate than the speed of the memory elements within the integrated circuits. Thus, it is becoming more and more difficult to develop input queued switches with the required memory bandwidth. Second, the number of ports on each switch continues to increase, putting further pressure on the memory bandwidth. Third, as semiconductor geometries continue to shrink, many integrated circuit (IC) designs are now pad-limited. This means that the size of the die is determined by the number of bonding pads that are required and not by the size of the logic within the IC. Therefore, the amount of logic within the chip can grow without affecting its cost, since the die size remains unchanged. Fourth, although logic and memory elements can be added without a monetary cost, there are hidden costs. For example, as memories increase in size, they decrease in speed. However, this relationship is not proportional; an increase of 100% in memory size will result in a memory speed decrease of about 10-20%. Also, the addition of more logic, specifically complex scheduling logic, can significantly impact the time it takes to develop and fully test a new switching IC. Furthermore, it is also difficult to run large, complicated logic at very high speed due to the irregularities of layout and routing.
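The non-proportional relationship between memory size and memory speed can be illustrated numerically. The Python sketch below assumes that each doubling of memory size costs a fixed fraction of speed and that this loss compounds geometrically for other size ratios. That compounding model, along with the function and parameter names, is an assumption made here for illustration; the text above only gives the figure for a 100% increase in size.

```python
import math


def relative_memory_speed(size_ratio, loss_per_doubling):
    """Speed of a memory relative to a baseline of size_ratio 1.0.

    size_ratio: new size divided by the baseline size (must be > 0).
    loss_per_doubling: assumed fractional speed loss per doubling of
    size, e.g. 0.10 to 0.20 for the 10-20% range quoted above.
    """
    doublings = math.log2(size_ratio)
    return (1.0 - loss_per_doubling) ** doublings


for loss in (0.10, 0.20):
    print(loss, relative_memory_speed(2.0, loss), relative_memory_speed(1.5, loss))
```

Under this assumed model, a 50% increase in size costs roughly 6% to 12% of speed, which is small compared with the reduction in required memory speed that additional memory elements can buy.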
Based on these trends, several conclusions can be drawn. The first is that increasing the amount of memory in an IC is generally less expensive, in terms of cost and time, than increasing the speed of those memories. The second conclusion is that complicated, time-critical logic increases both the risk of failure and the development time, and should be avoided as much as possible.