In computer networks, information is constantly being moved from a source to a destination, typically in the form of packets. In the simplest situations, the source and destination are directly connected and the packet of information passes from the source to the destination without any intermediate stages. However, most networks have one or more intermediate stages between the source and the destination. In order for the information to move from the source to the destination, it must be routed through a set of devices that accept the packet and pass it along a predetermined path toward the destination. These devices, referred to generically as switches, are typically configured to accept packets from some number of input ports and transmit that information to an output port, which is selected from a plurality of ports. Often, ports are capable of both receiving and transmitting, such that the input and output ports are the same physical entities.
In an ideal network, traffic arrives at an input port of a switch. The switch determines the appropriate destination for the packet and immediately transmits it to the correct output port. In such a network, there is no need for storing the packet of information inside the switch, since the switch is able to transmit the packet as soon as it receives it.
However, because of a number of factors, this ideal behavior is not realizable. For instance, if the switch receives packets on several of its input ports destined for the same output port, the switch must store the information internally, since it cannot transmit all of these different packets of information simultaneously to the same output port. In this case, the output port is said to be “congested”. This term also describes the situation in which the device to which this output port is connected is unable to receive or process packets at the rate at which they arrive for some reason. In such a case, the switch must store the packet destined for that output port internally until either the offending device is able to receive more information or the packet is discarded.
Patent application Ser. No. 10/794,067, which is hereby incorporated by reference, describes a system and method of implementing multiple queues within a switching element to store packets destined for congested paths. Briefly, the switch determines the path of the packet, specifically, the action to be taken by the adjacent downstream switch, to determine whether it is destined for a congested path. The packet header contains the path of the packet as defined by the successive actions taken by each switching element. In other words, the header might specify the output port to be used by each switch in the path. The switching element compares this path specified in the header to its list of known congested paths. Based on that comparison, the switching element either forwards the packet or moves it to a special congested flow queue, where it remains until either a specific time period has passed or the path is no longer congested. That patent application describes several mechanisms by which a switch is notified of congested paths. One technique is known as Status Based Flow Control, where a downstream node explicitly informs an upstream node that at least one of its output ports is congested. This can be accomplished in the form of a message telling the sender to stop transmitting packets that are to be sent via the congested output port, followed by a second message telling it to resume when the congestion is resolved. Alternatively, the destination might transmit a message telling the source to stop transmitting packets destined for the offending output port for a specific time period.
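The header comparison described above can be illustrated with a minimal sketch. It is not taken from the referenced application: the class name `PathAwareSwitch`, the representation of a header path as a tuple of output port numbers (one per switch along the route), and the prefix-matching rule are all assumptions made for illustration. The timeout-based release mentioned above is omitted for brevity.

```python
from collections import deque


class PathAwareSwitch:
    """Toy model (hypothetical) of a switch that holds packets bound for congested paths."""

    def __init__(self):
        self.congested_paths = set()    # known congested paths, each a tuple of ports
        self.congested_queue = deque()  # packets held back from congested paths
        self.forwarded = []             # ids of packets sent downstream

    def is_congested(self, path):
        # Assumption: a packet is affected if its header path begins with
        # any path on the switch's list of known congested paths.
        return any(path[:len(c)] == c for c in self.congested_paths)

    def receive(self, packet_id, path):
        path = tuple(path)
        if self.is_congested(path):
            self.congested_queue.append((packet_id, path))
        else:
            self.forwarded.append(packet_id)

    def mark_congested(self, path):
        self.congested_paths.add(tuple(path))

    def clear_congested(self, path):
        # When a path is no longer congested, re-evaluate every held packet.
        self.congested_paths.discard(tuple(path))
        held, self.congested_queue = self.congested_queue, deque()
        for packet_id, held_path in held:
            self.receive(packet_id, held_path)


switch = PathAwareSwitch()
switch.mark_congested((3, 5))        # local port 3, then port 5 on the next switch
switch.receive("p1", (3, 5))         # matches the congested path, so it is held
switch.receive("p2", (3, 7))         # same local port, different downstream port
print(switch.forwarded)              # ['p2']
print(list(switch.congested_queue))  # [('p1', (3, 5))]
switch.clear_congested((3, 5))
print(switch.forwarded)              # ['p2', 'p1']
```

In this sketch, only the packet whose header path matches a known congested path is held; a packet that shares the local output port but takes a different port downstream is forwarded immediately.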
ASI (Advanced Switching Interconnect) is an industry standard protocol based on the PCI Express specification. Advanced Switching (AS) allows for the standardization of today's proprietary backplanes. Advanced Switching uses the same physical-link and data-link layers as the PCI Express architecture, taking advantage of the large existing PCI Express ecosystem. AS is a multi-point, peer-to-peer switched interconnect standard offering encapsulation of any protocol, multiple messaging mechanisms, QoS including congestion management, and extended high availability features. The ASI specification is written, updated and maintained by the ASI SIG (Special Interest Group), and the current version of the specification can be found at www.asi-sig.org/members/Core AS Rev1 0.pdf and is hereby incorporated by reference. Similarly, the PCI Express specification is written, updated and maintained by the PCI SIG, and the current specification can be found at www.pcisig.org/members/downloads/specifications/pciexpress/pciexpress base 10a.pdf and is also hereby incorporated by reference.
ASI defines a mechanism by which upstream switches are notified of downstream congestion. Specifically, a switch that is experiencing congestion at one of its output ports can transmit a special message, known as a Data Link Layer Packet (DLLP), to an adjacent upstream switch. This DLLP contains multiple fields, one of which identifies the output port that is experiencing the congestion and another of which specifies the action that the upstream switch should take in response. This mechanism communicates congestion in one switch to the adjacent switch very effectively; however, it is limited to that application. The format of a DLLP does not allow the mechanism to scale to identify congested paths through the entire fabric.
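As a rough illustration of the single-hop nature of this notification, the following sketch models a message carrying only the two fields discussed above: a congested output port and a requested action. It does not reproduce the actual DLLP wire format defined by the ASI specification; the names `CongestionNotice`, `Action`, and `UpstreamSwitch`, and the STOP/RESUME actions, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Hypothetical actions; the real encoding is defined by the ASI specification.
    STOP = "stop"
    RESUME = "resume"


@dataclass(frozen=True)
class CongestionNotice:
    """Simplified stand-in for the congestion fields of a DLLP."""
    port: int       # congested output port on the switch that sent the notice
    action: Action  # what the upstream neighbor is asked to do


class UpstreamSwitch:
    def __init__(self):
        self.blocked_ports = set()  # downstream-neighbor ports we must not target

    def handle(self, notice):
        if notice.action is Action.STOP:
            self.blocked_ports.add(notice.port)
        else:
            self.blocked_ports.discard(notice.port)

    def may_send(self, next_switch_port):
        # The notice names only a port on the adjacent switch, so this check
        # cannot distinguish packets by what happens further downstream.
        return next_switch_port not in self.blocked_ports


upstream = UpstreamSwitch()
upstream.handle(CongestionNotice(port=5, action=Action.STOP))
print(upstream.may_send(5))  # False
print(upstream.may_send(7))  # True
upstream.handle(CongestionNotice(port=5, action=Action.RESUME))
print(upstream.may_send(5))  # True
```

Because the notice carries a single port number rather than a sequence of ports, it has no way to express a path that spans more than one switch.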
Thus, while congestion is reduced since traffic is no longer being sent to the congested port, this scheme has undesirable effects. Consider the scenario where there are three switches, A, B and C, in series. Assume that the most downstream switch, C, experiences congestion at its output port 5. Switch C communicates this information back to its adjacent switch, B, which now stops transmitting packets destined for output port 5 of switch C. Assume that all such packets are transmitted via output port 3 of the intermediate switch B. These packets are then stored in a congestion queue in switch B, waiting for the congestion to pass. At a later time, intermediate switch B cannot store any more packets destined for output port 5 of the downstream switch. Since a DLLP only permits a switch to identify its own congested port, the intermediate switch B sends a DLLP to the upstream switch A, informing it that output port 3 of switch B is experiencing congestion. At this point, the upstream switch A stops transmitting packets destined to be transmitted via output port 3 of the intermediate switch B.
This behavior is an appropriate response to the congestion issue presented above; however, because of the limitations of the DLLP mechanism, some packets in upstream switch A that could have been sent are not. Specifically, any packet in upstream switch A destined for output port 3 of intermediate switch B will be held. However, only packets destined to be transmitted by output port 5 of downstream switch C truly need to be held. Thus, any packet in upstream switch A that is to be transmitted via output port 3 of intermediate switch B and then via output port 7 of downstream switch C is unnecessarily held. More generally, any packet in upstream switch A destined to be transmitted via any output port of downstream switch C other than output port 5 will be held unnecessarily in upstream switch A. This reduces the throughput of the fabric and increases latency.
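The three-switch scenario can be worked through with a small sketch that compares what switch A holds under a port-only notification with what it would hold if it knew the full congested path. The packet identifiers and the encoding of each packet's route as a tuple (output port at B, output port at the following switch) are assumptions made for illustration.

```python
# Congested path from the scenario: output port 3 of B, then output port 5 of C.
CONGESTED_PATH = (3, 5)

# Hypothetical packets waiting in upstream switch A.
# Each value is (output port at B, output port at the switch after B).
packets_in_a = {
    "p1": (3, 5),  # heads for the congested port of C
    "p2": (3, 7),  # passes through C but leaves via port 7
    "p3": (3, 2),  # passes through C but leaves via port 2
    "p4": (1, 4),  # leaves B on a different port and never reaches C
}


def held_port_only(packets, congested_port_on_b):
    """Packets A holds when it is told only that a port of B is congested."""
    return sorted(pid for pid, path in packets.items()
                  if path[0] == congested_port_on_b)


def held_path_aware(packets, congested_path):
    """Packets A would hold if it knew the full congested path."""
    n = len(congested_path)
    return sorted(pid for pid, path in packets.items()
                  if path[:n] == congested_path)


port_only = held_port_only(packets_in_a, 3)
path_aware = held_path_aware(packets_in_a, CONGESTED_PATH)
unnecessary = sorted(set(port_only) - set(path_aware))

print(port_only)    # ['p1', 'p2', 'p3']
print(path_aware)   # ['p1']
print(unnecessary)  # ['p2', 'p3']
```

Under these assumptions, two of the three packets held in switch A could have been forwarded without adding to the congestion at output port 5 of switch C.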
Therefore, it is an objective of the present invention to define a mechanism that enables the fabric to identify and communicate not only congested output ports, but also congested paths to all interested switches throughout the network fabric. It is a further objective of the present invention to define this mechanism in such a way that it can be incorporated into the ASI specification in a backward compatible manner.