The design of switch fabrics for telecom, datacom, and many related applications is a long-standing problem with a long history of solutions. This section presents a general definition of the problem.
1) There are N nodes in a system which require the ability to exchange messages with each other. Each node is attached to one of N switch ports on a switch fabric. Generally any port can send a message to any other port; there are thus N² possible ingress-to-egress flows (or N×(N−1) if self-to-self flows are ruled out).
2) In any (unicast) message exchange, one port acts as an ingress port, or source of the message, and one port acts as an egress port, or sink for the message.
3) The messages are generally transmitted as payload segments, which may be of small fixed length (e.g., ATM cells of 53 bytes), or of variable length, possibly extending up to thousands of bytes (e.g., TCP/IP packets).
4) Where messages are longer than payload segments, some component of the ingress port must segment the message into multiple payload segments. Some component in the egress port must then reassemble the message from the payload segments.
5) Switches generally support multiple classes of traffic. In such systems, each payload segment carries a class identifier. Packets of differing classes may have differing switching priorities. In a switching system with N ports and C classes, there are N²*C distinct flows (flows have distinct sources, sinks, and classes).
6) Most switching systems have some policy of quality of service (QoS). In general, QoS means that higher priority (or higher priority class) payload segments should take priority over lower priority segments, but there are many possible QoS policies. Priorities are associated with classes.
7) Most switching systems have some policy of fairness. In general, fairness means:
a. within each priority class, ingress ports should fairly distribute their offered traffic over all egress ports, and
b. within each priority class, egress ports should accept offered traffic evenly from all ingress ports.
8) Switches, in general, are subject to contention. There may be more payload segments addressed to some egress port than that egress port can consume. Such over-subscription may be short lived or long lived.
9) Ideally, switches should support their QoS and fairness policies even when presented with congesting traffic loads. In fact, many switches fail to accomplish this requirement.
10) Switches should minimize the amount of ingress port to switch core and switch core to egress port bandwidth that is consumed by control information (headers and other control segments).
11) Switches should be work conserving in the sense that egress ports should be kept full whenever there is offered load anywhere on the ingress side.
12) Switches should complete the transmission of payload segments from ingress ports to egress ports with minimal latency.
There is a large variety of applications of switch fabrics. In the more general solutions, there are traffic management (TM) devices connected to each port of the fabric. These TM devices may provide buffering in the ingress and egress paths (to and from the switch core). Independent of TM buffering, there may be buffering in the switch core itself. Our present focus is on the presence or absence of buffers in the switch core, independent of the presence or absence of buffering in TM devices or other devices attached to the ports of the switch fabric. Many switch fabric designs do require core buffering; this is a significant cost which must be considered in the design of switch cores.
There have been many tradeoffs in the design of switch fabrics. It is possible to build centrally buffered fabrics which fairly enforce a defined quality of service (QoS) and achieve minimal switching latency when possible. Central buffering has severely limited the maximal size of switching fabrics, as the physical realization of these buffers (or access to these buffers) has been limited by properties of underlying technologies such as CMOS. Fairness and/or QoS have often been compromised when full central buffering is abandoned to gain scalability.
Consider a buffering switch with N ports and C traffic classes. How many central buffers does it require to avoid a buffer starvation situation that could compromise either fairness or the QoS policy? Recall that this switch has N²*C flows. Now consider the latency of the switch as any single flow runs at ‘wire speed’. This number (L), rounded up to the next highest integer, is the number of buffers required by the flow to maintain wire speed transmission. In most buffered switch designs, L is at least two, as one buffer is being filled while another is being emptied. Each buffer requires X bits.
The worst case fabric buffering requirement for this scenario is N²*C*L*X bits: a full set of wire speed buffers for each flow. Even a modest reduction in this buffer count can lead to a compromise of fairness or QoS enforcement (depending on the QoS definition). It turns out that the probability of maintaining fairness and QoS with a varying number of buffers asymptotically approaches certainty with many fewer than N²*C*L buffers. But there are always corner cases which can lead to fairness or QoS failures when there are fewer than N²*C*L buffers. Whether a small probability of failure matters is a function of the application, but there are many applications in which customers will not accept any probability of QoS failure. Often this insistence is more due to a reluctance to write control software to deal with the failure than due to the existence of the failure itself. But it is easy to imagine applications in which the possibility of failure is completely unacceptable (e.g., medical systems, flight control systems, or weapons control systems).
The key issue is that the number of buffers required for complete flow isolation (and fairness/QoS assurance) scales quadratically with the number of ports, N. In many applications, a large N is important, and this N²*C*L*X buffering cost comes to dominate the cost and even the feasibility of the switch core. The problem is much worse for applications that transfer very large payload segments (e.g., TCP/IP packets), as each payload segment requires a buffer.
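The quadratic growth of the central buffering cost can be made concrete with a short calculation. The following is a minimal sketch in Python; the function name and the sample values (C = 4 classes, L = 2 buffers per flow, X = 424 bits for a 53-byte ATM cell) are illustrative assumptions, not figures from any particular switch.

```python
def central_buffer_bits(n_ports, n_classes, buffers_per_flow, bits_per_buffer):
    """Worst-case central buffering: a full set of wire-speed buffers per flow."""
    flows = n_ports ** 2 * n_classes              # N^2 * C distinct flows
    return flows * buffers_per_flow * bits_per_buffer


# Illustrative values only (assumptions): C = 4, L = 2, X = 53 bytes = 424 bits.
for n in (16, 64, 256):
    print(n, central_buffer_bits(n, 4, 2, 53 * 8))
# prints:
# 16 868352
# 64 13893632
# 256 222298112
```

Under these assumed values, each fourfold increase in the port count multiplies the central buffering requirement by sixteen.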
There are two bodies of prior art in the area of switching technologies that are worth considering in this discussion: 1) the general area of virtual output queue (VOQ) switches, and 2) the area of request-grant switch fabric interface design.
The general concept of VOQ switches is that ingress ports maintain separate queues for each output, and that these various queues compete through the switch fabric for access to one ‘virtual’ output queue in each egress. The concept of VOQ switching is well known to those of skill in the art.
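As an illustration of the ingress-side queue structure, the following Python sketch keeps one queue per destination in an ingress port. The class and method names are hypothetical, and keying the queues by class as well as by egress port is an assumption made here to match the flow definition given earlier; it is not taken from any specific VOQ design.

```python
from collections import deque


class IngressPort:
    """Hypothetical ingress port holding one queue per (egress, class) pair."""

    def __init__(self, n_egress, n_classes):
        self.voq = {(e, c): deque()
                    for e in range(n_egress)
                    for c in range(n_classes)}

    def enqueue(self, egress, cls, segment):
        # Segments for different destinations never share a queue.
        self.voq[(egress, cls)].append(segment)

    def dequeue(self, egress, cls):
        # Return the oldest segment for this destination, or None if empty.
        queue = self.voq[(egress, cls)]
        return queue.popleft() if queue else None


port = IngressPort(n_egress=4, n_classes=2)
port.enqueue(0, 0, "seg-a")      # waiting for egress 0
port.enqueue(1, 0, "seg-b")      # waiting for egress 1
print(port.dequeue(1, 0))        # prints: seg-b
```

In this sketch the segment bound for egress 1 can be removed even though an older segment for egress 0 is still waiting, because each destination has its own queue.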
Request-grant protocols are one way to implement portions of VOQ switches. U.S. Pat. Nos. 6,212,182 and 6,515,991, both entitled “Combined Unicast and Multicast Scheduling” and assigned to Cisco Technology Inc., relate to request-grant semantics, though they relate particularly to the issue of multicast traffic. Both of those patents are incorporated herein by reference in their entirety.
The advent of request-grant switch interface semantics and virtual output queue switch design represents a significant advance in the art of switch fabric design. The key ideas of this approach are:
1) All buffering of payload contents should take place in the ingress ports, not in the switch fabric itself. This avoids the concentration of buffering costs in the critical switch fabric, where buffering costs have either increased the cost or reduced the scalability of earlier switch fabric designs. It is possible to provide full wire-speed per flow buffering when the buffering is divided into N separate portions and placed in the N ports, as each port requires only N*C*L*X bits.
2) Ingress ports send requests to the switch core. Requests carry the information that one (or more) payload segment(s) is(are) to be transferred through the switch from the ingress port which submitted the request to an egress port named in the request, at a class named in the request. The ingress port holds the associated payload segment until the switch core returns a grant to the ingress port.
3) The switch core stores received requests as counts. Each supported flow (<ingress port, egress port, class>) in the switch core has its own count.
4) The switch core treats non-zero flow counts as bids for output ports.
5) The switch core arbitrates among all bids fairly by class for access to egress ports.
6) The switch core notifies the ingress ports associated with winning bids of their success by sending them grant control segments, which indicate which flow can accept a previously requested payload segment.
7) The ingress port responds to grants by sending the payload segment associated with the grant (also associated with the earlier request). All ingress ports receive their grants in a nearly synchronous batch, and reply with their granted payload segments in a nearly synchronous wave. The switch core temporally aligns this wave of arriving payload segments (using small internal FIFOs) for synchronous switching through the internal, synchronous, switching paths. This method of switching is most efficient when all payload segments are of one common size.
8) The switch core forwards the payload segment received from the ingress port to the appropriate output port. This forwarding may be done in a cut-through manner, or in a store-and-forward manner.
FIG. 1 illustrates a standard request-grant message/time diagram. From left to right, the diagram shows an ingress port 10, a switch core 12, and an egress port 14. The left-to-right arrows represent control and/or payload messages. The top to bottom direction corresponds to increasing time. Flow number 16 indicates that the ingress port 10 sends a Request control segment to the switch core; the Request is from ingress port I, to egress port E, at class C. The switch core 12 responds by incrementing the corresponding request count, for <I, E, C>. The switch core 12 begins to consider this request count (now certainly greater than zero) in its regular arbitrations for output ports. At some time this request, <I, E, C>, wins the arbitration for egress E. At flow number 18, the switch core 12 then sends a Grant control segment containing <E, C> to port I. Ingress port I then returns, as part of flow number 20, the next payload segment destined for the flow <E, C>. The switch core forwards, as part of flow number 22, the received payload segment to egress port E. The forwarding may begin as soon as the segment begins to arrive in the core (cut-through), or may wait until the entire segment arrives (store and forward).
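The request counting and arbitration described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the method of any particular switch core: the class and method names are hypothetical, class 0 is assumed to be the highest priority, ties within a class are broken by a per-egress round-robin pointer, and each ingress port is assumed to receive at most one grant per arbitration cycle.

```python
class SwitchCore:
    """Hypothetical request-grant core: requests are stored as per-flow counts."""

    def __init__(self, n_ports, n_classes):
        self.n_ports = n_ports
        self.n_classes = n_classes
        self.counts = {}                 # (ingress, egress, cls) -> pending requests
        self.rr = [0] * n_ports          # per-egress round-robin pointer (assumed policy)

    def request(self, ingress, egress, cls, n=1):
        key = (ingress, egress, cls)
        self.counts[key] = self.counts.get(key, 0) + n

    def arbitrate(self):
        """Run one arbitration cycle and return grants as (ingress, egress, cls)."""
        grants = []
        busy_ingress = set()             # assumed: one grant per ingress per cycle
        for egress in range(self.n_ports):
            winner = self._pick(egress, busy_ingress)
            if winner is None:
                continue
            ingress, cls = winner
            self.counts[(ingress, egress, cls)] -= 1
            self.rr[egress] = (ingress + 1) % self.n_ports
            busy_ingress.add(ingress)
            grants.append((ingress, egress, cls))
        return grants

    def _pick(self, egress, busy_ingress):
        # Non-zero counts are bids; scan classes in priority order (0 = highest, assumed).
        for cls in range(self.n_classes):
            for k in range(self.n_ports):
                ingress = (self.rr[egress] + k) % self.n_ports
                if ingress in busy_ingress:
                    continue
                if self.counts.get((ingress, egress, cls), 0) > 0:
                    return ingress, cls
        return None


core = SwitchCore(n_ports=4, n_classes=2)
core.request(0, 2, 1)        # ingress 0 -> egress 2 at class 1
core.request(1, 2, 0)        # ingress 1 -> egress 2 at class 0
print(core.arbitrate())      # prints: [(1, 2, 0)]
print(core.arbitrate())      # prints: [(0, 2, 1)]
```

On receiving a grant, the ingress port would send the held payload segment for the named egress port and class. The core holds only counts, so in this model no payload is stored in the core while requests wait for arbitration.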
The primary benefits of this scheme are: 1) Payload segments need not be stored in the core (except possibly one segment per ingress port (I) during the process of forwarding the segment to the intended egress port (E)); and 2) As the switch core has a representation of every request in the form of a non-zero flow request count, the switch core can make as fair a decision as desired regarding which requests to honour during the next segment transfer time. Thus QoS can be enforced, with maximal fairness.
Request-grant semantics thus avoid central buffering, allowing buffers to be distributed over the numerous ports. This avoids the technology scaling problem. At the same time, request-grant semantics preserve fairness and QoS.
However, the request-grant protocol introduces additional latency to the process of passing payload segments through the switch core. Even in the presence of an otherwise idle ingress and egress port pair, the transfer of a payload segment must wait for a request, a successful arbitration, a grant, and the time it takes the ingress port to retrieve the payload segment to be forwarded. Added latency is undesirable in many applications, especially those in which two (or more) processes communicate very frequently in a ping-pong fashion, and only one process or the communications channel can be active at any time.
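The effect of this added latency on a ping-pong exchange can be estimated with simple arithmetic. The Python sketch below is illustrative only: the function names are hypothetical, the time values are arbitrary units chosen for the example, and the comparison baseline (a transfer that pays only the segment transfer time) is an assumption rather than a description of any existing switch.

```python
def request_grant_latency(t_request, t_arbitrate, t_grant, t_fetch, t_transfer):
    """Per-segment latency when every transfer waits for a request and a grant."""
    return t_request + t_arbitrate + t_grant + t_fetch + t_transfer


def ping_pong_time(round_trips, per_message_latency):
    """Total time when only one side is active at a time (two messages per round trip)."""
    return 2 * round_trips * per_message_latency


# Arbitrary illustrative time units (assumptions).
with_grant = request_grant_latency(1, 1, 1, 1, 2)
transfer_only = 2                                    # assumed baseline: transfer time alone
print(with_grant)                                    # prints: 6
print(ping_pong_time(1000, with_grant))              # prints: 12000
print(ping_pong_time(1000, transfer_only))           # prints: 4000
```

Because the exchange is strictly serialized, the request, arbitration, grant, and fetch times are paid in full on every message and cannot be hidden behind other traffic.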
It is, therefore, desirable to provide an approach that retains the advantages of request-grant semantics, while also supporting minimal latency when egresses are idle.