Among all the competing requirements imposed on current-generation switch fabric designs, scalability in the number of ports and cost-effectiveness are two fundamental issues to be addressed. Two ways to build a cost-effective and scalable switch fabric can be distinguished. The first option is the widely adopted single-stage switch architecture, which is very efficient but has scalability limits because its complexity grows quadratically with a linear growth in the number of ports. The second option is the multistage switch architecture, which provides higher throughput by means of more parallelism, but which is generally more complex and less efficient than single-stage switches.
A multistage switch architecture is also referred to as a Multistage Interconnection Network (MIN), i.e., a fabric arrangement of “small” single-stage switching modules interconnected via links in multiple stages, or mesh-like, in such a way that switching and link resources can be shared by multiple connections, resulting in a complexity growth smaller than N², typically in the order of N log N, where N is the total number of ports of the switch fabric. Although it is recognized that MINs are needed to obtain very high throughput and to support a large number of ports, their widespread introduction has been repeatedly postponed over the last decade. One reason is that continuous innovation in single-stage switching system design, together with new opportunities created by advances in the underlying technologies, was able to keep pace with the increases in market requirements over the same period. Also, within their range of scalability, single-stage switching architectures remain very attractive as they provide the most cost- and performance-effective way to build an electronic packet switch network.
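The complexity argument above can be made concrete with a small sketch. The model below is illustrative only: the crossbar count (N²) is exact, while the MIN count assumes a banyan-style network built from m×m modules, with log_m(N) stages of N/m modules each; exact counts depend on the chosen MIN topology.

```python
def crossbar_crosspoints(n: int) -> int:
    """Single-stage crossbar: every input connects to every output."""
    return n * n

def min_crosspoints(n: int, m: int = 4) -> int:
    """Banyan-style MIN built from m x m modules: log_m(n) stages of
    n/m modules each.  (Illustrative model; exact counts depend on
    the MIN topology.)"""
    stages, size = 0, 1
    while size < n:          # integer log_m(n), rounded up
        size *= m
        stages += 1
    return stages * (n // m) * m * m

for n in (64, 256, 1024):
    print(n, crossbar_crosspoints(n), min_crosspoints(n))
```

For N = 1024 ports and 4×4 modules, the MIN needs 20,480 crosspoints against the crossbar's 1,048,576, illustrating the N log N versus N² growth.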
Single-stage switch architectures can be classified into two types: architectures with centralized control and architectures with distributed control. The latter type consists of parallel switching domains, each having an independent scheduler (control domain). Its main drawback is the complexity overhead incurred by the load-balancing and reordering algorithms that handle the packets distributed over the multiple switch domains. In the literature, this is also referred to as Parallel Packet Switching (PPS). The switch architecture with centralized control, on the other hand, has only one switch domain, which usually consists of several switch slices operated in parallel. Operating multiple switch slices in parallel enables an increase in switch port speed and thus makes it possible to build a switching core with higher speed. This approach is used in a number of single-stage switches, as it allows systems to handle large numbers of external links by multiplexing them onto a single link of higher speed. For a given circuit technology, there is a limit to the applicability of this technique, but within its applicability range it offers the most cost-effective way to scale to larger switches. Other reasons that make single-stage switch designs based on the centralized control approach very popular are the use of a single scheduling scheme and the ability to implement any queuing structure: a shared-memory-based output-queued structure, a crossbar-based input-queued structure, or a combined input-output-queued structure.
The problem addressed by the present invention applies to switch architectures with centralized control. The aim is to provide a means to improve their inherent growth limitation. This is done by facilitating the aggregation of multiple switch elements and operating them in parallel in a so-called Port Speed Expansion mode. This improvement also indirectly applies to MIN architectures, as they are usually composed of single-stage switching modules.
In the computer community, data and pipeline parallelism have long been exploited to achieve higher bandwidth. When applied to packet switching technology in electronic networks, this translates into packets being switched over multiple parallel slices, and is sometimes referred to as Port Speed Expansion.
An early description of port speed expansion can be found in an article by W. E. Denzel, A. P. J. Engbersen, and I. Iliadis, entitled “A flexible shared-buffer switch for ATM at Gb/s rates”, published in Computer Networks and ISDN Systems, Vol. 27, No. 4, January 1995, pp. 611-624. In this paper, port speed expansion is used to expand the port rate in a modular fashion by stacking multiple slave chips and having them controlled by a single master chip.
A particular port speed expansion embodiment applied to an output queued switch architecture is also described in the European patent application EP0849917A2.
The problem addressed by the present invention can now be stated in more detail. A well-known difficulty of port speed expansion is the complexity of its implementation, due to the fact that master and slave modules have to be tightly synchronized. At high port rates, this leads to complex and/or expensive synchronization logic, which usually limits the physical degree of parallelism and thus the maximum achievable throughput. Therefore, there is a need to decouple the scalability of a port speed expansion scheme from the implementation complexity incurred by synchronization issues.
In a switch fabric core operated in port speed expansion mode, the component switches are termed either “Master” or “Slave” switches. A port speed expanded switch fabric contains one Master and one or more Slave components. Master and Slaves may be connected in any arbitrary topology, such as a chain, a ring, or a tree. The general concept of port speed expansion is now described with reference to FIG. 1, which illustrates an example related to the prior-art commercial product IBM PRS64G, where only one Slave is used. The PRS64G is a packet routing switch that implements 32 input and 32 output ports, each running at 2 Gb/s, for a total aggregate bandwidth of 64 Gb/s. Combining two of these chips in port speed expansion mode makes it possible to operate the physical ports at 4 Gb/s and to build a switch fabric with twice the aggregate bandwidth (128 Gb/s). When a packet to be switched is received by the ingress fabric interface, it is split into several parts, termed here “Logical Units” (LU's) (or, later, also termed “Segments”). In this particular example, the number of LU's equals the number of component switches, but this is not a prerequisite. Next, the ingress fabric interface sends one LU of each packet to the Master switch and the following LU to the Slave switch. The first LU contains only part of the initial packet payload, but it has the full packet header, which includes handling information. The second LU, which is passed to the Slave, contains only payload information and no routing information. The Master handles its LU according to the routing and Quality-of-Service information carried by the packet header, and then informs the Slave about its scheduling decision by sending appropriate (derived) control information to it. For every LU received by the Master, derived control information is sent to the Slave over a so-called ingress port speed expansion bus.
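The splitting scheme just described can be sketched as follows. This is a minimal model, not the actual PRS64G data format: the `LogicalUnit` type, the `split_packet` helper, and the even ceiling-division split are all assumptions for illustration. It captures the essential property that only the Master's LU carries the packet header.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LogicalUnit:
    header: Optional[bytes]   # full packet header; carried by the Master LU only
    payload: bytes

def split_packet(header: bytes, payload: bytes,
                 n_switches: int) -> List[LogicalUnit]:
    """Split one packet into one LU per component switch: the first LU
    (sent to the Master) carries the full header plus part of the
    payload; the remaining LUs (sent to Slaves) carry payload only."""
    chunk = -(-len(payload) // n_switches)   # ceiling division
    parts = [payload[i * chunk:(i + 1) * chunk] for i in range(n_switches)]
    return [LogicalUnit(header if i == 0 else None, parts[i])
            for i in range(n_switches)]
```

For the two-chip FIG. 1 example, `split_packet(hdr, payload, 2)` yields one header-bearing LU for the Master and one payload-only LU for the Slave.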
Likewise, when the Master schedules a packet to be transmitted, similar control information is sent to the Slave over an egress port speed expansion bus. Because of the propagation delay of the egress control path, the Master egress LU may actually leave earlier than the Slave egress LU. In some cases, an additional transmit synchronization mechanism may be needed between the Master and the Slave if the two outgoing LU's are required to reach the egress fabric interface at nearly the same time. From the description above, it is clear that a port speed expanded fabric calls for control of the propagation delays and a precise match of two different flows, namely the data flow from the ingress fabric interface toward the fabric core and the egress fabric interface (drawn horizontally in FIG. 1), and the control flow from the Master to one or multiple Slaves (drawn vertically in FIG. 1). Given the packet duration of the FIG. 1 example (128 ns for a 64-byte packet) and the compactness of the switch fabric core (built on a single board), this was easily achieved by ensuring that the control information reaches the Slave within one packet cycle of 128 ns, which is ample time for a single-board design in current technology.
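The 128 ns packet cycle quoted above follows directly from the packet size and the expanded port rate, since bits divided by Gb/s gives nanoseconds. A quick check of this arithmetic (the function name is chosen here for illustration):

```python
def packet_cycle_ns(packet_bytes: int, port_rate_gbps: float) -> float:
    """Time to transmit one packet at the given port rate.
    (packet_bytes * 8) bits / (Gb/s) yields nanoseconds directly."""
    return packet_bytes * 8 / port_rate_gbps

# FIG. 1 example: a 64-byte packet at the expanded 4 Gb/s port rate
# gives the 128 ns budget within which the derived control
# information must reach the Slave.
print(packet_cycle_ns(64, 4.0))  # 128.0
```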
Meanwhile, because of the continuous increase in data link rates and system sizes, speed expanded systems have become progressively more difficult to build. On one side, faster data link rates have caused packet durations to decrease while requiring a higher degree of parallelism in port speed expansion implementations. On the other side, larger system sizes have forced designers to distribute the switch fabric over multiple boards and racks, thus increasing link distances for data flows and/or control flows within the fabric. Given these stricter system requirements and larger sizes, it becomes very difficult and/or expensive to precisely control and match the propagation delays between elements that are physically distributed and for which packet durations have decreased at the same time. In particular, the multiple LU's from one packet may not arrive at the Master and the one or more Slave switches at the same or nearly the same time. In fact, LU's from completely different packets may arrive at the Master and/or the Slave switches at the same or nearly the same time.
Assuming a chain-based topology example of one Master and N−1 Slaves, as depicted in FIG. 2, a possible solution is to provide each Slave with a means to measure the latency of the control path at system initialization time, and to insert a digital programmable delay into the data path of each Slave that compensates for and matches the propagation delay of the control path. Measurement of the control path latency is done relative to a synchronization signal broadcast by the Master to all Slaves. Once the latency of the control path has been measured by each Slave, the digital programmable delay of the data path is set accordingly and individually within each Slave, so that the control and data path delays match on a packet-cycle basis. Although this proposal goes in the right direction, it solves only half of the problem, as it is not able to compensate for different latencies in the port speed expanded data paths (see Data Path Skew in FIG. 2). In fact, the proposed scheme only works if the system is rather tightly synchronized, such that all LU's sent by the ingress fabric interface reach the fabric core within a skew window that is less than one packet cycle. At a port rate in the order of 10 Gb/s (OC192), this may be achievable if the number of ports allows the physical fabric to be built compactly, say within a single electronic rack. For systems of larger dimensions and higher port rates, such as 40 Gb/s (OC768), the local synchronization method should compensate not only for the latency of the control path but also for the unpredictable skew in the propagation paths of both data and control information, and this for any (arbitrary) topology. Also, in order to be easily scalable, the method should be able to relax the synchronization constraints incurred by the port speed expansion concept.
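The per-Slave compensation described above can be sketched as a small calculation. This is a simplified model under stated assumptions: latencies are taken as already measured against the Master's synchronization signal, the 128 ns cycle of the earlier example is reused as the alignment granularity, and the function name is hypothetical. It also exhibits the scheme's stated limitation: alignment is only on a whole-packet-cycle basis.

```python
import math

PACKET_CYCLE_NS = 128  # packet cycle used as the alignment granularity

def program_data_delay(control_latency_ns: float,
                       data_latency_ns: float) -> int:
    """Number of packet cycles of programmable delay to insert into a
    Slave's data path so that it matches the measured control-path
    latency, rounded up to whole packet cycles."""
    skew_ns = control_latency_ns - data_latency_ns
    if skew_ns <= 0:
        return 0  # data already arrives no earlier than the control info
    return math.ceil(skew_ns / PACKET_CYCLE_NS)
```

For instance, a Slave measuring a 300 ns control path against a 50 ns data path would program two packet cycles of delay; residual sub-cycle skew remains uncompensated, which is exactly the half of the problem the text says this proposal leaves open.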