The present invention relates to a scheduler. More specifically, the present invention relates to the design of a scheduler that serves a number of outputs in a distributed fashion.
ATM is currently viewed as the technology behind future integrated services networks. Within these networks, it is desirable that individual flows (VCs) be able to receive a guaranteed service rate through the network. While mechanisms have been developed that enable this to be performed in an output buffered switch, these are known not to scale. For an integrated services network to be cost-effective, there is a need for these services to be provided at low cost in a large scale switch. The present invention provides an efficient approach for providing bandwidth guarantees in a scalable switch.
Currently, ATM switches are primarily constructed as output buffered or shared memory based systems because of the simplicity of making such devices non-blocking. Larger scale (10 Gbps) ATM switches are presently constructed using a layered set of output buffers: one layer that accepts traffic at the aggregate rate of the device, and another that accepts traffic at a slower rate and attempts to divide that bandwidth among a set of ports managed by the controller for this secondary memory.
Since the bandwidth managed for these ports is a valuable resource subject to contention, schedulers are used to order when connections are serviced for a given port. These schedulers are generally placed on the aforementioned secondary memory. The schedulers on these secondary memories attempt to provide service guarantees for egress traffic on the ports they manage. These service guarantees are based largely on the assumption that the main point of contention among egress flows is at the secondary memory. In actuality, only a fraction of the system bandwidth is supplied to the secondary buffering point. This, and the fact that multiple ports are commonly associated with these units, leads to their often being referred to as multiplexors/demultiplexors in the literature.
As systems of increasingly larger scale are constructed, the fraction of total system bandwidth that can be provided to a single multiplexor decreases asymptotically, reducing the accuracy of this model. When feedback flow control, such as ABR, is performed in the multiplexors of such large scale devices using incorrect system wide information, it is easy to see that the system can drive itself into persistent instability.
While operational stability is of concern to a user of the equipment, the manufacturing cost of goods is of primary concern to the company developing the switch (lower cost implies higher profit margins for a given device price). The physical area, power, cooling, and cost of output buffered switches are well known to be an N² problem; i.e., as the number of ports grows, the sum of the input bandwidth for N ports must be able to be buffered at each of the N outputs. The cost/performance of memory technology behaves as a step function: for a desired amount of bandwidth, the cost remains relatively stable or rises slowly over certain ranges, with significant jumps at certain points. While increased memory width may be used to decrease the bandwidth required per part, such systems are no longer able to pipeline accesses internally. In the limit, a single cell is stored at one address in memory (a memory 53 bytes plus overhead wide). At very high speeds, only SRAM can sustain the required access rate. SRAM devices require more transistors than DRAM to implement a memory of a given size, which increases the cost of goods for these devices. The board area, power, and cooling for these SRAM devices (which grow with N²) are known to limit the scalability of output buffered switches.
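For purposes of illustration only, the following sketch (in Python, with an assumed 2.5 Gbps link rate and assumed port counts that do not form part of the invention) shows how the write bandwidth the output buffers must sustain grows as the square of the port count:

    # Illustrative only: buffer bandwidth required in an output buffered switch.
    link_rate_gbps = 2.5                           # assumed per-port link rate
    for num_ports in (4, 16, 64):
        per_output = num_ports * link_rate_gbps    # each output must absorb traffic from all N inputs
        aggregate = num_ports * per_output         # N such outputs: total grows as N^2
        print(num_ports, per_output, aggregate)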
In systems where connectivity alone is desired, many academic solutions have centered around constructs originally designed to perform circuit switching, such as banyan and Batcher-banyan networks, and even feedback based networks that include the aforementioned as components. These switches are often simulated under highly optimistic assumptions of uniform traffic distributions and lightly loaded networks. Real data networks contain servers for file systems, web pages, and additional services; these functions are valuable resources in themselves and cause the output traffic distribution to be asymmetric. The global Internet utilizes a core set of protocols, with TCP/IP being the most often used pair. The TCP stacks on end systems attempt to keep traffic in the network so that whenever bandwidth becomes available, it may be used by the applications.
Because these devices are based on circuit switching constructs, their key metric is blocking probability (an output link remaining idle while cells destined for it are enqueued in the system). However, even under the optimistic assumptions used by their designers, analysis often shows perceptible blocking probability (which is zero for output buffered switches). These switches are also centralized in nature, i.e., the entire switch core is located on a chip, or a set of chips co-located on a board. This impacts the ability to construct fault tolerant devices. Network devices, including switches and routers, within such networks are thus often placed under high loads. Some of these switches would restrict or drop traffic for uncongested ports if other ports became congested for even a short period of time (tens of cells). It is for these reasons that such switches have not found commercial success.
These circuit switch based devices generally had buffers placed at their inputs. Extensive analysis has been done on the tradeoffs of input versus output queued switches. In a non-blocking input buffered switch with FIFO queuing, when the cell at the head of the queue is blocked due to contention for a given output port, all cells behind it within the queue are prevented from being transmitted, even when their output ports are idle. This situation is called head-of-line (HOL) blocking. It is a well known problem that, in the presence of uniformly distributed traffic across all ports, limits switch throughput to 58% of the bandwidth of the connecting links [M. Karol, M. Hluchyj, and S. Morgan, "Input Versus Output Queuing on a Space-Division Packet Switch," IEEE Transactions on Communications, 35(12):1347-1356, December 1987]. In fact, throughput can fall as low as that of a single link [S. Li, "Theory of Periodic Contention and its Application to Packet Switching," in Proceedings of IEEE INFOCOM '88, 320-325, March 1988].
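For purposes of illustration only, the 58% figure can be approximated with the following saturation model (a Python sketch; the port count, slot count, and random tie-breaking policy are assumptions and do not form part of the invention):

    import random

    def hol_saturation_throughput(num_ports=32, slots=50000, seed=1):
        # Every input is saturated; only its head-of-line (HOL) cell is eligible for service.
        random.seed(seed)
        hol = [random.randrange(num_ports) for _ in range(num_ports)]
        served = 0
        for _ in range(slots):
            contenders = {}
            for inp, out in enumerate(hol):
                contenders.setdefault(out, []).append(inp)
            for out, inputs in contenders.items():
                winner = random.choice(inputs)             # one cell forwarded per output per slot
                hol[winner] = random.randrange(num_ports)  # winner exposes a fresh HOL cell
                served += 1                                # losers keep blocking the cells behind them
        return served / (num_ports * slots)

    print(hol_saturation_throughput())   # approaches 2 - sqrt(2), about 0.586, for large N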
Although such switches have poor throughput performance, the prospect of avoiding buffering at the aggregate switch rate has encouraged further study in this field. [T. Anderson, S. Owicki, J. Saxe, and C. Thacker, "High Speed Switch Scheduling for Local Area Networks"] separates data forwarding from system scheduling, utilizing per connection queues at the inputs and a crossbar with a centralized switch scheduler. Fixed size frames are used to support guaranteed traffic. While this solves the blocking problem of earlier input queued switches, many limitations remain. Its guarantees are rather coarse grained. A crossbar is actually not an expensive mechanism in high speed switches, as the number of internal ports is low; the key problem is the centralized scheduler. While satisfactory for a local area switch of its time, a centralized scheduler presents an unacceptable failure point for the large scale enterprise or WAN switches required in the next few years.
Noting that the performance of large scale systems is limited by the bandwidth on the internal links, [F. M. Chiussi, Y. Xia, and V. P. Kumar, "Backpressure in Shared-Memory-Based ATM Switches under Multiplexed Bursty Sources," in Proceedings of IEEE INFOCOM '96] explored a switch using buffers at the inputs, at the outputs, and within the switch core. While this was shown to yield dramatic improvements in buffering requirements, no methods were proposed for providing bandwidth guarantees.
What is needed is a mechanism for providing a wide array of fine grained connection guarantees in a large scale networking device at moderate cost. It is among the objects of the invention to overcome the aforementioned limitations of the prior art by providing a method and apparatus for constructing a distributed scheduler for a cell switched network.
The present invention pertains to a telecommunications switch. The switch comprises a first output port mechanism through which sessions having cells are sent at a total session rate to a network. The switch comprises a first input port mechanism through which sessions are received from the network. The first input port mechanism is connected to the first output port mechanism. The first input port mechanism has a first guaranteed session rate. The switch comprises a second input port mechanism through which sessions are received from the network. The second input port mechanism is connected to the first output port mechanism. The second input port mechanism has a second guaranteed session rate; the sum of all guaranteed session rates is less than or equal to the total session rate. The switch comprises a first scheduler connected to the first and second input port mechanisms and to the first output port mechanism for scheduling sessions of the input port mechanisms for service. The switch comprises a server for providing service to sessions of the input port mechanisms. The server is connected to the first and second input port mechanisms and to the first output port mechanism.
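For purposes of illustration only, the relationship among these elements may be sketched as follows (illustrative Python data structures; the names, rates, and admission check are assumptions and not claim language):

    from dataclasses import dataclass, field

    @dataclass
    class InputPortMechanism:
        name: str
        guaranteed_session_rate: float          # cells per second guaranteed to this input's sessions
        queue: list = field(default_factory=list)

    @dataclass
    class OutputPortMechanism:
        name: str
        total_session_rate: float               # rate at which cells are sent to the network
        inputs: list = field(default_factory=list)

        def connect(self, port: InputPortMechanism) -> bool:
            # The guarantees remain feasible only if the sum of guaranteed session
            # rates does not exceed the total session rate of this output.
            committed = sum(p.guaranteed_session_rate for p in self.inputs)
            if committed + port.guaranteed_session_rate <= self.total_session_rate:
                self.inputs.append(port)
                return True
            return False

    out = OutputPortMechanism("out-1", total_session_rate=353208.0)   # assumed cell rate
    out.connect(InputPortMechanism("in-1", guaranteed_session_rate=150000.0))
    out.connect(InputPortMechanism("in-2", guaranteed_session_rate=150000.0))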
The present invention pertains to a method for switching sessions having cells. The method comprises the steps of receiving a first session having cells at a first input port mechanism of a switch. Then there is the step of storing the first session in a first input queue of the first input port mechanism. Next there is the step of receiving a second session at a second input port mechanism of the switch. Then there is the step of storing the second session in a second input queue of the second input port mechanism. Next there is the step of providing service from a server to the first session at a first guaranteed session rate. Then there is the step of transferring cells of the first session to a first output queue of a first output queue mechanism. Next there is the step of sending the cells of the first session out of the switch to a network with a first output card connected to the first output queue and the network. Then there is the step of providing service from the server to the second session at a second guaranteed session rate. Next there is the step of transferring cells of the second session to the first output queue. Then there is the step of sending the cells of the second session out of the switch to the network.
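For purposes of illustration only, the ordering of the service steps can be pictured with the following rate proportional loop (a Python sketch; the deficit-style accounting is an illustrative stand-in for the server and scheduler, not the claimed mechanism):

    def serve(sessions, output_queue, cycles=100):
        # sessions: list of (input_queue, guaranteed_rate) pairs; values are illustrative.
        total_rate = sum(rate for _, rate in sessions)
        credit = [0.0] * len(sessions)
        for _ in range(cycles):
            for i, (queue, rate) in enumerate(sessions):
                credit[i] += rate / total_rate             # share of one cell slot earned this cycle
                while queue and credit[i] >= 1.0:
                    output_queue.append(queue.pop(0))      # transfer a cell toward the output card
                    credit[i] -= 1.0
        return output_queue

    first_session = ["c1", "c2", "c3"]
    second_session = ["d1", "d2"]
    print(serve([(first_session, 2.0), (second_session, 1.0)], []))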
The present invention pertains to a method for building a scheduler for a large scale switch. In particular, this invention describes how to provide bandwidth and delay bounds in a buffered crossbar switch. In such a switch, buffers are maintained internal to the switch for each pair of input and output nodes. When a cell is sent from the switch core to an output, a credit is returned to the input that had sent the cell into the switch core. An input may send a cell to any output for which it has a credit. While prior art mechanisms have employed these techniques to reduce the complexity of switch design, they were unable to provide bandwidth or delay guarantees. This invention utilizes a scheduled hierarchy within the crossbar switch and at the input nodes to select the order in which cells may pass through the switch core. Separate matrix buffer pairs are maintained at each node for all source nodes within its section, for destinations at the node itself and at its images in adjoining sections. These buffers enable scheduling decisions to be made with minimal local information, are small enough to fit on chip, and utilize a credit mechanism to denote when buffers are available. Credits are eventually returned to the source section of the nodes (which provide data into the matrix). The source section contains a per connection input queue which buffers all traffic arriving on its input port(s). Cells are scheduled for destination nodes (the output port interface section) within the switch that have buffer credits, based on the relative needs of these destination nodes. This enables a very large switch to be constructed that provides per-flow guarantees in a distributed manner. Prior art schedulers assume an output buffered architecture, and prior art large scale switches provide only connectivity.
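For purposes of illustration only, the credit flow through the crosspoint buffers can be summarized in the following sketch (Python; the buffer depth, method names, and drain policy are assumptions and do not form part of the invention):

    class CrosspointBuffer:
        # One small buffer inside the switch core for each (input node, output node) pair.
        def __init__(self, depth=4):
            self.cells = []
            self.credits = depth                  # the input may send only while credits remain

    class BufferedCrossbar:
        def __init__(self, num_inputs, num_outputs, depth=4):
            self.xp = [[CrosspointBuffer(depth) for _ in range(num_outputs)]
                       for _ in range(num_inputs)]

        def can_send(self, inp, out):
            return self.xp[inp][out].credits > 0

        def send(self, inp, out, cell):
            # The input consumes a credit when it forwards a cell into the core.
            buf = self.xp[inp][out]
            assert buf.credits > 0
            buf.credits -= 1
            buf.cells.append(cell)

        def drain(self, out, num_inputs):
            # The output pulls one cell from one of its crosspoint buffers;
            # the freed slot is returned to the sending input as a credit.
            for inp in range(num_inputs):
                buf = self.xp[inp][out]
                if buf.cells:
                    cell = buf.cells.pop(0)
                    buf.credits += 1
                    return inp, cell
            return None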