1. Field of the Invention
The present invention relates to communication networks and more specifically to network switches and associated switch to switch protocols which provide improved bandwidth utilization and load balancing in data processing communication networks having redundant paths between network devices.
2. Related Patents
This patent is related to the following commonly owned patents: U.S. patent Ser. No. 09/228,110 entitled Load Balancing Switch Protocols, U.S. patent Ser. No. 09/228,159 entitled Identity Negotiation Switch Protocols, U.S. patent Ser. No. 09/228,913 entitled Cost Calculation in Load Balancing Switch Protocols, U.S. patent Ser. No. 09/228,087 entitled Broadcast Tree Determination in Load Balancing Switch Protocols, U.S. patent Ser. No. 09/228,918 entitled MAC Address Learning and Propagation in Load Balancing Switch Protocols, U.S. patent Ser. No. 09/228,992 entitled Path Recovery on Failure in Load Balancing Switch Protocols, and U.S. patent Ser. No. 09/228,169 entitled Discovery of Unknown MAC Addresses Using Load Balancing Switch Protocols, all of which are hereby incorporated by reference.
3. Discussion of Related Art
It is common in present computing environments to connect a plurality of computing systems and devices through a communication medium often referred to as a network. Such networks among communicating devices permit devices (or users of devices) to easily exchange and share information among the various devices. The Internet is a presently popular example of such networking on a global scale. Individual users attach their computers to the Internet, thereby enabling sharing of vast quantities of data on other computers geographically dispersed throughout the world.
Networked computing systems may be configured and graphically depicted in a wide variety of common topologies. In other words, the particular configurations of network communication links (also referred to as paths) and devices between a particular pair of devices wishing to exchange information may be widely varied. Any particular connection between two computers attached to a network may be direct or may pass through a large number of intermediate devices in the network. In addition, there may be a plurality of alternative paths through the network connecting any two network devices. Present day computing networks are therefore complex and vary in their configurations and topologies.
Most present network communication media and protocols are referred to as packet oriented. A protocol or communication medium may be said to be packet oriented in that information to be exchanged over the network is broken into discrete sized packets of information. A block of information to be transferred over the network is decomposed into one or more packets for purposes of transmission over the network. At the receiving end of the network transmission, the packets are re-assembled into the original block of data.
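The decomposition and re-assembly described above can be illustrated with a minimal sketch. The function names, the index tag carried with each packet, and the fixed payload size are assumptions made for illustration only; they are not taken from any particular protocol.

```python
def packetize(data: bytes, payload_size: int):
    """Split a block of data into (index, payload) packets of bounded size."""
    offsets = range(0, len(data), payload_size)
    return [(i, data[off:off + payload_size]) for i, off in enumerate(offsets)]


def reassemble(packets):
    """Rebuild the original block, tolerating out-of-order arrival."""
    return b"".join(payload for _, payload in sorted(packets))


block = b"information to be exchanged over the network"
packets = packetize(block, payload_size=8)
# Packets may arrive in any order; the index restores the original sequence.
restored = reassemble(reversed(packets))
```

Carrying an index with each packet is one simple way for the receiving end to restore the original order.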
In general, each packet includes embedded control and addressing information that identifies the source device which originated the transmission of the packet and which identifies the destination device to which the packet is transmitted. Identification of source and destination devices is by means of an address associated with each device. An address is an identifier which is unique within the particular computing network to identify each device associated with the network. Such addresses may be unique to only a particular network environment (i.e., a network used to interconnect a single, self-contained computing environment) or may be generated and assigned to devices so as to be globally unique in co-operation with networking standards organizations.
At the lowest level of network communication, such addresses are often referred to as MAC addresses (Media Access Control addresses). Network protocols operable above this lowest level of communication may use other addresses for other purposes in the higher level communication techniques. But at most low levels of network communication, operable on the physical link medium, an address is referred to as a MAC address.
In many present commercially available network environments, the network communication medium is in essence a bus commonly attached to a plurality of devices over which the devices exchange information. In a simple networking topology, all devices may be attached to such a bus structured common network medium. Any particular single network medium has a maximum data exchange bandwidth associated therewith. The maximum data exchange bandwidth of a medium is determined by a number of electrical and physical properties of the medium and protocols used to communicate over that medium. For example, a popular family of related network media and protocols is collectively referred to as Ethernet. Ethernet defines a standard protocol for the exchange of messages over the communication medium. A variety of communication media are also defined as part of the Ethernet family. The communication bandwidth of the Ethernet family of standards ranges from approximately 10 Mbit (million bits of information) per second to 1 Gbit per second. Therefore, a single (slow) Ethernet connection, for example, has a maximum data exchange bandwidth of approximately 10 Mbit per second.
In present network computing environments, a number of devices are used in addition to interconnected computing systems to efficiently transfer data over the network. Routers and switches are in general network devices which segregate information flows over various segments of a computer network. A segment, as used herein, is any subset of the network computing environment including devices and their respective interconnecting communication links. As noted above, a single computer network communication link has a maximum data transfer bandwidth parameter defining the maximum rate of information exchange over that network. Where all devices on a computer network share a common network medium, the maximum bandwidth of the computer network may be rapidly reached. The overall performance of the networked computing environment may be thereby reduced because information exchange requests may have to await completion of earlier information exchange requests presently utilizing the communication link.
It is often the case, however, that particular subsets of devices attached to the network have requirements for voluminous communication among members of the same subset but less of a requirement for information exchange with other devices outside their own subset. Though standard switch features generally do not include identifying such logical groupings of devices, some enhanced switching features do permit such logic to be performed within a switch device. For example, some enhanced switch features include the concept of defining and routing information based on virtual LAN (VLAN) definitions. In a VLAN, a group of devices may be defined as logically being isolated on a separate network although physically they are connected to a larger network of devices. VLAN features of enhanced switches are capable of recognizing such VLAN information and can route information appropriately so that devices in a particular VLAN are logically segregated from devices outside the VLAN.
For example, the financial department of a large corporation may have significant information exchange requirements within the financial department but comparatively insignificant needs for data exchange with other departments. Likewise, an engineering group may have significant needs for data exchange within members (computing systems and devices) of the same engineering group but not outside the engineering group. There may in fact be multiple of such subsets of devices in a typical computing network. It is therefore desirable to segregate such subsets of devices from one another so as to reduce the volume of information exchange applied to the various segments of the computer network.
In particular, a switch device is a device that filters out packets on the network destined for devices outside a defined subset (segment) and forwards information directed between computing devices on different segments of a networked computing environment. The filtering and forwarding of such information is based on configuration information within the switch that describes the data packets to be filtered and forwarded in terms of source and/or destination address information (once address locations are "learned" by the switch(es)).
Network switch devices and protocols associated therewith are also used to manage redundant paths between network devices. Where there is but a single path connecting two network devices, that single path, including all intermediate devices between the source and destination devices, represents a single point of failure in network communications between that source and destination device. It is therefore common in network computing environments to utilize a plurality of redundant paths to enhance reliability of the network. Multiple paths between two devices enhance reliability of network communication between the devices by allowing for a redundant (backup) network path to be used between two devices when a primary path fails.
FIG. 1 shows an exemplary, simple networked computing environment in which multiple paths exist for communication between devices A 100, B 102, and C 104. These exemplary network devices are each attached to one of a plurality of switches (S1 106, S2 108, S3 110, and S4 112). Each device has multiple possible paths to each of the other two devices. For example, device A 100 may exchange information with device C 104 through any of three possible paths (via switches S1 106 and S4 112, respectively). The first exemplary path is a direct path connecting device A 100 directly to device C 104 through a port on switch S1 106 and a port on switch S4 112. A second path is through switch S1 106 to switch S3 110 and then through switch S4 112. A third path is via switch S1 106, switch S2 108, and switch S4 112. These three paths may be used as redundant communication paths connecting the two devices A 100 and C 104. Where a first path fails, the second or third path may be activated to assume responsibility for exchange of information between devices A and C. In like manner, there are three paths for communication between devices A 100 and B 102 and between devices B 102 and C 104.
Where redundant paths are available in such network computing environments, it remains a problem to effectively utilize the full available bandwidth. It would be desirable to utilize all redundant paths in parallel so as to increase the available communication bandwidth between two communicating devices. Where only a single path is used, the maximum bandwidth for exchange of information is limited to that of a single communication link. Where, on the other hand, all redundant links are used in parallel, the maximum communication bandwidth is increased by the number of links used in parallel. For example, as shown in FIG. 1, the communication bandwidth between any of the devices could, in theory, be increased by up to a factor of three.
However, as presently practiced in the art, protocols used among switch devices (e.g., S1 106 through S4 112) render such parallel communication paths unusable. Switches 106 through 112 as presently practiced in the art often use a protocol commonly referred to as "spanning tree" to discover the existence of redundant communication paths as known to a network of switches. The spanning tree protocol is described in detail in a proposed IEEE standard P802.1p entitled Standard for Local and Metropolitan Area Networks Supplement to Media Access Control (MAC) Bridges: Traffic Class Expediting and Dynamic Multicast Filtering.
The spanning tree protocol as implemented in switches broadcasts (more precisely multicasts) information from the switch out to all devices that recognize the selected multicast address connected to paths from the switch. A multicast message is one which is directed to all devices rather than to a particular destination address on the network. The information in the multicast message describes the address forwarding information known to that switch. From such information shared among all the switches, each switch can derive the various paths in the network. Each switch device so attached to the multicasting device receives the information and forwards (multicasts) the message to each device attached to it (except the path from which it directly received the message), and so on. If such a multicast message returns on a path to the originating device, a loop must exist among the paths connecting the various switches. To reduce the number of messages generated on the network by virtue of such multicast messages, the spanning tree protocol requires that redundant paths so discovered be disabled. In a large network without spanning tree protocol to disable redundant paths, received multicast messages can "cascade" from each receiving switch to all other attached switches. The volume of such cascading messages may grow rapidly or even exponentially. Such multicast messages exchanged among the switches may in fact require a substantial portion of the available communication bandwidth of the network. Such conditions are often referred to as "broadcast storms."
The spanning tree protocol therefore requires the disabling of redundant paths to avoid broadcast storms. Only when a path is known to have failed will a redundant path be enabled and used for the exchange of data. The spanning tree protocol therefore precludes aggregation of the available bandwidth to improve communication bandwidth by using multiple redundant paths in parallel. FIG. 2 is a block diagram of the same exemplary network of FIG. 1 where three communication links 114 between the switches have been disabled to prevent loops in the network and the resultant broadcast storm otherwise inherent in the spanning tree protocol.
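The effect of disabling redundant paths can be illustrated with a simplified sketch. It is not the spanning tree protocol itself, which elects a root and exchanges messages among the switches; instead it builds a breadth-first tree from an arbitrarily chosen root over a hypothetical, fully meshed four-switch topology (not necessarily the topology of FIG. 1) and reports which links would be left enabled and which would be disabled.

```python
from collections import deque


def spanning_tree_links(links, root):
    """Return (enabled, disabled) link sets for a breadth-first tree at root."""
    enabled = set()
    visited = {root}
    queue = deque([root])
    while queue:
        switch = queue.popleft()
        for neighbor in sorted(links[switch]):
            if neighbor not in visited:
                visited.add(neighbor)
                enabled.add(frozenset((switch, neighbor)))
                queue.append(neighbor)
    all_links = {frozenset((a, b)) for a in links for b in links[a]}
    return enabled, all_links - enabled


# Hypothetical topology: every switch is linked to every other switch.
mesh = {
    "S1": ["S2", "S3", "S4"],
    "S2": ["S1", "S3", "S4"],
    "S3": ["S1", "S2", "S4"],
    "S4": ["S1", "S2", "S3"],
}
enabled, disabled = spanning_tree_links(mesh, root="S3")
```

With S3 chosen as the root in this assumed topology, three of the six links are disabled, including the direct link between S1 and S4, so traffic between those two switches must pass through S3.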
Another problem with the spanning tree protocol arises from the fact that a preferred path may be unavailable due to the need to disable paths that cause loops among the switches. For example, as shown in FIG. 2, the preferred path between switches S1 106 and S4 112 may be the direct one which is disabled. To leave this direct communication link enabled would permit loops in the paths among the switches. Rather, a more circuitous route through switches S1 106, S3 110 and S4 112 must be used to exchange information between switches S1 106 and S4 112. The spanning tree protocol does not assure that the best path between two switches will be left enabled. Rather, it merely attempts to assure that some path between switches is available, specifically, a relatively minimal path connecting all switches, that is, a spanning tree.
The spanning tree protocol therefore precludes maximizing use of available bandwidth in a network of switches.
Some switches have provided a partial solution to this problem by using a technique known as "trunking." Where there are multiple paths directly between two switches, the multiple paths serve as redundant communication paths but are trunked by the switches and treated logically as though they were a single path with higher maximum bandwidth. FIG. 3 is a block diagram of the same exemplary network environment of FIG. 2 wherein a plurality of communication paths between switch S1 106 and S3 110 are trunked. The communication path between switches S1 106 and S3 110 is therefore capable of using the trunked paths between them as though they were a single connection in terms of the spanning tree protocols. Since the redundant paths are treated as a single path for purposes of the spanning tree protocols, they need not be shut down to preclude broadcast storms.
However, trunking does not address the bandwidth issue in a broad sense. Rather, the trunking technique is only applicable where the multiple paths are between a particular pair of switches. The bandwidth limit is merely shifted from that of a single communication link to that of the number of communication links supported by a single switch.
It is a further problem that by precluding use of redundant links between switches, the spanning tree protocol also precludes the ability to balance communication loads among the redundant paths between switches. Where such multiple paths are allowed to simultaneously operate, it would be desirable for the switches to distribute the packet exchange communication among them over such multiple paths. Such distribution, often referred to as load balancing, further enhances the ability of the network to optimize the utilization of available throughput in the network of switches. Underutilized paths may be used to offload packet communication on overloaded paths.
It is therefore a problem in present networks of switches to simultaneously operate redundant paths between switches of the network to thereby maximize utilization of available bandwidth and to thereby communicate among the switches to balance communication loads over redundant paths.
It is a particular problem to efficiently propagate cost information associated with each link between switches of the network. As presently practiced in the art, cost information is propagated whenever there is a change in cost values associated with a link or switch in the network. Many network applications tend to produce bursts of network traffic as particular users or applications start and stop their respective activities. These bursts represent potentially significant changes in the loads among various switches in a network. It is therefore a problem to rapidly adapt the bandwidth utilization of a network of switches to balance the load of network packet exchange over a plurality of redundant paths.
This cost information is used by switches of the network to determine preferred or optimal paths between the switches for further exchange of packets. Optimal paths may vary depending upon the operation status of switches in the network. If a switch or link is temporarily disabled, cost information regarding related paths in the network may not be available until after an attempt to use the failed path. Alternate paths, presently optimal paths, are therefore potentially unknown to other switches in the network until time has been wasted in futile attempts to use the failed path.
It is evident from the above discussion that improved switches and associated protocols are needed to manage redundant communication paths while permitting improved bandwidth utilization of all communication links in a network. In particular, an improved cost information propagation protocol is needed to allow switches in the network to more efficiently discover changes in cost values in the network and to recompute optimal paths through the network resulting from changes in the cost information.
The present invention solves the above and other problems, thereby advancing the state of useful arts, by providing network switch devices and associated switch to switch protocols which permit the operation of multiple links throughout the network involving multiple switches, and which provide for improved utilization of the aggregate bandwidth of all paths in the network. Further, the present invention provides a switch to switch protocol propagating cost information among cooperating network switches operable in accordance with the present invention.
By permitting parallel use of all communication paths and switches in the network, the present invention improves scalability of network communication bandwidth as compared to prior techniques. The aggregate bandwidth capability within the entire network may be increased by simply adding additional communication paths and associated switches.
In particular, the present invention includes a protocol operable between switches designed in accordance with the present invention for exchanging cost information regarding the various paths between the switches. Following the initial hello protocol and loop bit negotiation protocol to determine the switches which are in a load balance domain, the cost propagation protocol of the present invention is used to inform all switches of the relative costs of communicating over each of the paths interconnecting the switches.
As used herein, costs include any parameter values that affect the cost of transmitting data through the corresponding switch as measured in financial units and/or performance units. In the preferred embodiment, a single cost parameter is included which identifies the packet switching latency of a switch. The packet switching latency is a measure of the time required for the switch to process a packet received on one port and determine on which (if any) other port the packet should be forwarded. This latency time is a key aspect of the performance of networks having switches in their communication paths. Switches in general are designed to rapidly switch packets from reception port to a transmission port. However, configuration parameters and design factors within each switch may vary the latency time of packet processing within the switch.
The cost propagation protocols of the present invention therefore enable latency, and other cost information, to be efficiently exchanged among the switches of a network. Unlike prior cost propagation techniques, the methods of the present invention periodically generate cost information so as to maintain updated cost information within all switches of the network (all within a load balance domain). By periodically updating cost information, the present invention avoids congestion of network message traffic as compared to techniques which update cost information each time any cost parameter changes.
Further, the protocol of the present invention is more robust than prior solutions. Should a switch in the load balance domain lose its cost information tables (perhaps by an inadvertent reset), the cost information will be updated at the next scheduled, periodic exchange.
At each periodic update of cost information, each switch generates a switch cost packet and transmits the packet out on each of its ports in the load balance domain. The switch cost packet includes a hop count field and at least one cost information portion. A switch which receives the switch cost packet updates its cost records in a table entry corresponding to the transmitting switch. As noted herein below, a loop bit offset value may be used as a unique ID to rapidly locate the corresponding table entry. The receiving switch then increments the hop count value in the switch cost packet and transmits the updated packet out on each of its ports within the load balance domain (other than the port from which the packet was just received). The process repeats for each switch receiving the updated switch cost packets from other switches until the process converges by updating all cost information in all switches of the load balance domain.
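The propagation just described can be sketched as a small simulation. The topology, the function name, the dictionary-based cost table, and the hop limit used to bound the flood are all assumptions for illustration; the packet is reduced to a latency value and a hop count.

```python
from collections import deque


def flood_cost_packet(links, origin, latency, max_hops=4):
    """Flood one switch cost packet from origin.

    Returns {switch: {"latency": ..., "hops": ...}}, the entry each receiving
    switch stores for the originating switch.
    """
    tables = {}
    # Each queue item: (receiving switch, sending switch, hop count in packet).
    queue = deque((neighbor, origin, 0) for neighbor in links[origin])
    while queue:
        receiver, sender, hops = queue.popleft()
        if receiver == origin:
            continue  # the originator keeps no entry for itself
        entry = tables.get(receiver)
        if entry is None or hops < entry["hops"]:
            tables[receiver] = {"latency": latency, "hops": hops}
        hops += 1  # increment the hop count before forwarding
        if hops > max_hops:
            continue  # assumed hop limit keeps the flood finite
        for neighbor in links[receiver]:
            if neighbor != sender:  # never send back out the receiving port
                queue.append((neighbor, receiver, hops))
    return tables


# Hypothetical load balance domain of four switches.
domain = {
    "S1": ["S2", "S3", "S4"],
    "S2": ["S1", "S4"],
    "S3": ["S1", "S4"],
    "S4": ["S1", "S2", "S3"],
}
tables = flood_cost_packet(domain, origin="S2", latency=12)
```

In this assumed topology every other switch ends up with an entry for S2; S1 and S4 receive the packet directly, while S3 receives it after one forwarding step.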
Several techniques are employed in parallel by the methods of the present invention to assure that the process will converge and converge rapidly with all switches receiving current cost information. First, if the hop count exceeds a predetermined maximum value, the packet is discarded rather than transmitted on to another switch. The predetermined value assures that the cost propagation packet will not "wander" through the network of switches for longer than a predetermined maximum time (equivalent to the time required to traverse switches for the maximum hop count). A retransmission counter in the cost packet is also used to assure that the packet is not "stuck" in a retransmission from one switch to another. Each cost information packet is acknowledged by the receiving switch with a packet (a cost information response packet) sent back to the originating switch. When the originating switch does not receive an acknowledgment after a timeout period, the cost information packet is retransmitted by the originating switch (through the port attached to the receiving switch which failed to acknowledge the previous transmission). The retransmission count is incremented each time the cost packet is retransmitted from a switch. If the retransmission count exceeds a predetermined count value, the cost packet transmission is terminated. Like the hop count, this technique prevents a particular cost packet transmission from absorbing too much time within the transmitting switch.
Secondly, the loop bit offset value is used (as discussed herein below) as a unique identifier for each switch in a load balance domain. The value assigned to each switch is indicative of a bit position in a bit mask field of the cost packet (as well as other packets discussed herein below). Each switch which transmits a cost packet (the originator of the packet as well as each switch which retransmits the cost packet) sets its corresponding bit as described by the loop bit offset value. When a cost packet is received, if the receiving switch's bit in the bit mask is set, the packet has made a loop back to the receiving switch. In such a case, the received cost packet is discarded (not transmitted further) because it has already been processed by the receiving switch.
A sequence number field in the packets is used to identify a particular occurrence of the periodic exchange of cost information among the switches. Until the cost information exchanges among the switches converge, all cost information packets associated with this occurrence of the exchange use the same sequence number. Each switch increments the sequence number field in cost information packets generated by that switch prior to the next periodic exchange of cost information. The sequence number for cost packets received from other switches to be propagated through the load balance domain is left intact as the cost information is forwarded for propagation in the load balance domain.
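The acknowledgment and retransmission limit can be sketched as follows. The `transmit` callable, its return convention, and the default limit of three retransmissions are hypothetical; a real switch would wait on a timer for the cost information response packet rather than call a function.

```python
def send_with_retries(transmit, max_retransmissions=3):
    """Send a cost packet until it is acknowledged or the limit is exceeded.

    transmit(retransmit_count) is a hypothetical callable that sends the
    packet carrying the given retransmission count and returns True if an
    acknowledgment arrived before the timeout. Returns the number of
    retransmissions that were needed, or None if transmission was terminated.
    """
    retransmit_count = 0
    while True:
        if transmit(retransmit_count):
            return retransmit_count
        retransmit_count += 1  # incremented on every retransmission
        if retransmit_count > max_retransmissions:
            return None  # give up rather than tie up the transmitting switch


attempts = []


def acknowledged_on_third_try(count):
    """Simulated port: the first two transmissions go unacknowledged."""
    attempts.append(count)
    return count >= 2


result = send_with_retries(acknowledged_on_third_try)
```

Bounding the count means an unresponsive neighbor costs the transmitting switch at most a fixed number of timeout periods.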
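The loop check on the bit mask can be sketched in a few lines. The function name and the use of a plain integer for the bit mask field are assumptions for illustration.

```python
def receive_cost_packet(bit_mask, my_loop_bit_offset):
    """Return the bit mask to forward, or None if the packet has looped back."""
    my_bit = 1 << my_loop_bit_offset
    if bit_mask & my_bit:
        return None  # this switch already transmitted the packet: discard
    return bit_mask | my_bit  # set our bit before transmitting the packet on


# The originator (loop bit offset 0) sets its own bit when it transmits.
mask_from_origin = 1 << 0
# A switch with offset 2 receives the packet and forwards it with its bit set.
mask_after_forward = receive_cost_packet(mask_from_origin, 2)
# The packet then finds its way back to the originator, which discards it.
looped = receive_cost_packet(mask_after_forward, 0)
```

Because each switch owns one bit position, the test costs a single mask operation regardless of how many switches the packet has visited.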
A current set of cost information saved within the switches is retained while the next periodic exchange of cost information is attempting to converge. New cost information eventually replaces previous cost information in the switches once the exchange of the new cost information converges to completion among the switches.
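One way to picture the retention of current cost information during an exchange is a pair of tables, sketched below. The class, its method names, and the explicit `converged` call are hypothetical; how a switch detects convergence is not specified in this summary and is assumed here to be signaled externally.

```python
class CostTables:
    """Keeps the current costs usable while a new exchange is in progress."""

    def __init__(self):
        self.current = {}  # consulted when choosing paths
        self.pending = {}  # filled by the exchange now in progress

    def record(self, origin, sequence_number, cost):
        """Store a cost received during the exchange in progress."""
        self.pending[origin] = (sequence_number, cost)

    def converged(self):
        """Assumed external signal: the new set replaces the previous one."""
        self.current = dict(self.pending)
        self.pending.clear()

    def cost_of(self, origin):
        """Return the current cost for a switch, or None if unknown."""
        entry = self.current.get(origin)
        return None if entry is None else entry[1]


tables = CostTables()
tables.record("S2", sequence_number=7, cost=12)
before = tables.cost_of("S2")  # not yet visible: exchange still in progress
tables.converged()
after = tables.cost_of("S2")
tables.record("S2", sequence_number=8, cost=20)
during_next = tables.cost_of("S2")  # previous value is still in use
```

Path decisions made while an exchange is still in progress therefore see a complete, if slightly older, set of costs rather than a partially updated one.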
Still another aspect of the present invention lies in the architecture wherein cost information is calculated and propagated at the data link layer of the network communication protocols. As is known in the art, computer network communication protocols are often designed in accordance with a standard layered model. For example, the well known ISO (International Organization for Standardization) defines an OSI (Open Systems Interconnection) standard model for computer network communications. The model has seven layers with the lowest layer, the physical layer, being designated as layer 1.
The cost calculation and propagation of the present invention is performed at the data link layer (i.e., layer 2 of the seven layer model). By comparison, prior techniques performed such cost information exchange at layers 3 and higher. By computing and exchanging such information at the data link layer (layer 2), the methods and switches of the present invention more readily utilize real-time, physical link performance information and measurements to estimate actual link costs.
The cost information propagation protocol of the present invention therefore rapidly converges on complete dissemination of cost information among all switches of a load balance domain. Further, the protocols of the present invention provide consistent periodic updates of cost information about all switches in the load balance domain to all switches in the domain. Still further, the protocols of the present invention estimate actual link costs using information available at lower layers of the protocol "stack" (i.e., at the data link level (layer 2) of the ISO/OSI model).
The above, and other features, aspects and advantages of the present invention will become apparent from the following descriptions and attached drawings.