The present invention relates in general to the field of network communication. Specifically, the present invention provides a method and system for coordinated monitoring and failure detection of one or more Local Area Network channels (LAN channels), thereby restoring full connectivity between hosts in the network.
Networks can be broadly classified as Local Area Networks (LANs), Metropolitan Area Networks (MANs) and Wide Area Networks (WANs). Of these, a LAN is a system that is restricted to a few miles and uses high-speed connections. It is a short-haul communication system that connects electronic devices in a building or a group of buildings within a few square kilometers. The electronic devices may include hosts (processing units such as computers, printers or other peripheral devices), controllers, switches, and gateways. These electronic devices in the network communicate with each other through communication channels. These communication channels are generally referred to as LAN channels. Underlying a LAN channel are various physical devices. Examples of the physical devices include LAN adapters that connect various hosts to the network, a cable or a bus that connects the LAN adapters to a port on a network hub, the network switches that provide connectivity to each host, and the cables or buses that interconnect these network switches.
The full operation of a LAN channel may be disrupted by a failure in any one of these underlying physical devices. Such a failure is commonly referred to as a ‘single point of failure’. A single host may lose its ability to communicate on the LAN channel if its LAN adapter fails. The loss of communication can also take place in the case of a failure in the cable connecting a LAN adapter to a network switch, or the port on the network switch to which the host connects. The failure of some physical devices might also cause several hosts to lose their ability to communicate on the LAN channel. For example, if one of the network switches underlying a LAN channel fails, then all the hosts that are connected through that network switch will lose their ability to communicate on that LAN channel. However, other hosts, which connect to the LAN channel through an operational underlying network switch, may not lose their ability to communicate on that same LAN channel. This is an instance of a partially operational LAN channel. A LAN channel is said to be fully operational if connectivity to that LAN channel is operational for all hosts configured to communicate on that LAN channel.
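The distinction between a fully operational and a partially operational LAN channel can be illustrated with a minimal sketch. The function name `channel_status`, the host-to-switch mapping, and the third label "not operational" are assumptions introduced only for this illustration. The sketch considers switch failures only, with one switch per host.

```python
from typing import Dict, Set


def channel_status(host_to_switch: Dict[str, str], failed_switches: Set[str]) -> str:
    """Classify a LAN channel from the hosts' point of view.

    host_to_switch maps each configured host to the (hypothetical) switch
    through which it reaches the channel. Only switch failures are modelled.
    """
    # A host keeps its connectivity only if its switch is still operational.
    connected = [h for h, sw in host_to_switch.items() if sw not in failed_switches]
    if len(connected) == len(host_to_switch):
        return "fully operational"      # every configured host can communicate
    if connected:
        return "partially operational"  # some, but not all, hosts can communicate
    return "not operational"            # label assumed for this sketch


hosts = {"A": "sw1", "B": "sw1", "C": "sw2"}
print(channel_status(hosts, set()))    # no failure
print(channel_status(hosts, {"sw1"}))  # hosts A and B lose connectivity, C does not
```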
Resiliency is the ability of the network to maintain a fully operational communication channel in spite of the failure of one or more physical devices underlying the communication channel. This preserves the ability of all the hosts to communicate with each other. Networks can be designed in a variety of ways, providing varying degrees of resiliency. For example, a host may be configured with a plurality of LAN adapters, each of which connects the host to the same LAN channel, but only one of them (known as the active LAN adapter) is used at any time. The others (known as standby LAN adapters) remain inactive. If either the active LAN adapter or the cable connecting the active LAN adapter to the network switch fails, a standby LAN adapter can be used to restore the host's connectivity to the LAN channel. If all of the hosts are configured in this manner, then the LAN channel can remain fully operational even if the active LAN adapters in multiple hosts fail. On the other hand, if the network switch that provides connectivity to one or more hosts fails, then the LAN channel may not be fully operational. This could happen even when all the hosts are configured with a plurality of LAN adapters, unless multiple network switches are used to create the LAN channel, the active and standby LAN adapters in each host are connected to different network switches, and all the network switches underlying the LAN channel are interconnected. Such a configuration would enable the LAN channel to remain fully operational in spite of the failure of any one physical component underlying that LAN channel.
However, beyond the additional physical network devices and interconnections, a control mechanism is required to ensure that, of the plurality of physical paths possible under such a configuration, only one is used as the underlying path for communication. Without this control, loops may be created in the network, and these loops can have highly undesirable effects on the LAN channel, such as broadcast storms, and can prevent the LAN channel from operating efficiently under certain conditions.
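The active/standby arrangement described above can be sketched as follows. This is an illustrative model only: the `Adapter` class, the `select_active` function, and the adapter and switch names are hypothetical and are not taken from any particular implementation.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class Adapter:
    name: str             # hypothetical adapter name, e.g. "lan0"
    switch: str           # switch to which the adapter is cabled
    healthy: bool = True  # False if the adapter or its cable has failed


def select_active(adapters: List[Adapter],
                  failed_switches: FrozenSet[str] = frozenset()) -> Optional[Adapter]:
    """Return the first usable adapter in preference order, or None."""
    for adapter in adapters:
        if adapter.healthy and adapter.switch not in failed_switches:
            return adapter
    return None  # the host has lost connectivity to the LAN channel


# Active and standby adapters are cabled to different switches.
host = [Adapter("lan0", "sw1"), Adapter("lan1", "sw2")]
print(select_active(host).name)                      # lan0 is the active adapter
print(select_active(host, frozenset({"sw1"})).name)  # standby lan1 takes over
```

If both adapters were cabled to the same switch, the failure of that switch would leave `select_active` with no usable adapter, which is why the text calls for the active and standby adapters to connect to different switches.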
Therefore, a method is required to maintain the optimality of the underlying path, as far as its performance is concerned. The underlying path between two hosts is optimal if that path traverses a minimum number of intervening switches. This minimizes latency in communication. Without a method to coordinate and control the selection of an alternate path in the event of a failure in the LAN channel, the alternate path may include the traversal of more switches than were included in the original path. In this scenario, although the network is resilient from a connectivity perspective, it is suboptimal from a performance perspective.
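The effect of an uncoordinated failover on path optimality can be illustrated with a breadth-first search that counts the switches traversed between two hosts. The topology, node names, and the function `intervening_switches` are assumptions made for this sketch. Hosts are assumed not to forward traffic, so only switches are treated as transit nodes.

```python
from collections import deque


def intervening_switches(links, switches, src, dst, failed=frozenset()):
    """Minimum number of switches on a path from host src to host dst.

    Returns None if dst is unreachable. Only nodes listed in `switches`
    may forward traffic; nodes in `failed` are removed from the topology.
    """
    adj = {}
    for a, b in links:
        if a in failed or b in failed:
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node] - 1  # links on the path minus one = switches traversed
        if node != src and node not in switches:
            continue               # other hosts are not transit nodes
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return None


SWITCHES = {"sw1", "sw2", "sw3"}
LINKS = [("A", "sw1"), ("A", "sw2"), ("B", "sw1"), ("B", "sw3"),
         ("sw1", "sw2"), ("sw2", "sw3"), ("sw1", "sw3")]

print(intervening_switches(LINKS, SWITCHES, "A", "B"))           # original path
print(intervening_switches(LINKS, SWITCHES, "A", "B", {"sw1"}))  # after sw1 fails
```

In this hypothetical topology, hosts A and B normally communicate through the single switch sw1. After sw1 fails, connectivity is preserved through sw2 and sw3, but the path now traverses two switches, which is the resilient but suboptimal situation described above.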
There are techniques available in the art for choosing a path at the time of the failure of a channel. One such technique is the Spanning Tree Protocol (STP). The STP defines a tree that spans all the switches in the network. Further, the STP forces certain redundant data paths into a standby (blocked) state. If one network segment in the spanning tree becomes unreachable, the STP algorithm reconfigures the spanning-tree topology and re-establishes the link by activating the standby path. The algorithm calculates the cost of communication of all the possible tree formations and selects the one with the lowest cost of communication. The cost of communication of a segment of the channel is defined as a standard data rate divided by the bandwidth of the segment and is typically based on a guideline established as part of the IEEE 802.1D standard. The aggregate of the costs of all segments throughout the channel is known as the cost of communication of that channel.
Although the STP tries to optimize the communication between any two points in a network, it fails to ensure an optimal path between two hosts at the time of the failure of the channel. This is because there is no provision for updating the STP regarding the failure of a communication channel outside its realm of operation. Other limitations of the STP include its complexity and high cost of operation. The STP also requires intricate network design, exhaustive failure testing, and expensive maintenance. In addition, the STP has a long convergence time, which introduces latency when a channel fails. Latency is the delay in the communication of data packets in the network, and is a result of the processing of a packet as it propagates from one node to another in the network.
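The cost calculation described above can be sketched as follows. The reference rate of 1000 Mb/s is an assumption chosen for illustration, since the recommended cost values differ between revisions of the standard. The function names and candidate paths are likewise hypothetical.

```python
# Assumed reference ("standard") data rate in Mb/s, for illustration only.
REFERENCE_RATE_MBPS = 1000.0


def segment_cost(bandwidth_mbps: float) -> float:
    """Cost of one segment: the reference rate divided by the segment bandwidth."""
    return REFERENCE_RATE_MBPS / bandwidth_mbps


def channel_cost(segment_bandwidths) -> float:
    """Cost of a channel: the sum of the costs of its segments."""
    return sum(segment_cost(bw) for bw in segment_bandwidths)


def lowest_cost(candidates: dict) -> str:
    """Name of the candidate with the lowest aggregate cost."""
    return min(candidates, key=lambda name: channel_cost(candidates[name]))


# Two hypothetical candidates, each given as a list of segment bandwidths in Mb/s.
candidates = {
    "direct_10M": [10],         # one slow segment
    "via_two_100M": [100, 100]  # two faster segments
}
for name, segments in candidates.items():
    print(name, channel_cost(segments))
print("selected:", lowest_cost(candidates))
```

Under these assumed figures, the two-segment candidate is selected because its aggregate cost is lower, even though it traverses more segments. Cost-based selection therefore does not necessarily minimize the number of switches traversed.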
To achieve optimal communication in a network along with minimum latency, the communication status at every node in the network has to be dynamically monitored, and usage of network resources has to be coordinated when a communication channel fails. There are systems available in the prior art that provide techniques for monitoring and coordinating usage of resources in a network. One such technique known in the art is described in U.S. Patent Application No. US20020126635, entitled ‘System and Method for Switching between Frequency Channels in Wireless LAN’, filed by the KDDI Corporation. The technique provides a method for switching frequencies in a wireless LAN. According to this technique, a manager, which is a part of a switching system, monitors the line condition with the help of the stations. The line condition, as found by the manager, is then communicated to a frequency channel switch. The frequency channel switch selects the frequency channel on the basis of the judgment of a judging unit. In the case of a changeover to other communication channels, the switching unit sends a request for the changeover to all stations, and coordinates a changeover to the alternate communication channel. The technique is related to preserving the overall quality of communication, by dynamically monitoring the state of communication. However, the system does not address the method of recovery from a single point of failure while preserving the optimality of the communication path. Further, the decision to switch is made by the switching apparatus, which may not be optimal for all the stations.
The prior art techniques described above suffer from one or more of the following limitations. First, these techniques are not able to ensure an optimal path while providing resiliency. Second, these techniques do not avoid latency in identifying an alternate channel when a channel fails. Third, these techniques do not ensure optimal usage of LAN resources. Fourth, the choice of switching to an alternate channel is made unilaterally by the switching module on behalf of all the hosts. Fifth, the alternate channels are not continually monitored for their readiness to be adopted by some or all of the hosts. As a result, the hosts may have to change channels again if the first alternate channel cannot be adopted.
In light of the above discussion, there is a need for a method and system for providing an optimal path for communication in the event of a failure. The system should employ network resources optimally, thereby minimizing the requirements for worst-case connectivity. The system should also provide a mechanism for choosing an alternate channel with minimum latency if a channel fails. The system should permit the hosts to participate in and coordinate a changeover to an alternate channel. Finally, the system should periodically test the operability of all alternate channels and update all hosts on the status of these alternate channels.