This invention relates to communications networks. More particularly, this invention relates to an improved system and method for quickly recovering from failures or error conditions.
A telecommunications network transports information from a source to a destination. The source and destination may be in close proximity, such as in an office environment, or thousands of miles apart, such as in a system transmitting credit card transaction data throughout the United States. The information (traffic), which may be, for example, computer data, voice transmissions, or video programming, usually enters and leaves a network at nodes (also termed backbone switches or offices), and is transported through the network via links and nodes. The overall traffic comprises multiple data streams which may be combined in various ways and sent on common links.
Nodes are devices or structures that direct traffic into, out of, and through a network. They can be implemented electronically, mechanically, optically, or in combinations thereof, and are known in the art. Nodes range in complexity from simple switching or relay devices to entire buildings containing thousands of devices and controls. Nodes in a network can be controlled by a central network operations center ("NOC") and can be programmed with varying degrees of automated traffic-managing capabilities. Links, which may be termed trunks, connect nodes and transmit data between nodes.
A node may become inoperative in a number of ways: for example, through a power outage, a flood, or an abnormal number of messages flooding the network. A link can become inoperative in numerous ways, but most often becomes inoperative as a result of being cut. A network error condition or a network failure is any condition or occurrence that adversely affects the performance of a network or interrupts traffic flow; such a condition may affect only a portion of the network. For example, an error condition may be the failure of a link, a software or control failure, or an overload condition.
Because of the significant volume of traffic typically transported by a network, any disruption in traffic flow can be devastating to large numbers of users transmitting information. The ability to quickly restore network service should a portion of the network become inoperative is of high priority.
A frame relay network is a communications network which transmits data in variable-length packets between two points. A frame relay network may accept data in a frame relay format, convert the data to asynchronous transfer mode ("ATM"), transmit the data in ATM form, and convert the data back to frame relay form when the data leaves the network. ATM uses packets of a fixed length. Thus, in such a network a variable-length frame entering the network may be broken up into multiple packets of a set length, which are reassembled into the frame when the data leaves the network.
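The segmentation and reassembly described above may be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the 48-byte payload size is an assumption taken from the standard ATM cell format rather than from this description, and passing the original frame length separately is a simplification of how a real adaptation layer records it.

```python
def segment(frame: bytes, payload_size: int = 48) -> list[bytes]:
    """Split a variable-length frame into fixed-length payloads.

    The final payload is zero-padded so that every payload has the same
    length (payload_size is an assumed value, see the lead-in).
    """
    cells = []
    for start in range(0, len(frame), payload_size):
        chunk = frame[start:start + payload_size]
        cells.append(chunk.ljust(payload_size, b"\x00"))
    return cells


def reassemble(cells: list[bytes], original_length: int) -> bytes:
    """Join the payloads and strip the padding added by segment()."""
    return b"".join(cells)[:original_length]


frame = b"x" * 100                      # a 100-byte frame entering the network
cells = segment(frame)                  # fixed-length packets carried inside
restored = reassemble(cells, len(frame))
```

Here the 100-byte frame occupies three fixed-length packets, the last of which is mostly padding, and the frame leaving the network is identical to the one that entered.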
Traffic is routed through a network via a path, a physical or logical route between two points in a network. A path between any two nodes is a route allowing for data transmission between those two nodes; a path may be one link, or may be comprised of multiple links and nodes and other network elements. The length of a path is an indication of the amount of equipment comprising the path; for example, meters of fiber or number of hops (links separated by nodes). A network may transmit data via virtual circuits. A virtual circuit is a path transmitting data between two endpoints in a manner giving the appearance that a dedicated path exists between the two endpoints; in reality any of numerous paths, each path having multiple links and nodes, may be used to connect the two endpoints. For any number of reasons a network may reconfigure a virtual circuit, i.e., change the routing scheme of the virtual circuit.
In one frame relay network, users transmitting data may have a router at a user site for connecting with the frame relay network via an edge vehicle switch (located remotely from the user site) which in turn connects to a node in the network. A user sends data in frame relay form to the network via the router and edge vehicle switch.
The edge vehicle switch converts the data to packets of standard length. The packets are sent through the network via a virtual circuit. An edge vehicle switch connecting to one end of the virtual circuit converts the data to frame relay form and transmits the data to a router located at a user site.
A permanent virtual circuit ("PVC") is a virtual circuit having a path which is relatively stable over time. In one known network, each PVC is owned by a master node. The master node owning a PVC establishes, monitors, and maintains the PVC and is typically one of the two endpoint nodes for the PVC. Each node is responsible for allocating the capacity of the trunks directly connected to it. Most, if not all, nodes in such a network are both master nodes, owning many PVCs, and via nodes, part of many PVCs owned by other nodes. Establishing a PVC involves finding a path for the PVC. The master node determines a path based on its knowledge of network capacity and transmits requests to numerous potential via nodes in the network. A requested via node responds negatively to a request only if the master node is incorrect as to the trunk capacity allocated by the requested via node, and the trunks for which the requested via node is responsible do not have the capacity to participate in the PVC.
In such a network a failure of a network component, e.g., a node or link, affects multiple PVCs. For example, if one node fails, data cannot flow on the numerous PVCs which use that node as a via node. The affected PVCs must be rerouted: for each PVC the master node owning the PVC must select a set of nodes from the remaining healthy nodes in the network to re-form the PVC. This must be done quickly, and must be done for numerous PVCs, as the failure of even a single node or link may interrupt data transmission for many PVCs.
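By way of illustration, the selection of a replacement path from the remaining healthy nodes may be sketched as a search that skips failed nodes. The description does not state how a master node chooses a path; the fewest-hop breadth-first search, the function name `reroute`, and the six-node topology below are all assumptions made for the example.

```python
from collections import deque


def reroute(links, source, destination, failed):
    """Return a fewest-hop path that avoids failed nodes, or None."""
    if source in failed or destination in failed:
        return None
    previous = {source: None}           # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == destination:
            path = []
            while node is not None:     # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in links.get(node, ()):
            if neighbor not in failed and neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical topology: two disjoint routes between A and F.
links = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "E"],
    "D": ["B", "F"],
    "E": ["C", "F"],
    "F": ["D", "E"],
}
original = reroute(links, "A", "F", failed=set())
rerouted = reroute(links, "A", "F", failed={"B"})
```

With no failures the path runs through B and D; when B fails, the same PVC is re-formed through C and E. Every PVC that used B as a via node would need a comparable search, which is why a single failure generates many simultaneous reroute attempts.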
The reestablishment of a PVC requires the use of network resources such as the processing time of nodes and the communications resources of the network. In certain networks, on the occurrence of relatively small failures, e.g., the failure of two nodes in a 200 node network, master nodes may recover (i.e., reestablish their PVCs and perform other tasks) simultaneously without interfering significantly with each other's recovery. However, on the occurrence of a major disruption, for example, the failure of a majority of the nodes, the load on various network resources from recovering nodes results in interference between nodes trying to reestablish PVCs, which results in inefficiencies delaying overall network recovery.
As part of a node's recovery process, the node queries and receives responses from other nodes to determine whether the other nodes may become via nodes in PVCs owned by the node. Potential via nodes may accept or decline to become part of a PVC based on the capacity of trunks local to the potential via node and on the resource requirements of the PVC. While a potential via node is being queried by one master node, it is unavailable for querying by another master node. Furthermore, when a via node accepts a master node request, it must reconfigure its equipment to become part of that PVC; it is unavailable to respond to other PVC requests during this time. When a node is unable to respond to the PVC request of a second master node because it is responding to the PVC request of a first master node, a collision occurs; the second master node must back-off and attempt the reroute of the entire PVC at a later time. A collision may also occur if a first master node queries a second master node which is busy making a via request of a potential via node. In general, a collision occurs when two objects or devices in a system attempt to access the same resource at the same time, when the resource can service only one object or device.
A collision and the subsequent reroute reattempt waste the resources of both the master node and the nodes already existing in the PVC being constructed, and lengthen the recovery time of the master node and thus that of the entire network. Since a via node is typically part of multiple PVCs, one master node's recovery may thus interfere with the recovery of other master nodes. This creates problems when major network outages occur and large numbers of nodes are attempting to recover simultaneously. A master node making a reroute attempt and experiencing a collision after rerouting a portion of a PVC has, during its reroute attempt, created interference with other master nodes which is unnecessary, as that master node has not achieved an actual reroute.
During normal operations and during recovery, each node in such a network performs a certain amount of background processing. Each node has a certain amount of processor capacity, used for background processing and rerouting activities. Rerouting activities load a node's processor, increasing processor occupancy (a measure of the fraction of time a processor is working as opposed to idle). Rerouting PVCs requires a certain amount of processing in addition to background processing on the part of a master node (generally resulting from contacting via nodes). If the combination of the amounts of background processing and rerouting processing increases enough, the amount of rerouting that can be done may be limited. A request to a via node to participate in a PVC consumes a portion of the processing resources for that node. In addition, a request also prevents another request from taking place to that via node. Thus a collision increases the processing load of both master and via nodes and increases the time for overall network recovery.
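The collision behavior described above, including the wasted partial reroute, may be sketched with a simple model in which each via node can serve only one master node at a time. The class and function names are hypothetical, and the model deliberately ignores trunk capacity and timing.

```python
class ViaNode:
    """A via node that can respond to only one master node at a time."""

    def __init__(self, name):
        self.name = name
        self.busy_with = None           # master currently being served

    def request(self, master):
        if self.busy_with is not None and self.busy_with != master:
            return False                # collision: busy with another master
        self.busy_with = master
        return True

    def release(self, master):
        if self.busy_with == master:
            self.busy_with = None


def attempt_reroute(master, path):
    """Query each via node in turn; abandon the whole attempt on a collision."""
    acquired = []
    for node in path:
        if not node.request(master):
            for held in acquired:       # the partial reroute is thrown away
                held.release(master)
            return False
        acquired.append(node)
    return True


x, y, z = ViaNode("X"), ViaNode("Y"), ViaNode("Z")
first = attempt_reroute("M1", [x, y])   # M1 occupies X and Y
second = attempt_reroute("M2", [z, y])  # M2 takes Z, then collides at Y
z_free_after_collision = z.busy_with is None
x.release("M1")
y.release("M1")
retry = attempt_reroute("M2", [z, y])   # succeeds once M1 has finished
```

In this sketch M2 occupies Z before colliding at Y, so Z was unavailable to other master nodes for a reroute that never completed; this is the unnecessary interference the preceding paragraph describes.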
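The way background processing limits rerouting may be illustrated with a simple budget calculation. The linear model, the function name, and the figures used (an occupancy ceiling of 900 ms of processor time per second and a cost of 2 ms per reroute request) are assumptions chosen for the example and are not taken from any particular network.

```python
def max_reroutes_per_second(background_ms, cost_per_reroute_ms, limit_ms=900):
    """Reroute requests that fit in the processor time left after background work.

    All quantities are milliseconds of processor time per second of elapsed
    time, so limit_ms=900 corresponds to an assumed 90% occupancy ceiling.
    """
    spare_ms = max(limit_ms - background_ms, 0)
    return spare_ms // cost_per_reroute_ms


light_load = max_reroutes_per_second(background_ms=300, cost_per_reroute_ms=2)
heavy_load = max_reroutes_per_second(background_ms=850, cost_per_reroute_ms=2)
saturated = max_reroutes_per_second(background_ms=950, cost_per_reroute_ms=2)
```

Under these assumed figures a lightly loaded node can sustain 300 reroute requests per second, a heavily loaded node only 25, and a node whose background processing already exceeds the ceiling none at all.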
Timing limitations on nodes in a network may be created to minimize interference among the nodes during recovery. A node may be given a set time to wait before re-querying a via node or between PVC creation reattempts; this time may be increased in the event of a collision. Increasing the interval between master node queries of via nodes or PVC creation reattempts decreases interference, which increases the efficiency of the recovery; however, increasing this interval increases the recovery time of each node and thus of the whole network. These factors must be balanced if an interval is to be effective in improving overall recovery time. Current methods of creating timing limitations, relying on a dynamic increase in a timing delay in response to collisions, do not optimally reflect the relationship between network interference and reroute attempts, and do not balance the need to avoid interference and the need to recover quickly.
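A dynamically increasing delay of the kind attributed above to current methods may be sketched as follows. The doubling factor, the cap, and the function name are assumptions; the description states only that the wait time may be increased in the event of a collision.

```python
def retry_wait(base_wait, collisions, factor=2.0, max_wait=60.0):
    """Seconds to wait before the next reroute attempt.

    The wait grows by `factor` for every collision seen so far and is
    capped at `max_wait` (both values are assumed for illustration).
    """
    return min(base_wait * factor ** collisions, max_wait)


waits = [retry_wait(1.0, n) for n in range(8)]
```

The sequence of waits rises from one second to the cap after a handful of collisions. Each increase reduces the chance of a further collision but adds directly to the recovery time of the node, which is the trade-off that such a purely reactive scheme does not balance.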
In current networks, nodes may be manually divided into sets which recover at different times. The sets are created according to an operator's guess as to the interference between nodes; such a guess may be based on, for example, the geographic location of the nodes. Such a method provides at best an approximation of the true interference between nodes in a network, which may be based on a complex network and PVC architecture. A method of dividing the nodes in a network into sets based on accurate information as to the interference between recovering nodes does not exist.
Two or more nodes recovering at the same time interfere minimally with each other when, according to some measure, each node's queries to via nodes result in a minimum number of collisions with other nodes' queries to via nodes. Objects or devices, including nodes, may interfere with each other's operation when they compete for the same resource; such interference may occur, for example, when two nodes in a network make a via request to the same node during the same time period. Two or more objects competing for resources at the same time interfere minimally with each other when each object's use of the resources results in a minimum number of collisions with the other objects' use of the resources. Entities such as objects or devices may be, for example, applications or modules in a computing device, physical nodes competing for access to other nodes, or any other entities which perform activities which may interfere with or compete with each other. Activities entities perform may be, for example, accessing limited operating system resources, communicating with nodes, or any other activity.
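One possible measure of the interference between two master nodes is the number of via nodes that both would query during recovery. The sketch below uses that count purely as an assumed example of "some measure"; the function names and the sample via-node sets are hypothetical.

```python
from itertools import combinations


def interference(vias_a, vias_b):
    """Count the via nodes that both master nodes would query."""
    return len(set(vias_a) & set(vias_b))


def interference_table(via_sets):
    """Pairwise interference for every pair of master nodes."""
    return {
        (a, b): interference(via_sets[a], via_sets[b])
        for a, b in combinations(sorted(via_sets), 2)
    }


# Hypothetical via nodes used by the PVCs of three master nodes.
via_sets = {
    "M1": {"V1", "V2", "V3"},
    "M2": {"V2", "V3", "V4"},
    "M3": {"V5"},
}
table = interference_table(via_sets)
```

In this example M1 and M2 share two via nodes and would be likely to collide if they recovered together, while M3 shares none with either and could recover alongside either of them.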
In view of the foregoing, there is a need for organizing the activities of items such as devices, objects, or nodes competing for the same resources such that the interference between items is minimized. There is a need for a method to determine the amount of interference between devices, objects, or nodes competing for the same resources (e.g., master nodes in a network competing for via nodes). There is a need for organizing devices, objects, or nodes competing for the same resources into sets such that the interference between items within sets is minimized; if such sets access resources in overlapping time periods there is a need to provide an optimum sequence for such activity so that interference among sets is minimized. There is a need to provide a timing limitation on objects such as nodes competing for resources which accurately reflects the relationship between accessing resources and interference among access attempts.
It would be desirable to provide a system and method for allowing a set of nodes in a network to reestablish connections in the network in the quickest manner possible. There is a need for a system and method for organizing the recovery of network equipment to minimize interference between the network components and thus maximize recovery efficiency and speed. There is a need to provide a measure of inter-node recovery interference and to separate nodes into sets, or into a sequence of sets, where this interference is minimized. There is a need to provide an accurate timing limitation on recovering nodes in a network to minimize interference.
An embodiment of the system and method of the present invention organizes the recovery of a communications network to minimize the interference between the recovering nodes and thus allows for a faster recovery. Alternate embodiments may organize any system of objects or devices competing for resources to minimize interference among those objects or devices.
An embodiment of the system and method of the present invention calculates a metered rate at which nodes recover from a major network failure, based on the architecture of the network and the characteristics of the nodes in the network, and of the virtual circuits forming the network. An optimum metered rate is calculated, at which the network recovers quickly but without performance degrading interference. An embodiment creates a measure of the interference between recovering nodes; the measure of interference may be used to partition the set of recovering nodes into subsets, where the recovery process of each node within a subset interferes minimally with the recovery processes of other nodes within that subset. The subsets recover at different times, reducing overall recovery interference and speeding recovery. Recovering items (nodes or sets of nodes) may be sequenced so that each item recovers substantially separately in time, but where adjacent sequence items recover with some temporal overlap. The amount of interference occurring between adjacent items in the sequence is minimized.
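The partitioning step may be illustrated with a simple greedy heuristic that places each node in the subset where it adds the least interference. This is a sketch under stated assumptions and is not presented as the partitioning method of the embodiment: the heuristic, the function names, and the interference weights are all hypothetical.

```python
def pair_weight(weight, a, b):
    """Interference between two nodes; unlisted pairs do not interfere."""
    return weight.get(frozenset((a, b)), 0)


def partition(nodes, weight, subset_count):
    """Greedily assign each node to the subset where it adds least interference."""
    subsets = [[] for _ in range(subset_count)]

    def total(node):
        return sum(pair_weight(weight, node, other) for other in nodes if other != node)

    def added(node, subset):
        return sum(pair_weight(weight, node, other) for other in subset)

    # Place the most-interfering nodes first; ties go to the smaller subset.
    for node in sorted(nodes, key=lambda n: -total(n)):
        best = min(subsets, key=lambda s: (added(node, s), len(s)))
        best.append(node)
    return subsets


def internal_interference(subsets, weight):
    """Total interference between nodes that recover in the same subset."""
    return sum(
        pair_weight(weight, a, b)
        for subset in subsets
        for i, a in enumerate(subset)
        for b in subset[i + 1:]
    )


# Hypothetical interference weights between four master nodes.
weight = {
    frozenset(("M1", "M2")): 5,
    frozenset(("M3", "M4")): 4,
    frozenset(("M1", "M3")): 1,
}
subsets = partition(["M1", "M2", "M3", "M4"], weight, subset_count=2)
remaining = internal_interference(subsets, weight)
```

With these weights the heavily interfering pairs (M1 with M2, and M3 with M4) are placed in different subsets, so that no interference remains between nodes recovering at the same time. A greedy assignment of this kind is not guaranteed to find the best partition in general.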
Embodiments of the system and method of the present invention may organize the activities of any system of items such as objects or devices such that interference among items is minimized. An optimum rate for activities undertaken by items may be created to minimize overall interference. A measure of interference between items competing for resources may be created; this measure may be used to partition and sequence the items.