Networks can be broadly classified as Local Area Networks (LANs), Metropolitan Area Networks (MANs) and Wide Area Networks (WANs). These networks connect electronic devices together and enable them to communicate with one another. The electronic devices may include terminal communication devices (e.g. smart phones, laptops, computers, tablets, etc.), servers, hosts (processing units such as computers, printers or other peripheral devices), controllers, switches, gateways, and other network elements. These electronic devices in the network communicate with each other through communication channels. Underlying these channels are various physical devices. Examples of the physical devices include adapters that connect various network elements to the network, a cable or a bus that connects the adapters to a port on a network hub, the network switches that provide connectivity to each network element and the cables or buses that interconnect these network switches.
The full operation of a channel may be disrupted by a failure in any one of these underlying physical devices. The loss of communication can also take place in the case of a failure in the cable connecting an adapter to a network switch, or the port on the network switch to which the network element connects. The failure of some physical devices might also cause several network elements to lose their ability to communicate with one another. For example, if one of the network switches underlying a channel fails, then all the network elements that are connected through that network switch will lose their ability to communicate on that channel. However, other network elements, which connect to that channel through an operational underlying network switch, may not lose their ability to communicate on that same channel. This is an instance of a partially operational channel. A channel is said to be fully operational if connectivity to that channel is operational for all network elements configured to communicate on that channel.
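By way of illustration only, the distinction between a fully operational and a partially operational channel can be sketched in Python as follows. The function name, the mapping of network elements to switches, and the third "not operational" status are assumptions made for this sketch and are not taken from any particular system.

```python
from enum import Enum


class ChannelStatus(Enum):
    FULLY_OPERATIONAL = "fully operational"
    PARTIALLY_OPERATIONAL = "partially operational"
    # Assumed extra state for the case where no element retains connectivity.
    NOT_OPERATIONAL = "not operational"


def channel_status(element_to_switch, failed_switches):
    """Classify a channel given which switch each configured element uses.

    element_to_switch: dict mapping a network element name to the name of
        the switch through which it connects to the channel (hypothetical).
    failed_switches: set of switch names that have failed.
    """
    if not element_to_switch:
        raise ValueError("a channel needs at least one configured element")

    # An element keeps connectivity only if its underlying switch is operational.
    connected = [
        element
        for element, switch in element_to_switch.items()
        if switch not in failed_switches
    ]

    if len(connected) == len(element_to_switch):
        return ChannelStatus.FULLY_OPERATIONAL
    if connected:
        return ChannelStatus.PARTIALLY_OPERATIONAL
    return ChannelStatus.NOT_OPERATIONAL
```

For example, with three elements of which two connect through a switch that has failed, the channel would be classified as partially operational, because the third element can still communicate on it.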
Most networks offer many channels or pathways through which devices are connected to and communicate with one another. Networks are usually designed so that if one channel, pathway or device in a network fails, communication among the network elements can be rerouted through another pathway, channel or device. Several methods are known in the art for monitoring communication networks, identifying failures and rerouting communications in the network. Such systems and methods typically involve not only monitoring but also providing network element status reports to participants in the network. Examples of systems and methods for monitoring communications in a data network are disclosed in United States Patent Application Publication Nos. 2005/0144505 and 2006/0126654.
The loss of communication may be considered temporary if a connection is restored within a few seconds or a few minutes. The failure may disrupt communication to or from some but not all of the network elements in the network. If several network elements are communicating with one another when a failure occurs that affects some but not all of those network elements, some network elements will receive information that other network elements do not receive.
Consequently, we have determined that there is a need for a method that will not only identify network failures, but also track communications activity that occurs during a failure and during the restart of an application by user devices utilizing a service when communication to that service is restored. In very large scale deployments, for example those serving 40,000 to 50,000 users, the recovery of an application session by some users and its associated device monitors is only one aspect that impacts system performance during the computing function recovery phase. To obtain current information, the computing function must also discover the current call and non-call status of each device in order to properly reflect this status toward its end users, a capability that is typically not supported or provided for in most communication systems.
A new system and method is needed for recovering application sessions that temporarily experience a failure. We have developed such a system and method. Embodiments of our system and method may permit user device activity to be monitored during an outage so that such data may be utilized for updating an application restarted by a device that experienced a failure if that device recovers or restarts the application within a predetermined time period after the outage occurs.
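Purely as a hedged sketch of the idea described above, and not as a definitive implementation of any embodiment, the following Python fragment records device activity during an outage and hands it back only if the device recovers within a predetermined time period. The class name, method names, event strings and the use of `None` to signal an expired window are all assumptions introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class OutageRecorder:
    """Buffers activity for devices in an outage (hypothetical helper)."""

    recovery_window_s: float  # the predetermined time period, in seconds
    outage_started_at: dict = field(default_factory=dict)  # device id -> start time
    missed_events: dict = field(default_factory=dict)      # device id -> list of events

    def outage_began(self, device_id, now):
        # Start buffering activity for this device from the outage time onward.
        self.outage_started_at[device_id] = now
        self.missed_events[device_id] = []

    def record(self, device_id, event):
        # Activity is kept only for devices currently in an outage.
        if device_id in self.missed_events:
            self.missed_events[device_id].append(event)

    def device_recovered(self, device_id, now):
        """Return buffered events if recovery fell within the window, else None."""
        started = self.outage_started_at.pop(device_id, None)
        events = self.missed_events.pop(device_id, [])
        if started is None:
            return None  # no outage was recorded for this device
        if now - started <= self.recovery_window_s:
            return events  # may be used to update the restarted application
        return None  # window expired; assumed to require a full resynchronization
```

In this sketch, a device that recovers 30 seconds into a 60 second window would receive the events it missed, while one that recovers after the window has elapsed would receive nothing and would have to obtain its state by other means.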