This invention relates in general to network management systems and methods, and more particularly to a method and system for asymmetrically maintaining system operability.
Network management systems that are used to monitor and administer complex data processing and communications systems, such as telecommunication systems, typically use fault detection or other techniques to identify failures within the systems. One technique used to temporarily avoid complete system failure resulting from the failure of an element therein, such as a computer or server, includes the use of redundant elements that may take over or perform the functions of the failed element. In many cases, typical fault detection measures may not detect, within a reasonable time, certain failures within communication media such as a bus, or within integrated circuits.
Many systems employ keepalive mechanisms to detect failures within the systems. Keepalive mechanisms are typically similarly or identically configured within two elements. Thus, each element is able to monitor the other and to detect whether the other has failed within a reasonable amount of time, to avoid disruption of system operability. In typical keepalive mechanisms, each element transmits messages to the other element, expecting a response to each message to be reflected back. After several messages have been transmitted with no response, the sending element assumes that the other element has failed.
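By way of illustration only, the conventional keepalive behavior described above may be sketched as follows. The class name `KeepaliveMonitor`, the parameter `max_missed`, and the default threshold of three unanswered messages are hypothetical and chosen for illustration; the description above says only that "several" messages go unanswered before failure is assumed.

```python
class KeepaliveMonitor:
    """Counts consecutive unanswered keepalive messages for one peer.

    Hypothetical sketch: max_missed stands in for the "several messages"
    mentioned in the text; its default value is an assumption.
    """

    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self.missed = 0

    def record(self, response_received):
        """Record the outcome of one transmitted keepalive message."""
        if response_received:
            # Any reflected response resets the count of missed replies.
            self.missed = 0
        else:
            self.missed += 1

    @property
    def peer_failed(self):
        """True once max_missed consecutive messages went unanswered."""
        return self.missed >= self.max_missed
```

In a symmetric arrangement, each of the two elements would run an identically configured monitor of this kind against the other.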
A problem arises when these two elements monitor each other. Should a communication medium between the two elements, such as a bus or other communication path, fail, none of the messages sent by either element reaches the other. Thus, each of the elements may erroneously believe that the other has failed, and may take actions that disrupt the system or cause a system crash. For example, each element may attempt to access the same data, or the same address on the bus. Accordingly, a need has arisen for a system and method for asymmetric failure detection that maintains system operability even in the event of a failure of a communication path between two elements.
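The failure mode described above may be illustrated with a small simulation, offered only as a sketch. It assumes that both elements are healthy, that each declares its peer failed after `max_missed` consecutive unanswered messages, and that the only variable is whether the link is up; the function name and parameters are hypothetical.

```python
def symmetric_views(link_up, rounds=3, max_missed=3):
    """Simulate two healthy, identically configured elements, A and B.

    Returns a dict mapping each element to True if that element has
    concluded that its peer failed after the given number of rounds.
    """
    missed = {"A": 0, "B": 0}
    for _ in range(rounds):
        for element in missed:
            if link_up:
                # The message and its reflected response both get through.
                missed[element] = 0
            else:
                # Nothing crosses a failed link, so no response arrives.
                missed[element] += 1
    return {element: count >= max_missed for element, count in missed.items()}


healthy = symmetric_views(link_up=True)
link_down = symmetric_views(link_up=False)
```

With the link down, `link_down` reports that both A and B regard the other as failed even though neither has failed, which is the condition in which both elements may take conflicting actions.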
In accordance with the present invention, a system and method for asymmetrically maintaining system operability is provided that substantially eliminates or reduces disadvantages or problems associated with previously developed network management systems and methods.
In one embodiment of the present invention, a system is provided for asymmetrically maintaining system operability that includes a first processing element and a second processing element coupled to the first processing element by a communication link. The first processing element is operable to perform at least one function. The second processing element is operable to perform at least one function of the first processing element in the event that the first processing element fails, and further operable to expect and receive keepalive inquiries at an expected rate from the first processing element and to send responses to those inquiries back to the first processing element. The second processing element is further operable to take remedial action after not receiving any inquiries within a first predetermined time period. In another embodiment of the present invention, the first processing element is operable to take remedial action after not receiving any response to any inquiries sent within a second predetermined time period, wherein the first predetermined time period is longer than the second predetermined time period. In other embodiments of the present invention, the first and second processing elements are routers.
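The timing relationship between the two predetermined time periods may be sketched as follows, purely as an illustration. The sketch assumes a configuration in which both elements take remedial action on timeout, and it models only which element's deadline expires first; the names `AsymmetricTimeouts` and `first_to_act`, the use of seconds, and the example values are hypothetical and do not appear in the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AsymmetricTimeouts:
    """Hypothetical container for the two predetermined time periods."""

    first_period: float   # second element's wait for inquiries, in seconds
    second_period: float  # first element's wait for responses, in seconds

    def __post_init__(self):
        # The first period must be longer than the second period.
        if self.first_period <= self.second_period:
            raise ValueError("first_period must exceed second_period")


def first_to_act(timeouts, last_inquiry_seen, last_response_seen):
    """Return which element's deadline expires first.

    last_inquiry_seen is when the second element last received an inquiry;
    last_response_seen is when the first element last received a response.
    """
    second_element_deadline = last_inquiry_seen + timeouts.first_period
    first_element_deadline = last_response_seen + timeouts.second_period
    if first_element_deadline < second_element_deadline:
        return "first element"
    return "second element"


# Assumed example values: the link fails just after traffic at t = 100 s.
timeouts = AsymmetricTimeouts(first_period=10.0, second_period=4.0)
actor = first_to_act(timeouts, last_inquiry_seen=100.0, last_response_seen=100.0)
```

Under these assumed values the first element's deadline falls at 104 s and the second element's at 110 s, so `actor` is `"first element"`. Because the two periods differ, the elements do not reach their timeouts simultaneously when the communication link fails.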
An important technical advantage of the present invention is that the system establishes a primary element and a secondary element to detect failures within the system. The invention includes the ability to detect whether the primary or the secondary element has failed, and to maintain the operability of the system in the event of such a failure. For example, the primary and secondary elements may send a service request or sound an alarm. The present invention also includes the ability to detect a failure of the communication link between the primary and secondary elements and to maintain operation in that case. Another advantage of the present invention is the ability to avoid a system state in which the primary and secondary elements each believe that the other has failed, and in which each element may reset or disable the other, rendering the system inoperable. Other technical advantages will be readily apparent to one skilled in the art from the following figures, descriptions, and claims.