1. Field of the Invention
The present invention relates to the operation management of network systems used in intranets in environments such as Internet data centers (IDCs), and relates to an apparatus and a method for automatically narrowing down, in a network in which a communication abnormality has occurred, the positions that are plausible candidates for the failure that caused the abnormality.
2. Description of the Related Art
In the field of failure detection in the operation management of networks, the status of a network is monitored by employing a configuration in which a test communication is periodically performed between two points in the network, and it is confirmed that the test communication is normally completed.
FIG. 1 shows an example of the above network system. In FIG. 1, a network 101 that is to be monitored comprises a wide area IP (Internet Protocol) communication network 116 and the following devices.
    Spoke routers 111 through 115
    Routers 117 and 118
    Switches (SW) 119, 120, 123, 124, 127, 128, 133, 134, 137, 138, 141, 142, 147, and 148
    Firewalls 121, 122, 135, and 136
    Server load balancers 125, 126, 139, and 140
    Web servers 129 through 132
    Application servers 143 through 146
    Database servers 149 and 150
In this configuration, the wide area IP communication network 116 functions as an IP-VPN (Internet Protocol-Virtual Private Network). There are two methods for performing the test communication and acquiring the data of its results, as described below.
(a) An operation management server 102 is provided at a particular point in the network as shown in FIG. 1, and test communication with the respective nodes (devices) in the network that is to be monitored is periodically performed from the operation management server 102 via switches 151 through 155. Both whether or not the communication can be performed successfully and the status of the communication are checked by using communication based on ping (Packet Internet Groper), SNMP (Simple Network Management Protocol), or the like. The communication paths are not taken into consideration. The data of the check results is accumulated in the operation management server 102, and the results are reported to a network manager 103 by, for example, displaying the devices involved in failures on a diagram showing the network.
(b) Agent programs for monitoring communication are installed in a plurality of nodes 118, 132, 145, and 149 in the network 101 as shown in FIG. 2 (node 118 often includes a preloaded agent program because it is a router). Test communication is performed between the agents, both whether or not the communication can be performed successfully and the status of the communication are checked, and the results are transferred to the operation management server 102. Thereafter, the results are reported to the network manager 103 by, for example, displaying the devices involved in failures on a diagram showing the network. In this configuration, the information on the communication path between the nodes in which the agent programs are installed is not utilized.
In both methods (a) and (b), when the operation management server 102 determines that the test communication involves an abnormality, it reports to the network manager 103 that the network communication status is abnormal with respect to the corresponding nodes by, for example, displaying the event on a screen.
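As a minimal sketch of method (a), one polling round from the management server could look like the following. The node names and the `fake_probe` function are hypothetical stand-ins for illustration; an actual system would use ping or SNMP for the probe. The result records only whether each node answered and holds no information about the path the probe took.

```python
from typing import Callable, Dict, Iterable, List


def poll_nodes(nodes: Iterable[str], probe: Callable[[str], bool]) -> Dict[str, bool]:
    """Run one round of test communication to each node and record success."""
    return {node: probe(node) for node in nodes}


def abnormal_nodes(results: Dict[str, bool]) -> List[str]:
    """Return the nodes whose test communication failed, sorted for display."""
    return sorted(node for node, ok in results.items() if not ok)


# Hypothetical probe: stands in for a ping or SNMP check (assumption).
def fake_probe(node: str) -> bool:
    return node != "router-117"


results = poll_nodes(["router-117", "router-118", "web-132"], fake_probe)
report = abnormal_nodes(results)
print(report)
```

The report names only the unreachable endpoint, which is why the devices between the server and that endpoint remain unexamined by this method.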
However, in both methods, the only fact that is grasped is whether or not communication between two particular points is normal at a particular time; the operation management server 102 does not grasp which communication paths lie between the two points. The methods of detecting failures in networks based on the above configurations involve the following problems.
(1) When communication between two points involves an abnormality, it cannot be ascertained where (in which part) between the two points the failure occurred that has caused the abnormality.
To begin with, failure detection in networks mainly aims at shortening the time period during which communication is in an abnormal state by quickly recovering the network when the failure occurs, and if the network is to be recovered quickly it is important to find, in a short time, the cause of the communication abnormality, i.e., to locate the position at which the failure has occurred.
Generally in network communication, even if there is only one position on a communication path that is blocking communication, the communication cannot be performed normally. This means that when communication between two devices involves an abnormality and there are many network devices between those two devices, the two nodes themselves and all the network devices between the two nodes are plausible candidates for having been involved in the failure that caused the abnormality, which constitutes a very large pool of candidates. When a network manager has to find and cure the failure that caused the communication abnormality, all of these nodes and devices have to be examined.
For example, when a communication abnormality occurs between the operation management server 102 and the router 117 in the configuration shown in FIG. 3, the network manager has to examine all of the operation management server 102, the router 117, and the switches 151 through 154.
The network manager can narrow down the positions that are plausible candidates for the failure through analysis and judgment, by combining the information on the communication abnormality with information indicating normal communication between other pairs of points. However, the accuracy of the judgment decreases when it is made by a human being, and narrowing down the candidate positions takes longer, which results in a longer wait for the network to recover from the abnormal state.
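The manual reasoning described above can be viewed as a set difference. The sketch below is illustrative only: the paths are hypothetical and are not taken from FIG. 3, and it assumes that every device on a normally completed path is healthy, which is a simplification.

```python
from typing import Iterable, List, Set


def narrow_candidates(abnormal_path: List[str],
                      normal_paths: Iterable[List[str]]) -> Set[str]:
    """Start from every device on the abnormal path, then remove devices
    that also lie on a path whose test communication completed normally."""
    candidates = set(abnormal_path)
    for path in normal_paths:
        candidates -= set(path)
    return candidates


# Hypothetical paths (assumption): one abnormal, one known to be normal.
abnormal = ["server-102", "sw-151", "sw-152", "router-117"]
normal = [["server-102", "sw-151", "router-118"]]

remaining = narrow_candidates(abnormal, normal)
print(sorted(remaining))
```

Note that this calculation requires the path of each communication to be known, which is exactly the information that methods (a) and (b) do not collect.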
(2) When communication between two points involves an abnormality and it is assumed that there is a plurality of physical communication paths between the two points, it cannot be ascertained which communication path contains the communication involving the abnormality.
Even in case (1), in which there is only one physical communication path between the two points, it is difficult to narrow down the positions that are plausible candidates for the failure. Further, intranets and the Internet are often configured to have a plurality of physical communication paths between two nodes. In such cases, the positions that are plausible candidates for the failure that caused the communication abnormality between the two points include all of the devices disposed on every physical communication path that could have been used for the communication. This means that a long time is required to resolve the communication abnormality, as described in case (1).
For example, when a communication abnormality occurs between the web server 132 and the application server 146 in FIG. 3, all the devices included in areas 301 through 303 are plausible candidates for having been involved in a failure.
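To illustrate how the candidate pool grows when several physical paths exist, the following sketch enumerates every loop-free path between two nodes in a small graph and takes the union of the devices on them. The topology and node names are hypothetical and much smaller than the network of FIG. 3.

```python
from typing import Dict, List, Set


def all_simple_paths(graph: Dict[str, List[str]], src: str, dst: str) -> List[List[str]]:
    """Enumerate all loop-free paths from src to dst by depth-first search."""
    paths: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        if node == dst:
            paths.append(path)
            return
        for nxt in graph.get(node, []):
            if nxt not in path:  # never revisit a device on the same path
                walk(nxt, path + [nxt])

    walk(src, [src])
    return paths


# Hypothetical redundant topology (assumption): two parallel paths.
topology = {
    "web": ["sw-a", "sw-b"],
    "sw-a": ["web", "fw-a"],
    "sw-b": ["web", "fw-b"],
    "fw-a": ["sw-a", "app"],
    "fw-b": ["sw-b", "app"],
    "app": ["fw-a", "fw-b"],
}

paths = all_simple_paths(topology, "web", "app")
candidates: Set[str] = set().union(*paths)
print(len(paths), sorted(candidates))
```

Without knowing which of the two paths carried the failed communication, every device on both paths stays in the candidate set.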
In the above case, if it were possible to investigate, after the abnormal communication is detected, which communication path was used for that communication, the positions that are plausible candidates for the failure could be narrowed down. However, this type of investigation is generally thought to be difficult, because the communication between the two points is already abnormal and so actual communication between them cannot be performed for the confirmation.
(3) When an abnormality is detected in a communication between two points, it is impossible to grasp the extent over which the abnormality has influence or its urgency with respect to services.
For example, a communication abnormality between two points could be detected in an intranet and there could be two networks between the two points, i.e., a network that is used for customer services and has a high importance, and a network that is used as a spare network when an abnormality occurs and has a low importance.
If the position at which the failure has occurred is in a device used for the network with a high importance, the situation has to be dealt with urgently because the failure influences customer services. In contrast, the influence of the abnormality is not extensive if the failure has occurred in a device used for the network with a low importance, and therefore the situation can often be dealt with later.
The network manager cannot determine whether or not the failure has occurred in the network with a high importance on the basis of only the information reporting that there is a communication abnormality between the two points. In practice, it often happens that even when the influence of a failure is not extensive and the situation does not have to be dealt with urgently, the situation is nevertheless dealt with urgently because the possibility of a serious failure is taken into consideration, and an unnecessarily high labor cost often results.
Additionally, a network system is known that locates the position in which a failure has occurred on the basis of alarm information issued by a constituent element in the network when the failure has occurred in the network (see, for example, Patent Document 1).
Patent Document 1: Japanese Patent Application Publication No. 2003-179601