Networks (for example, Internet Protocol (IP) networks) that connect a storage and many physical servers executing virtual servers (virtual machines) to each other have hitherto been known, such as a cloud system in which hardware resources are distributed over a plurality of data centers.
In the above-described networks, for example, when a failure occurs due to a breakdown or the like of a link, a physical server, or a storage, the influence of the failure on services which are provided through the network increases with an increase in the number of virtual servers in the network. Accordingly, when detecting the occurrence of a failure in the manager's own center, it is important for a manager of each data center (center) to rapidly specify the broken-down part and to evacuate any virtual server influenced by the failure by moving it to another physical server, thereby restoring the services.
FIG. 30 is a diagram illustrating an example of a method of detecting the occurrence of a failure and specifying a failure part in a center 101.
As depicted in FIG. 30, the center 101 includes a management device 110, a monitoring device 200, a switch 300, a router 400, a plurality of (e.g., four) server devices 500-1 to 500-4, and a storage 600.
The switch 300 is a device which connects the server devices 500-1 to 500-4 to the storage 600, and retains connection information indicating a connection relationship between the switch 300 and each device connected to the switch 300. The router 400 is a device which is connected to the switch 300 and also connected to another center 102 to relay a command or information such as data which is transferred between the center 101 and another center 102.
Through a local area network (LAN) cable or the like, the server devices 500-1 to 500-4 and the storage 600 are connected to the switch 300, and the switch 300 is connected to the router 400. The management device 110 and the monitoring device 200 are also connected to the switch 300 through a LAN cable or the like.
The server devices 500-1 to 500-4 each include hardware such as a central processing unit (CPU) and a memory to execute one or more virtual machines (VM) 501-1 to 501-4, respectively. The VMs 501-1 to 501-4 are virtual machines which are used for one user.
In addition, in the example depicted in FIG. 30, the VM 501-1 which is executed by the server device 500-1 is a VM for measurement, and performs a communication test on the other server devices 500-2 to 500-4 and the storage 600 using a Ping or the like.
The storage 600 is a hardware resource including one or more storage devices such as a hard disk drive (HDD) and is used by the VMs 501-1 to 501-4.
The VMs 501-1 to 501-4 and the storage 600 form a network for a user.
The management device 110 manages the VMs 501-1 to 501-4 which are executed in the center 101. For example, the management device 110 retains information indicating which server device among the server devices 500-1 to 500-4 the VMs 501-1 to 501-4 are accommodated in.
The monitoring device 200 is a device which specifies a failure part caused in the center 101. Specifically, the monitoring device 200 collects, from the switch 300 and the like, topology information of the center 101 in order to specify the failure part, and collects, from the management device 110, information indicating which server device among the server devices 500-1 to 500-4 the VMs 501-1 to 501-4 are accommodated in.
For example, in FIG. 30, the VM 501-1 for measurement executes a communication test on other VMs 501-2 to 501-4 (server devices 500-2 to 500-4) and the storage 600 on the basis of IP addresses in the network for a user (see the arrows F1 to F4).
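The communication test performed by the VM for measurement can be sketched as follows. This is a minimal illustration, not the method of the related art itself: the IP addresses (taken from the 192.0.2.0/24 documentation range), the names `ping_once`, `run_communication_test`, and `fake_probe`, and the use of the Linux iputils `ping` flags are all assumptions introduced here. The probe is passed in as a function so that the test logic can be exercised without a real network.

```python
import subprocess
from typing import Callable, Dict


def ping_once(address: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo request (assumes Linux iputils `ping` flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def run_communication_test(
    targets: Dict[str, str],
    probe: Callable[[str], bool] = ping_once,
) -> Dict[str, bool]:
    """Probe every target once and record whether communication was confirmed."""
    return {name: probe(address) for name, address in targets.items()}


# Hypothetical addresses in the network for a user (not given in the source).
TARGETS = {
    "500-2": "192.0.2.12",
    "500-3": "192.0.2.13",
    "500-4": "192.0.2.14",
    "600": "192.0.2.60",
}


def fake_probe(address: str) -> bool:
    """Stand-in probe that simulates a failure toward server device 500-3."""
    return address != "192.0.2.13"


results = run_communication_test(TARGETS, probe=fake_probe)
failed_targets = [name for name, ok in results.items() if not ok]
```

With the simulated probe, `failed_targets` contains only the server device 500-3, which corresponds to the arrow F3 of FIG. 30.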
Hereinafter, a case will be assumed in which, as a result of the communication test, the VM 501-1 for measurement confirms the communication with respect to the server devices 500-2 and 500-4 and the storage 600, but fails to confirm the communication with respect to the server device 500-3. Examples of the case in which the communication fails to be confirmed include a case in which a packet loss or a delay is observed in the Ping. In this case, the monitoring device 200 specifies a failure part using tomography analysis or the like, from the acquired topology information and the route (arrow F3 of FIG. 30) to the server device 500-3 for which the Ping has failed.
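The text names tomography analysis without describing it. One simple Boolean variant, shown here only as an illustrative assumption, treats every component on a successful route as healthy and then intersects what remains on the failed routes. The component labels (such as `port:500-3`) and the function name `localize_failure` are hypothetical.

```python
from typing import Dict, List, Optional, Set


def localize_failure(
    routes: Dict[str, List[str]],
    reachable: Dict[str, bool],
) -> Set[str]:
    """Return components on every failed route that are on no successful route."""
    # Any component traversed by a successful probe is assumed to be healthy.
    healthy: Set[str] = set()
    for target, ok in reachable.items():
        if ok:
            healthy.update(routes[target])

    # Intersect the remaining components across all failed routes.
    suspects: Optional[Set[str]] = None
    for target, ok in reachable.items():
        if not ok:
            on_route = set(routes[target]) - healthy
            suspects = on_route if suspects is None else suspects & on_route
    return suspects if suspects is not None else set()


# Routes from the VM for measurement, derived from hypothetical topology data.
ROUTES = {
    "500-2": ["port:500-1", "switch:300", "port:500-2"],
    "500-3": ["port:500-1", "switch:300", "port:500-3"],
    "500-4": ["port:500-1", "switch:300", "port:500-4"],
    "600": ["port:500-1", "switch:300", "port:600"],
}
REACHABLE = {"500-2": True, "500-3": False, "500-4": True, "600": True}

failure_part = localize_failure(ROUTES, REACHABLE)
```

Because the port of the server device 500-1 and the switch 300 both lie on successful routes, only the port of the server device 500-3 remains as a candidate. This variant assumes a single persistent fault; intermittent loss or several simultaneous faults would need a more careful analysis.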
Upon specifying a port of the server device 500-3 as a failure part, the monitoring device 200 specifies, on the basis of the information acquired from the management device 110, the VM 501-3 as a virtual server which is influenced by the failure, and specifies the user of the VM 501-3 as a user who is influenced by the failure. In addition, the monitoring device 200 outputs, to a display device (not depicted) or the like, information on the failure part, the user who is influenced by the failure, and the VM 501-3 which is a virtual server to be moved to another server device, and instructs the management device 110 to move the VM 501-3 to another server device.
In some cases, a virtual server is moved to the center 101 depicted in FIG. 30 from another center 102 connected through the router 400 by live migration or the like. Live migration means that a virtual server is moved to another server device, by a management system such as the management device 110, while continuing to operate.
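The lookup from a failure part to the influenced virtual server and user can be sketched as below. The placement table stands in for the information retained by the management device 110; the user label `user-A` and the names `Impact` and `find_impacted` are hypothetical and are introduced only for this example.

```python
from typing import Dict, List, NamedTuple


class Impact(NamedTuple):
    vm: str
    user: str
    failed_host: str


def find_impacted(
    failed_host: str,
    placement: Dict[str, str],
    owner: Dict[str, str],
) -> List[Impact]:
    """List every VM accommodated in the failed server device, with its user."""
    return [
        Impact(vm, owner[vm], failed_host)
        for vm, host in sorted(placement.items())
        if host == failed_host
    ]


# Which server device accommodates each VM (as retained by the management device).
PLACEMENT = {
    "501-1": "500-1",
    "501-2": "500-2",
    "501-3": "500-3",
    "501-4": "500-4",
}
# All four VMs are used for one user; the label is an assumption.
OWNER = {vm: "user-A" for vm in PLACEMENT}

impacted = find_impacted("500-3", PLACEMENT, OWNER)
```

Here `impacted` holds a single entry for the VM 501-3, which is the virtual server to be moved to another server device. How the destination server device is chosen is not described in the text, so it is left out of the sketch.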
FIG. 31 is a diagram illustrating an example of a case in which a virtual server is moved over two centers 103 and 104. In FIG. 31, since devices having reference numerals and symbols common to those in FIG. 30 have the same configurations as the devices depicted in FIG. 30, the description thereof will be omitted.
As depicted in FIG. 31, a case will be considered in which a VM 501-6 which uses a storage 600-2 of the second center 104 is moved from a server device 500-8 of the second center 104 to a server device 500-6 of the first center 103 (see the arrows (i) and (ii) of FIG. 31). In this case, a VM 501-5 for measurement also confirms communication with respect to the server device 500-6 which accommodates the VM 501-6 moved from the second center 104, and a monitoring device 200-1 specifies a failure part when a failure is detected.
As a related technique, there is known a technique in which, in a cluster system including an active device and a standby device, communication results are confirmed through a hypervisor using a disk monitoring function of each of the active device and the standby device (for example, see Japanese Laid-open Patent Publication No. 2012-14674). In this technique, when the confirmation of the communication results causes the disk monitoring function to output an error message, communication with the router is confirmed through the hypervisor using a network monitoring function.