1. Technical Field
The present invention relates to a trouble analysis apparatus of a computer system. Particularly, the present invention relates to a trouble analysis apparatus for determining a cause of a trouble when the computer system has trouble.
Priority is claimed on Japanese Patent Application No. 2008-201272, filed Aug. 4, 2008, the content of which is incorporated herein by reference.
2. Background Art
Today, many computer systems are constituted from hardware and software. In addition, such hardware and software are each constituted from elements (hereinafter, “functional elements”) that implement respective functions. In other words, each apparatus including software constitutes multiple layers from hardware to software. The functional elements belonging to each layer cooperate with each other and implement functions as one apparatus (an apparatus with multiple layers including software is called a “multilayer apparatus”). In addition, a system constituted from multiple multilayer apparatuses (hereinafter, a “multilayer system”) is generally used.
FIG. 10 is a drawing for explaining a multilayer system 1 which is an example of the multilayer system and which is a type of client-server system. The multilayer system 1 is constituted from a server 10, a client 20 and a switch 30 as shown in FIG. 10. The server 10, the client 20 and the switch 30 are examples of multilayer apparatuses.
The server 10 is constituted from a computer hardware 100, an operating system 110 and a server application 120. In other words, the server 10 is constituted from three layers including a hardware layer implemented by the computer hardware 100, an operation layer implemented by the operating system 110 and an application layer implemented by the server application 120.
The client 20 is constituted from a computer hardware 200, an operating system 210 and a client application 220. In other words, the client 20 is constituted from three layers including a hardware layer implemented by the computer hardware 200, an operation layer implemented by the operating system 210 and an application layer implemented by the client application 220.
The switch 30 is constituted from a switch hardware 300, an operating system 310 and a switch application 320. In other words, the switch 30 is constituted from three layers including a hardware layer implemented by the switch hardware 300, an operation layer implemented by the operating system 310 and an application layer implemented by the switch application 320.
The computer hardware 100 of the server 10 is constituted from a network card 101, a HDD 102, a CPU 103, a main memory 104 and a trouble monitoring portion 109. The network card 101, the HDD 102, the CPU 103 and the main memory 104 are functional elements that belong to the computer hardware 100. It should be noted that the trouble monitoring portion 109 monitors the network card 101, the HDD 102, the CPU 103 and the main memory 104, and when the trouble monitoring portion 109 detects trouble in such functional elements, a trouble notification (error notification information) is transmitted to a trouble analysis apparatus 40 (FIG. 11, explained below).
The operating system 110 of the server 10 is constituted from a network driver 111, a HDD driver 112, a network protocol 113, a memory management portion 114 and a trouble monitoring portion 119. The network driver 111, the HDD driver 112, the network protocol 113 and the memory management portion 114 are functional elements belonging to the operating system 110 of the server 10. It should be noted that the trouble monitoring portion 119 monitors the network driver 111, the HDD driver 112, the network protocol 113 and the memory management portion 114 and transmits a trouble notification to the trouble analysis apparatus 40 when detecting trouble regarding such functional elements. Further, the network protocol 113 is a processing portion for processing, for example, TCP/IP operations, or is a management portion (management program) of the processing portion. A network protocol 213 (described below) is the same as the network protocol 113.
The server application 120 of the server 10 is constituted from an application processing portion 121 and a trouble monitoring portion 129. The application processing portion 121 is a functional element belonging to the server application 120 of the server 10. It should be noted that the trouble monitoring portion 129 monitors the application processing portion 121 and transmits a trouble notification to the trouble analysis apparatus 40 when detecting trouble.
The computer hardware 200 of the client 20 is constituted from a network card 201, a HDD 202, a CPU 203, a main memory 204 and a trouble monitoring portion 209. The network card 201, the HDD 202, the CPU 203 and the main memory 204 are functional elements belonging to the computer hardware 200 of the client 20. It should be noted that the trouble monitoring portion 209 monitors the network card 201, the HDD 202, the CPU 203 and the main memory 204 and transmits a trouble notification to the trouble analysis apparatus 40 when detecting a trouble regarding such functional elements.
The operating system 210 of the client 20 is constituted from a network driver 211, a HDD driver 212, a network protocol 213, a memory management portion 214 and a trouble monitoring portion 219. The network driver 211, the HDD driver 212, the network protocol 213 and the memory management portion 214 are functional elements belonging to the operating system 210 of the client 20. It should be noted that the trouble monitoring portion 219 monitors the network driver 211, the HDD driver 212, the network protocol 213 and the memory management portion 214 and transmits a trouble notification to the trouble analysis apparatus 40 when detecting trouble regarding such functional elements.
The client application 220 of the client 20 is constituted from an application processing portion 221 and a trouble monitoring portion 229. The application processing portion 221 is a functional element belonging to the client application 220 of the client 20. It should be noted that the trouble monitoring portion 229 monitors the application processing portion 221 and transmits a trouble notification to the trouble analysis apparatus 40 when detecting trouble.
The switch hardware 300 of the switch 30 is constituted from network interfaces (NWI/F) 301-303, a switch fabric 304, a CPU 305, a memory 306 and a trouble monitoring portion 309. The network interfaces 301-303, the switch fabric 304, the CPU 305 and the memory 306 are functional elements belonging to the switch hardware 300 of the switch 30. It should be noted that the trouble monitoring portion 309 monitors the network interfaces 301-303, the switch fabric 304, the CPU 305 and the memory 306 and transmits a trouble notification to the trouble analysis apparatus 40 when detecting trouble regarding such functional elements.
The operating system 310 of the switch 30 is constituted from a switch driver 311, a memory management portion 312 and a trouble monitoring portion 319. The switch driver 311 and the memory management portion 312 are functional elements belonging to the operating system 310 of the switch 30. It should be noted that the trouble monitoring portion 319 monitors the switch driver 311 and the memory management portion 312 and transmits a trouble notification to the trouble analysis apparatus 40 when detecting trouble regarding such functional elements.
The switch application 320 of the switch 30 is constituted from a routing protocol 321 and a trouble monitoring portion 329. The routing protocol 321 is a functional element belonging to the switch application 320 of the switch 30. It should be noted that the trouble monitoring portion 329 monitors the routing protocol 321 and transmits a trouble notification to the trouble analysis apparatus 40 when detecting trouble. Further, the routing protocol 321 is a processing portion (a computer program which is designed to implement the RIP2 operations) for processing, for example, RIP2 operations or is a management portion (management program) of the processing portion.
A cable 2 connects the network card 101 of the server 10 with the network interface 301 of the switch 30. A cable 3 connects the network card 201 of the client 20 with the network interface 303 of the switch 30. In addition, a system configuration management apparatus 60 is connected to one end of a cable 4 which is connected to the network interface 302 of the switch 30. It should be noted that the system configuration management apparatus 60 conducts management operations of updating the system configuration of the multilayer system 1. Therefore, for example, the cable 4 is used for transmitting and receiving information, such as system setting information transmitted from the system configuration management apparatus 60.
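The configuration described above can be summarized as a nested data structure. The following is only an illustrative sketch: the dictionary layout and the helper name `locate` are hypothetical and do not appear in the drawings, and the trouble monitoring portions are left out because they are not functional elements.

```python
# Hypothetical model of the multilayer system 1 of FIG. 10:
# apparatus -> layer -> functional elements (monitoring portions omitted).
MULTILAYER_SYSTEM = {
    "server 10": {
        "computer hardware 100": ["network card 101", "HDD 102", "CPU 103", "main memory 104"],
        "operating system 110": ["network driver 111", "HDD driver 112",
                                 "network protocol 113", "memory management portion 114"],
        "server application 120": ["application processing portion 121"],
    },
    "client 20": {
        "computer hardware 200": ["network card 201", "HDD 202", "CPU 203", "main memory 204"],
        "operating system 210": ["network driver 211", "HDD driver 212",
                                 "network protocol 213", "memory management portion 214"],
        "client application 220": ["application processing portion 221"],
    },
    "switch 30": {
        "switch hardware 300": ["network interface 301", "network interface 302",
                                "network interface 303", "switch fabric 304",
                                "CPU 305", "memory 306"],
        "operating system 310": ["switch driver 311", "memory management portion 312"],
        "switch application 320": ["routing protocol 321"],
    },
}

# Cables join functional elements of different apparatuses.
CABLES = {
    "cable 2": ("network card 101", "network interface 301"),
    "cable 3": ("network card 201", "network interface 303"),
}


def locate(element):
    """Return the (apparatus, layer) that a functional element belongs to, or None."""
    for apparatus, layers in MULTILAYER_SYSTEM.items():
        for layer, elements in layers.items():
            if element in elements:
                return apparatus, layer
    return None
```

In this sketch a cable is not a functional element of any layer, so `locate("cable 2")` returns `None`.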
As described above, in the multilayer system 1, the functional elements belonging to the layers of the server 10, the client 20 and the switch 30 conduct operations by cooperating with each other. In addition, when there is trouble in one functional element of the multilayer system 1, the trouble spreads to related functional elements, and each of the trouble monitoring portions that detects the trouble transmits a trouble notification to the trouble analysis apparatus 40.
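The spreading of trouble can be pictured as a traversal of a dependency graph. The sketch below is an assumption-laden illustration: the `AFFECTS` relations between elements and the function name `notifications_for` are hypothetical and are not stated in this description; only the element names and the cable 2 connection are taken from it.

```python
from collections import deque

# Hypothetical "trouble in X affects Y" relations (assumed for illustration only).
AFFECTS = {
    "cable 2": ["network card 101", "network interface 301"],
    "network card 101": ["network driver 111"],
    "network driver 111": ["network protocol 113"],
    "network protocol 113": ["application processing portion 121"],
}

# Elements watched by a trouble monitoring portion. The cable itself is not monitored.
MONITORED = {
    "network card 101", "network interface 301", "network driver 111",
    "network protocol 113", "application processing portion 121",
}


def notifications_for(root_cause):
    """Breadth-first spread of trouble; returns the monitored elements that report it."""
    seen = {root_cause}
    queue = deque([root_cause])
    reported = []
    while queue:
        element = queue.popleft()
        if element in MONITORED:
            reported.append(element)
        for affected in AFFECTS.get(element, []):
            if affected not in seen:
                seen.add(affected)
                queue.append(affected)
    return reported


notified = notifications_for("cable 2")
```

Under these assumptions a single fault in the cable 2 produces five notifications, none of which names the cable, which is why the notifications alone do not identify the trouble spot.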
FIG. 11 is a drawing for explaining a trouble analysis apparatus 40 which is an example of a conventional analysis apparatus. FIG. 12 is a drawing showing an example of information stored in a trouble analysis table 403 included in the trouble analysis apparatus 40. The trouble analysis apparatus 40 includes a trouble collection portion 401, a trouble searching portion 402, the trouble analysis table 403 and a trouble notification portion 404. The trouble collection portion 401 collects trouble notifications from the multilayer system 1. As shown in FIG. 12, the trouble analysis table 403 is constituted from entries, each including multiple trouble notification sources 1, 2, 3, . . . , N and presumable specific trouble spots 1 and 2. For example, TABLE NO. 1 shows that when trouble notifications are received from both the network interface 301 and the network card 101, the presumed trouble spot is the cable 2 or the switch hardware 300. The trouble searching portion 402 conducts a trouble spot presuming search operation in reference to the trouble analysis table 403 based on the collected trouble notifications. The trouble notification portion 404 notifies the operator of both the trouble notifications and presumed-specific-trouble-point information that shows the trouble source functional element (the functional element which is the source of the trouble) found by the trouble searching portion 402. In other words, regarding the trouble analysis apparatus 40, operators set the trouble analysis table 403 beforehand based on past experience (inputting presumed trouble spots for combinations of multiple troubles), and the trouble spot is shown by using the trouble analysis table 403 when trouble occurs.
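A table lookup of this kind might be sketched as follows. Only the single entry (TABLE NO. 1) comes from the description; the matching rule (an entry matches when all of its notification sources have reported, and the first matching entry wins) and the names `TROUBLE_ANALYSIS_TABLE` and `search_trouble_spots` are assumptions made for illustration.

```python
# Each entry pairs a set of trouble notification sources with presumed trouble spots.
TROUBLE_ANALYSIS_TABLE = [
    {   # TABLE NO. 1 of FIG. 12
        "sources": frozenset({"network interface 301", "network card 101"}),
        "presumed_spots": ["cable 2", "switch hardware 300"],
    },
]


def search_trouble_spots(notified_sources, table=TROUBLE_ANALYSIS_TABLE):
    """Return the presumed trouble spots of the first entry whose sources all reported.

    Assumed rule: an entry matches when its source set is a subset of the
    notifications collected so far. An empty list means no entry matched.
    """
    notified = set(notified_sources)
    for entry in table:
        if entry["sources"] <= notified:
            return list(entry["presumed_spots"])
    return []
```

A combination of notifications that the operators did not anticipate returns an empty result in this sketch, which reflects the dependence of the table on past experience.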
The conventional trouble analysis method is a method in which an operator recognizes a trouble notification and determines a trouble spot based on his experiences, or in which, as described in the trouble analysis apparatus 40 above, an operator defines specific trouble spots corresponding to multiple trouble events beforehand so as to improve efficiency of operations of detecting trouble spots.
FIG. 13 is a flowchart showing an example of operations of the trouble analysis apparatus 40 which is an example of a conventional analysis apparatus. FIG. 14 is a flowchart showing operations conducted by the operator after notification by the trouble analysis apparatus 40 regarding a trouble spot, that is, an operation flow from a system operation step to detection of a trouble spot after occurrence of trouble.
First, to start the trouble analysis apparatus 40, a setting operation on the trouble analysis table 403 of the trouble analysis apparatus 40 is conducted by the operator so as to conform to a system. In other words, the trouble analysis table 403 of the trouble analysis apparatus 40 is generated based on input by the operator (step S101). After this, the trouble analysis apparatus 40 is in an operating state.
While the trouble analysis apparatus 40 is in an operating state, if a system configuration is changed by using the system configuration management apparatus 60, operation of setting the trouble analysis table 403 of step S101 is conducted again.
While the trouble analysis apparatus 40 is in an operating state, if trouble occurs in the multilayer system 1, the trouble spreads between the functional elements, and multiple troubles are detected through trouble monitoring operations of the multilayer system 1. After this, multiple trouble notifications are transmitted from the switch 30 to the trouble analysis apparatus 40. In other words, the trouble collection portion 401 collects (receives) multiple trouble notifications from the switch 30 (step S102).
The trouble searching portion 402 determines the trouble spot in reference to the trouble notifications and the trouble analysis table 403. The trouble notification portion 404 of the trouble analysis apparatus 40 notifies the operator of both the trouble notifications and the presumed-specific-trouble-point information (step S106).
The trouble notification portion 404 of the trouble analysis apparatus 40 shows the operator both the trouble notifications and the presumed-specific-trouble-point information. As shown in FIG. 14, if the trouble spot notified by using the presumed-specific-trouble-point information is true (appropriate), the operator conducts recovery operations on the corresponding trouble spot (operation 1). However, if the trouble spot notified by using the presumed-specific-trouble-point information is not true, the operator conducts checking operations on the trouble spots regarding all trouble notifications (operation 2). First, the operator conducts a trouble spot determination operation (operation 4) by conducting checking operations (operation 3) on the hardware, on the software and on the inside of each layer.
If the trouble spot determined in a step of the operation 4 is true, the operator conducts recovery operations on the trouble spot (operation 5). However, if the trouble spot determined in a step of the operation 4 is not true, the operator conducts checking operations (operation 6) in consideration of the relationship between layers, changes the layer on which the checking operations are conducted and conducts checking operations (operation 3) again on the changed layer. In accordance with such operations, the operator repeatedly conducts checking operations on the inside of the layers and checking operations based on the relationship between layers, narrows the range of suspicious trouble spots and determines the trouble spot (operation 4).
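The sequence of steps above (table setting, collection, search and notification) could be sketched as a small class. This is an illustrative sketch only: the class name, method names and report layout are hypothetical, and the subset-matching rule used in the search is an assumption rather than something stated for FIG. 13.

```python
from dataclasses import dataclass, field


@dataclass
class TroubleAnalysisApparatus:
    table: list = field(default_factory=list)      # trouble analysis table 403
    collected: list = field(default_factory=list)  # notifications held by collection portion 401

    def set_table(self, entries):
        """Step S101: the operator generates the table; repeated after a configuration change."""
        self.table = [(frozenset(sources), list(spots)) for sources, spots in entries]

    def collect(self, source):
        """Step S102: receive one trouble notification."""
        self.collected.append(source)

    def notify(self):
        """Search the table, then return what the operator is shown (step S106)."""
        notified = set(self.collected)
        presumed = []
        for sources, spots in self.table:
            if sources <= notified:  # assumed rule: every listed source has reported
                presumed = spots
                break
        return {
            "trouble_notifications": list(self.collected),
            "presumed_trouble_spots": list(presumed),
        }


apparatus = TroubleAnalysisApparatus()
apparatus.set_table([({"network interface 301", "network card 101"},
                      ["cable 2", "switch hardware 300"])])
apparatus.collect("network interface 301")
apparatus.collect("network card 101")
report = apparatus.notify()
```

Because `set_table` replaces the whole table, a system configuration change in this sketch requires the operator to supply every entry again, mirroring the repetition of step S101.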
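The operator's procedure of FIG. 14 is essentially a loop over layers. The sketch below expresses that control flow under stated assumptions: the function name, its callback parameters, the `max_rounds` limit and the sample data are all hypothetical, and the callbacks stand in for manual work by the operator.

```python
def operator_procedure(presumed_spot, is_actual_spot, check_layer,
                       start_layer, related_layer, max_rounds=10):
    """Return (trouble spot or None, log of operations conducted)."""
    log = []
    if presumed_spot is not None and is_actual_spot(presumed_spot):
        log.append("operation 1")          # recover the presumed trouble spot
        return presumed_spot, log
    log.append("operation 2")              # check all trouble notifications
    layer = start_layer
    for _ in range(max_rounds):
        log.append("operation 3")          # check inside the current layer
        candidate = check_layer(layer)
        log.append("operation 4")          # determine a trouble spot
        if candidate is not None and is_actual_spot(candidate):
            log.append("operation 5")      # recover the determined trouble spot
            return candidate, log
        log.append("operation 6")          # consider the relationship between layers
        layer = related_layer(layer)
    return None, log


# Hypothetical scenario: the presumed spot is wrong and the real fault is in hardware.
suspects = {
    "server application 120": None,
    "operating system 110": "network driver 111",
    "computer hardware 100": "network card 101",
}
below = {
    "server application 120": "operating system 110",
    "operating system 110": "computer hardware 100",
}
spot, log = operator_procedure(
    presumed_spot="cable 2",
    is_actual_spot=lambda s: s == "network card 101",
    check_layer=suspects.get,
    start_layer="server application 120",
    related_layer=lambda layer: below.get(layer, layer),
)
```

In this scenario the operator checks three layers in turn before reaching the actual trouble spot, which illustrates how the effort grows when the presumed spot is wrong.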
It should be noted that Patent Document 1 describes an apparatus which presumes the range of the area affected by trouble on a system constituted from multiple multilayer apparatuses. In addition, Patent Document 2 describes an analysis technique which, when a system trouble is raised, analyzes the range of the area affected by the system trouble.
[Patent Document 1] Japanese Patent Application, First Publication No. 2000-069003
[Patent Document 2] Japanese Patent Application, First Publication No. 2005-258501