With the growing importance of Web services and other information and communication services as social infrastructure, the stable operation of the apparatuses that provide these services (hereinafter referred to as service systems) has become important. The operation management of such service systems has heretofore been conducted manually by administrators. As service systems increase in scale and complexity, however, the knowledge and operational burden required of administrators have increased dramatically, causing problems such as service suspension due to a decision error or an operation error.
In an integrated operation management system for monitoring the states of hardware and software and controlling the same in a centralized fashion, event data (state notifications) acquired from a plurality of devices is automatically analyzed for combinations of abnormal states and the like so that macroscopic problems and causes can be estimated and notified to the administrator to support taking actions.
An example of the operation management apparatus according to the related art is described in Patent Document 1.
The operation management apparatus of the related art retains operation rule information that defines actions and the like corresponding to specified conditions with combinations of event data occurring in succession as the conditions. The event data is data that shows the state of a service system. If occurring event data satisfies a condition specified by operation rule information, the operation management apparatus performs actions and the like according to the operation rule information.
In this way, the operation management apparatus monitors and takes actions against abnormal states of the service system that are assumed in advance. The use of such an operation management apparatus reduces the operation burdens of the administrator significantly as compared to the case of manually monitoring and handling a large number of devices. Automatically taking actions and the like in accordance with the operation rule information allows constant monitoring and handling irrespective of the experience and skills of the administrator, and thus improves the quality of the operation management.
FIG. 25 is a block diagram showing the configuration of an operation management apparatus of the related art that is commonly known. The operation management apparatus 41 illustrated in FIG. 25 includes an event data acquisition section 2, an event accumulation section 7, an event analysis section 4, a decision condition accumulation section 3, a user interaction section 5, and a control section 6. The operation management apparatus 41 is connected to a service system 1. The operation management apparatus 41 and the service system 1 may be connected through a communication line or a communication network. The operation management apparatus 41 and the service system 1 may be configured as a single unit.
The service system 1 is an information processing apparatus or the like that provides information and communication services such as a Web service and a business service. For example, the service system 1 transmits a Web page to a terminal (not shown) in response to a request from the terminal, and executes business processing in response to a request from the terminal. The business services that the service system 1 provides are not particularly limited.
The service system 1 includes an event data generating section 71 which generates event data. The event data is information that indicates the state of the service system 1. For example, the event data indicates the state of hardware equipped on the service system 1 or that of software implemented on the service system 1, the result of processing performed by the service system 1, and so on.
The event data generating section 71 monitors the states of hardware and software, and generates information indicating the states as event data. The event data generating section 71 transmits the generated event data to the operation management apparatus 41. The state of the service system 1 indicated by each individual piece of event data will be referred to as an event.
The event accumulation section 7 is a storage unit that stores event data that is successively generated according to changes of the operating state of the service system 1.
The event data acquisition section 2 receives the event data that is successively generated and transmitted by the event data generating section 71, and stores the received event data into the event accumulation section 7. When the event data acquisition section 2 receives new event data from the service system 1, the event data acquisition section 2 outputs the event data to the event analysis section 4.
The decision condition accumulation section 3 is a storage unit that stores operation rule information. The operation rule information defines conditions for extracting event data from a large number of pieces of event data.
The event analysis section 4 refers to the operation rule information to extract event data that matches a condition. The event data to be extracted is event data that the administrator would like to monitor. What event data to extract is defined by the administrator in advance. Examples of the event data to be extracted include event data that indicates a precursory state of a fault, event data that indicates the occurrence of a fault, and event data that indicates an item to be checked periodically. The conditions for extracting event data that is previously defined to be the event data to be extracted are determined by the administrator, and operation rule information that indicates the conditions is created by the administrator. The decision condition accumulation section 3 stores the operation rule information.
When event data is input from the event data acquisition section 2, the event analysis section 4 refers to the operation rule information stored in the decision condition accumulation section 3 and decides whether or not the event data matches the conditions defined by the operation rule information. The operation rule information sometimes describes processing to be performed when there is event data that matches a condition. Note that the operation rule information sometimes includes no description of such processing.
The event analysis section 4 outputs the result of analysis (the result of decision) to the user interaction section 5. Here, the event analysis section 4 also outputs to the user interaction section 5 the event data that is decided to match a condition and the operation rule information that is used for making the decision.
The user interaction section 5 includes input devices such as a keyboard and a mouse, and a display unit, for example. The user interaction section 5 displays information to the administrator of the service system 1, and accepts operations from the administrator.
For example, the user interaction section 5 displays the result of analysis from the event analysis section 4, the event data that is decided to match the condition, and the operation rule information that is used for making the decision. The user interaction section 5 modifies the operation rule information stored in the decision condition accumulation section 3 according to interactive inputs from the administrator. The user interaction section 5 also outputs a control instruction on the service system 1 to the control section 6 according to interactive inputs from the administrator.
The control section 6 controls the service system 1 based on the control instruction from the user interaction section 5.
Next, the event data will be described. FIG. 26 is an explanatory diagram showing an example of the event data that the conventional operation management apparatus 41 illustrated in FIG. 25 receives from the service system 1 and stores. FIG. 26B shows an example of the event data. FIG. 26A shows data (hereinafter, referred to as event list) that lists pieces of event data that are acquired in succession along with changes of state of the service system 1.
In the example shown in FIG. 26, each individual piece of event data includes: a number for uniquely identifying the event data acquired; information that indicates the type of the event data; the date and time of occurrence; information that indicates the source hardware or software; an ID that indicates the format of the event data; a user name involved in the occurrence of the event; and a supplemental description of the event. The event data may also include other information. The event list 100 illustrated in FIG. 26 is information that lists the foregoing information (the pieces of information such as the number) extracted from each piece of event data. It should be noted that the event data need not include the number, whereas the following description deals with an example where the number is included for convenience's sake. The number for uniquely identifying event data is added to the event data that is received by the event data acquisition section 2.
The event data 101 illustrated in FIG. 26B shows an example of description of the event data that has number E9005 among the pieces of event data shown in the event list 100. The present example deals with the case where the event data 101 is a text file in which the names of attributes that indicate the state of the service system, such as TYPE and SERVER, and the values thereof (attribute values) are linked by equal signs (=).
The event list 100 is generated by interpreting the description of the names of the respective attributes and the values thereof included in the event data. More specifically, the event data acquisition section 2 can create an event list by extracting the values of the respective attributes such as TYPE and SERVER from the received event data, associating with one another the values that are extracted from a single piece of event data, and adding the values to the event list 100 one piece of event data at a time.
In the example shown in FIG. 26, the type in the event list is created from the TYPE attribute of the event data. The date and time field in the event list is created from DATE1 (date) and TIME1 (time) of the event data. The source field in the event list is created from SERVER and SOURCE of the event data. Similarly, the user in the event list is created from USER of the event data.
In FIG. 26, the character string “Information” in the event list is derived from “INFORMATION” that is written as the TYPE attribute of the event data 101, and included into the event list. Character strings corresponding to the attribute values of the respective attributes included in such event data 101 may be defined in advance so that those character strings are included in the event list.
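The conversion from a text-format piece of event data to one row of the event list can be sketched as follows. This is only an illustrative sketch: FIG. 26 is not reproduced here, so the sample attribute values, the label mapping TYPE_LABELS, the "SERVER/SOURCE" layout of the source field, and the function names parse_event_text and to_event_list_row are all assumptions.

```python
# Hypothetical event data text. The attribute names follow the description;
# the values are invented for illustration only.
SAMPLE_EVENT_TEXT = """\
TYPE=INFORMATION
DATE1=2007/03/01
TIME1=10:15:00
SERVER=SV01
SOURCE=BIZAP
EVENTID=8001
USER=operator
"""

# Assumed mapping from TYPE attribute values to the strings shown in the list.
TYPE_LABELS = {"INFORMATION": "Information", "WARNING": "Warning", "ERROR": "Error"}


def parse_event_text(text):
    """Split 'NAME=value' lines into a dict of attribute names and values."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        attrs[name.strip()] = value.strip()
    return attrs


def to_event_list_row(number, attrs):
    """Build one event-list row from the attributes of a single event."""
    raw_type = attrs.get("TYPE", "")
    return {
        "number": number,
        "type": TYPE_LABELS.get(raw_type, raw_type),
        # Date and time field: DATE1 and TIME1 combined.
        "datetime": f"{attrs.get('DATE1', '')} {attrs.get('TIME1', '')}".strip(),
        # Source field: SERVER and SOURCE combined (separator is an assumption).
        "source": f"{attrs.get('SERVER', '')}/{attrs.get('SOURCE', '')}",
        "event_id": attrs.get("EVENTID", ""),
        "user": attrs.get("USER", ""),
    }


row = to_event_list_row("E9005", parse_event_text(SAMPLE_EVENT_TEXT))
```

In this sketch the number is supplied by the caller, which mirrors the statement that the number is added on the receiving side rather than carried in the event data itself.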
While FIG. 26 illustrates event data that indicates typical items for indicating the state of a computer, the event data may contain other information. The event data indicates the state of the service system 1 with the attributes and the values of the attributes in association with each other. The event data may be either data in a text format or data in a binary format.
The event data acquisition section 2 may store all the received event data into the event accumulation section 7 as event-specific files such as illustrated in FIG. 26B. Otherwise, the event data acquisition section 2 may create the event list of table form illustrated in FIG. 26A from all the event data received, and store the pieces of event data into the event accumulation section 7 in the form of the event list. The following description deals with an example where the event data acquisition section 2 stores the pieces of event data into the event accumulation section 7 in the form of an event list.
FIG. 27 is an explanatory diagram showing an example of the operation rule information and filter information that are stored in the decision condition accumulation section 3. FIG. 27A shows an example of the operation rule information. FIG. 27B shows an example of the filter information. The filter information is information that indicates the condition of event data for the event analysis section 4 to extract.
The condition under which the event analysis section 4 extracts event data is not necessarily defined by only a single piece of filter information; the condition is sometimes defined by a plurality of pieces of filter information. In the operation rule information, the condition under which the event analysis section 4 extracts event data is written as a single piece or a combination of a plurality of pieces of filter information. A single piece or a combination of a plurality of pieces of filter information that is/are written as the condition to extract event data in the operation rule information will be referred to as a combined condition.
A single piece or a combination of a plurality of pieces of filter information may be directly written in the operation rule information, whereas description will be given of an example where identification information on the filter information is written in the operation rule information as a combined condition. The filter information is created by the administrator in advance, and stored in the decision condition accumulation section 3.
Like the event data, the filter information includes attributes such as “SOURCE” and the values of the attributes. Note that attribute values need not necessarily be defined for all the attributes included in the filter information.
In filter information 202 illustrated in FIG. 27B, the attribute “SOURCE,” which indicates a piece of software, has an attribute value “BIZAP,” business software. The attribute “EVENTID,” which indicates the ID of the format of the event data, has an attribute value “8000.” The other attributes have no attribute value defined. If all the values of the attributes defined in the filter information match those of the attributes included in the event data, the event analysis section 4 decides that the event data satisfies the condition shown by the filter information.
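The matching decision described above can be sketched as follows. The representation of event data and filter information as dictionaries, the use of None for an attribute with no value defined, the function name matches_filter, and the server name "SV01" are assumptions made for illustration.

```python
def matches_filter(event_attrs, filter_attrs):
    """Return True if every attribute value defined in the filter equals
    the value of the same attribute in the event data.

    Attributes whose filter value is None (no value defined) are ignored.
    """
    return all(
        event_attrs.get(name) == value
        for name, value in filter_attrs.items()
        if value is not None
    )


# Filter information 202 (F0012) as described: SOURCE and EVENTID are defined,
# the other attributes are left undefined.
F0012 = {"SOURCE": "BIZAP", "EVENTID": "8000", "SERVER": None, "USER": None}

# Hypothetical pieces of event data.
event_match = {"SOURCE": "BIZAP", "EVENTID": "8000", "SERVER": "SV01"}
event_other = {"SOURCE": "BIZAP", "EVENTID": "8001", "SERVER": "SV01"}

result_match = matches_filter(event_match, F0012)
result_other = matches_filter(event_other, F0012)
```

Here result_match is true because both defined attributes agree, while result_other is false because the EVENTID values differ.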
As illustrated in FIG. 27A, each piece of operation rule information belonging to the group of operation rules (set of operation rule information) 200 includes: a number that identifies the operation rule information; a combined condition; the processing content to be performed when the event data is decided to match the condition; and a description of the operation rule.
For example, in the group of operation rules 200 illustrated in FIG. 27A, the operation rule information that is numbered R0120 shows that a command Mig($F0012.SOURCE,$SV(NOUPDATE)) will be executed as an action if the results of decision between the event data and the condition F0011 (see filter information 201 shown in FIG. 27) and the condition F0012 (see filter information 202) both are true, i.e., the event data matches both the conditions F0011 and F0012.
The operation rule information R0120 illustrated here provides a rule for detecting event data that is generated when business software on a computer included in the service system 1 causes a failure after automatic update of the software. The operation rule information R0120 also describes a command that moves the business software to another computer that is not updated.
The condition F0011 shown in the filter information 201 is that “the SOURCE attribute value in the event data is ‘updater’ and the EVENTID attribute value is 4001.” Consequently, when such event data is detected, the event analysis section 4 decides that the event data matches the condition F0011 (the decision on the condition F0011 is true).
An example of the event data that matches the condition F0011 is the event data E9002 shown in FIG. 26. The condition F0012 shown in FIG. 27 is that “SOURCE in the event data is ‘BIZAP’ and EVENTID is ‘8000.’” For example, the event data E9004 illustrated in FIG. 26 matches such a condition. Consequently, for example, when the event data E9004 shown in FIG. 26 is input from the event data acquisition section 2, the event analysis section 4 decides that the event data matches the condition F0012 (the decision on the condition F0012 is true).
The foregoing condition F0011 is a condition for extracting event data that indicates that software was updated. The condition F0012 is a condition for extracting event data that indicates that business software called BIZAP resulted in an error.
The operation rule information R0120 which has the conditions F0011 and F0012 as its combined condition (see FIG. 27A) shows the action to move (Mig( )) the business software described in the SOURCE attribute value of the event data that matches the condition F0012 ($F0012.SOURCE) to a computer having the NOUPDATE attribute value that indicates the absence of update ($SV(NOUPDATE)).
Here, a character string with a leading $ symbol represents a variable, which shows that the value of the information is determined by the actual event data or an additional processing function when the event analysis section 4 makes a decision on the operation rule information. The value determined by an additional processing function refers to a value that is determined by other than event data, such as “the current time.” Hereinafter, description will be given on the assumption that the $ symbol represents a variable.
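How such variables might be resolved when an action is performed can be sketched as follows. The description does not specify the resolution mechanism, so the function resolve_command, the lookup find_server, and the server name "SV02" are hypothetical; only the command string and the meaning of the two variables are taken from the description.

```python
import re


def resolve_command(template, matched_events, find_server):
    """Replace $Fnnnn.ATTR and $SV(ATTR) variables in an action command.

    matched_events maps a filter number to the attributes of the event data
    that matched it; find_server is an additional processing function that
    returns the name of a computer having the given attribute.
    """
    def event_var(m):
        # $F0012.SOURCE -> SOURCE value of the event that matched F0012
        return matched_events[m.group(1)][m.group(2)]

    def server_var(m):
        # $SV(NOUPDATE) -> a computer that has the NOUPDATE attribute
        return find_server(m.group(1))

    resolved = re.sub(r"\$(F\d+)\.(\w+)", event_var, template)
    resolved = re.sub(r"\$SV\((\w+)\)", server_var, resolved)
    return resolved


# Hypothetical lookup: pretend that SV02 is the only computer not updated.
def find_server(attribute):
    return {"NOUPDATE": "SV02"}[attribute]


command = resolve_command(
    "Mig($F0012.SOURCE,$SV(NOUPDATE))",
    {"F0012": {"SOURCE": "BIZAP", "EVENTID": "8000"}},
    find_server,
)
```

With these assumed inputs, the command resolves to Mig(BIZAP,SV02).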
Now, the operation rule information R0130 shows that a command MailTo(operator) will be executed as an action if the event data matches not-shown filter information F0013 for detecting a job failure. “MailTo(operator)” is a command to send a mail notification to the administrator.
As described above, the operation rule information shows actions to be taken when event data is decided to match combined conditions.
In FIG. 27A, the operation rule information is exemplified by general rules written in an if-then form. The operation rule information is not limited to such rules. For example, information may be extracted by using a typical structural analysis method such as regular expressions, instead of filter information written as combined conditions.
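As one possible illustration of the regular-expression alternative, a condition equivalent to F0012 could be applied directly to text-format event data. The pattern below and the sample texts are assumptions; the description only states that regular expressions may be used, not how.

```python
import re

# Hypothetical pattern standing in for F0012: the text must contain a line
# "SOURCE=BIZAP" and a line "EVENTID=8000", in any order.
F0012_PATTERN = re.compile(
    r"(?=.*^SOURCE=BIZAP$)(?=.*^EVENTID=8000$)",
    re.MULTILINE | re.DOTALL,
)


def matches_pattern(event_text):
    """Return True if the text-format event data satisfies the pattern."""
    return F0012_PATTERN.match(event_text) is not None


# Invented sample texts.
text_error = "TYPE=ERROR\nSERVER=SV01\nSOURCE=BIZAP\nEVENTID=8000\n"
text_other = "TYPE=ERROR\nSERVER=SV01\nSOURCE=BIZAP\nEVENTID=80001\n"

hit = matches_pattern(text_error)
miss = matches_pattern(text_other)
```

The line anchors keep "EVENTID=80001" from being mistaken for "EVENTID=8000", which a plain substring search would not do.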
When the service system 1 is only subjected to state monitoring, the action field is sometimes omitted since the actions are limited to administrator notifications and screen display. Moreover, descriptions for promoting understanding are sometimes omitted if the rules are intended only for skilled administrators. In any case, it is only essential that the event data showing the state of the service system 1 can be analyzed. That is, it is only necessary that whether or not the event data matches a condition can be decided so that matching event data can be extracted.
The event analysis section 4 includes an internal memory, and stores in the memory information (hereinafter, referred to as analysis state table) that associates the number for identifying each individual piece of filter information (filter number) with the number of operation rule information that describes the filter number as combined conditions.
If event data that matches filter information specified by a filter number is input to the event analysis section 4, the event analysis section 4 adds the number of the event data to the analysis state table in association with the filter information.
FIG. 28 shows examples of the analysis state table which the event analysis section 4 retains in the internal memory or the like. FIG. 28 shows an example of change of the analysis state table when the event data shown in FIG. 26 is successively input to the event analysis section 4 and the event analysis section 4 performs analyses (decision processing as to whether or not the event data matches the condition).
As shown in FIG. 28, the analysis state table lists the numbers of the respective pieces of filter information that are written as combined conditions in the group of operation rules 200 and the numbers of the pieces of operation rule information that describe the numbers of the pieces of filter information in association with the respective numbers of the pieces of filter information. If there is event data that matches filter information, the number of the event data is also added to the analysis state table in association with the filter information. Note that the group of operation rules 200 can be searched for filter information by each individual piece of operation rule information.
The analysis state table shown in FIG. 28, on the other hand, is information that lists all the numbers of the pieces of filter information, and is used to identify an operation rule from filter information. For example, referring to the row of the filter number F0011 in the analysis state table 301 shown in FIG. 28, the filter information is associated with the rule numbered R0120 in the group of operation rules 200 shown in FIG. 27.
This makes it possible to search for the operation rule information R0120 that includes the filter number F0011 in its combined condition. The row of the filter number F0011 in the analysis state table 301 shown in FIG. 28 also associates the event data E9002. This shows that the event data E9002 exists as event data that matches the filter information having the filter number F0011.
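The reverse lookup that the analysis state table provides can be sketched as follows. The dictionary layout and the function name build_analysis_state_table are assumptions; the rule and filter numbers are those given in the description (R0110 uses F0010, R0120 uses F0011 and F0012, R0130 uses F0013).

```python
# Combined conditions of the group of operation rules, as described.
RULES = {
    "R0110": ["F0010"],
    "R0120": ["F0011", "F0012"],
    "R0130": ["F0013"],
}


def build_analysis_state_table(rules):
    """Invert the rule -> filters mapping into filter -> rules, and add an
    initially empty list of matching event numbers for each filter."""
    table = {}
    for rule_no, filter_nos in rules.items():
        for filter_no in filter_nos:
            entry = table.setdefault(filter_no, {"rules": [], "events": []})
            entry["rules"].append(rule_no)
    # Keep the rows in filter-number order so that matching can proceed
    # from the top of the table.
    return dict(sorted(table.items()))


table = build_analysis_state_table(RULES)

# Event data E9002 is decided to match F0011, so its number is recorded.
table["F0011"]["events"].append("E9002")

rules_for_f0011 = table["F0011"]["rules"]
```

After these steps the row for F0011 holds the rule number R0120 and the event number E9002, which corresponds to the state described for the analysis state table 301.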
When the event data E9002 is input from the event data acquisition section 2, the event analysis section 4 performs matching with the filter information from the top of the table in order. That is, the event analysis section 4 decides whether or not the event data matches the filter information with respect to each filter number written in the analysis state table. If it is determined that the event data matches the filter information, the event analysis section 4 writes the number of the event data into the analysis state table in association with the number of the filter information that is decided to match. From the number of the event data written in the analysis state table, it is possible to identify the event data to be subjected to a decision whether or not to match the combined condition of the corresponding operation rule information.
FIG. 29 is a flowchart showing an example of the operation of the event analysis section 4 in the conventional operation management apparatus illustrated in FIG. 25. Hereinafter, the operation of the conventional operation management apparatus will be described with reference to FIGS. 25 to 29.
A group of operation rules and filter information are input to the user interaction section 5 by the administrator in advance. The user interaction section 5 stores the group of operation rules and the filter information into the decision condition accumulation section 3. Here, description will be given of an example where the group of operation rules 200 shown in FIG. 27A and a set of filter information including the filter information 201 and 202 shown in FIG. 27B are stored in the decision condition accumulation section 3.
The event data generating section 71 of the service system 1 detects the operating state of the service system 1 and generates event data successively, and transmits the event data to the operation management apparatus 41. Receiving the event data from the service system 1, the event data acquisition section 2 stores the event data into the event accumulation section 7 and outputs the event data to the event analysis section 4.
The event analysis section 4 accepts event data from the event data acquisition section 2. If there is no event data input, the event analysis section 4 waits for the input of event data (step S701).
If the event data is input from the event data acquisition section 2 (Yes at step S701), the event analysis section 4 decides whether or not filter information that is described as a combined condition by operation rule information included in the group of operation rules 200 matches the event data, for example, in the following way.
The event analysis section 4 refers to the analysis state table illustrated in FIG. 28 to decide whether, among the pieces of filter information whose filter numbers are listed in the analysis state table, there is any filter information that has not yet been checked against the event data (step S702). If there is any filter information that has not yet been checked against the event data (Yes at step S702), the event analysis section 4 decides whether or not one of those pieces of filter information and the event data match (step S703).
If the result of decision at step S703 is that the filter information and the event data do not match (No at step S704), the event analysis section 4 proceeds to step S702 to repeat the operation of step S702 and subsequent steps. If there is no filter information that has not yet been checked against the event data (No at step S702), the event analysis section 4 proceeds to step S701 to wait for the input of new event data.
If the result of decision at step S703 is that the filter information and the event data match (Yes at step S704), the event analysis section 4 records the number of the input event data in the analysis state table in association with the filter number of the filter information that is decided to match (step S705).
For example, when the event data E9002 shown in FIG. 26A is input from the event data acquisition section 2 and the event data E9002 is decided to match the filter information 201 which is specified by the filter number F0011, the event analysis section 4 records E9002 in the analysis state table in association with the filter number F0011 (step S705). The analysis state table 301 shown in FIG. 28 shows such a state.
The event analysis section 4 further identifies the operation rule information from the filter number of the filter information that matches the event data, and decides whether or not the event data satisfies the combined condition of the operation rule information (step S706).
If the result of decision at step S706 is that the event data does not satisfy the combined condition (No at step S707), the event analysis section 4 proceeds to step S701 to wait for the input of new event data. On the other hand, if the result of decision at step S706 is that the event data satisfies the combined condition (Yes at step S707), the event analysis section 4 performs the action specified by the operation rule information that is identified at step S706 (step S708). The number of the event data input to the event analysis section 4 is then deleted from the analysis state table (step S709).
For instance, when the processing of step S705 is performed as in the foregoing example, the event analysis section 4 identifies the rule number R0120 corresponding to the filter number F0011 from the analysis state table 301, and decides whether or not the event data satisfies the combined condition of the operation rule information “F0011 AND F0012” (step S706).
Referring to the analysis state table 301 at the point where the event data E9002 has been input to the event analysis section 4, F0011 is true since corresponding event data is recorded, but F0012 is false since no corresponding event data is recorded. The combined condition of the operation rule information R0120 is thus false. That is, the event data E9002 does not satisfy the combined condition of the operation rule information R0120 (No at step S707). The event analysis section 4 therefore proceeds to step S701 to wait for the input of new event data.
After the operation described above, if the event data E9003 (see FIG. 26) is input to the event analysis section 4 (Yes at step S701), the event analysis section 4 identifies the filter information that matches the event data E9003 (steps S702 to S704). Suppose here that the event data E9003 and the filter information having the filter number F0010 match. Then, the event analysis section 4 records E9003 in association with F0010 (step S705).
Subsequently, the event analysis section 4 identifies the operation rule information R0110 corresponding to F0010, and decides whether or not the event data E9003 satisfies the combined condition of the operation rule information (step S706). Since the combined condition of the operation rule information R0110 includes F0010 alone (see FIG. 27A), the event data E9003 satisfies the combined condition which consists of F0010 (Yes at step S707).
The event analysis section 4 would therefore perform the action of the operation rule information (step S708), but no processing is actually performed since the operation rule information R0110 includes no description of corresponding actions (see FIG. 27A).
Since the decision on the condition of the operation rule information is completed, the event analysis section 4 then deletes the corresponding event (here, E9003) from the analysis state table (step S709), and proceeds to step S701. The analysis state table at this point in time is the same as the analysis state table 301 shown in FIG. 28.
Now, the event data E9004 is similarly input to the event analysis section 4. If it is decided that the event data and the filter information having the filter number F0012 match (steps S701 to S704), the event analysis section 4 records E9004 in the analysis state table as a corresponding event (step S705). That is, the number E9004 of the event data is stored in association with F0012.
This transforms the analysis state table 301 into the analysis state table 302 shown in FIG. 28. The event analysis section 4 also identifies the operation rule information R0120 corresponding to F0012, and decides whether or not the event data E9004 satisfies the combined condition of the operation rule information (step S706). The decision on the condition here results in true (Yes at step S707) since F0011 and F0012 both have corresponding events in the analysis state table 302 (see FIG. 28).
The event analysis section 4 therefore performs an action on the service system 1 through the user interaction section 5 and the control section 6 (step S708), and deletes the events (E9002 and E9004) corresponding to the operation rule information R0120 (step S709). This transforms the analysis state table 302 into the analysis state table 303 shown in FIG. 28.
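The sequence of steps S702 to S709 for the events E9002, E9003, and E9004 can be traced with the following sketch. It is a simplified illustration, not the apparatus itself: the data layout, the function names, the content of filter F0010, and the attributes of E9003 are assumptions (the description only supposes that E9003 matches F0010), and the action is merely recorded rather than executed.

```python
# Filter information. F0011 and F0012 follow the description; the content of
# F0010 is not given there and is invented for this sketch.
FILTERS = {
    "F0010": {"EVENTID": "1000"},                       # hypothetical
    "F0011": {"SOURCE": "updater", "EVENTID": "4001"},
    "F0012": {"SOURCE": "BIZAP", "EVENTID": "8000"},
}

# Operation rule information: combined condition (AND of filters) and action.
RULES = {
    "R0110": {"filters": ["F0010"], "action": None},
    "R0120": {"filters": ["F0011", "F0012"],
              "action": "Mig($F0012.SOURCE,$SV(NOUPDATE))"},
}


def new_state_table():
    """Analysis state table: filter number -> rule numbers and event numbers."""
    table = {f: {"rules": [], "events": []} for f in FILTERS}
    for rule_no, rule in RULES.items():
        for filter_no in rule["filters"]:
            table[filter_no]["rules"].append(rule_no)
    return table


def matches(event, flt):
    return all(event.get(name) == value for name, value in flt.items())


def process_event(number, event, table, executed):
    """One pass of steps S702 to S709 for a single piece of input event data."""
    for filter_no, flt in FILTERS.items():              # S702, S703
        if not matches(event, flt):                     # No at S704
            continue
        table[filter_no]["events"].append(number)       # S705
        for rule_no in table[filter_no]["rules"]:       # S706
            rule = RULES[rule_no]
            if all(table[f]["events"] for f in rule["filters"]):  # Yes at S707
                if rule["action"]:
                    executed.append((rule_no, rule["action"]))    # S708
                for f in rule["filters"]:               # S709
                    table[f]["events"].clear()
        return                                          # back to S701


table, executed = new_state_table(), []
process_event("E9002", {"SOURCE": "updater", "EVENTID": "4001"}, table, executed)
process_event("E9003", {"SOURCE": "monitor", "EVENTID": "1000"}, table, executed)
# State after E9003: corresponds to the analysis state table 301.
after_e9003 = {f: list(row["events"]) for f, row in table.items()}
process_event("E9004", {"SOURCE": "BIZAP", "EVENTID": "8000"}, table, executed)
# State after E9004: corresponds to the analysis state table 303.
after_e9004 = {f: list(row["events"]) for f, row in table.items()}
```

In this trace only E9002 remains recorded after E9003 is processed, and the action of R0120 is recorded exactly once, when E9004 arrives, after which no event numbers remain in the table.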
The event data E9005 shown in FIG. 26A is event data that indicates the result of the action performed thus, and matches the combined condition of the operation rule information R0110.
The event analysis section 4 may create a list that includes pieces of event data that match the combined conditions of the operation rule information among the event data, and store the list in the event accumulation section 7. In such a mode, the list is created in the same format as that of the event list shown in FIG. 26A.
When a list of pieces of event data that match the combined conditions of the operation rule information is thus created, the event analysis section 4 presents the list to the administrator through the user interaction section 5. The administrator can refer to the event data E9005 in the list to know that the failure has been automatically handled.
In cases such as when a failure is found against which no action is specified by the operation rule information, it is possible to give a control instruction to the control section 6 through the user interaction section 5 and to manually handle the failure. Note that all the event data received by the event data acquisition section 2 is stored in the event accumulation section 7 in the form of respective files or an event list. Such event data (event list) can be displayed on the user interaction section 5 so that the administrator can check the detailed information.
As described above, the operation management apparatus of the related art shown in FIG. 25 specifies in the operation rule information the combined conditions for extracting successively-occurring event data, so that it is possible to monitor and take actions against abnormal states of the service system 1 that are assumed in advance. This reduces the operation burdens of the administrator significantly as compared to the case of manually monitoring and handling a large number of devices. The automation also allows constant monitoring and handling irrespective of the experience and skills of the administrator, thereby improving the quality of the operation management.
Patent Document 2 describes a performance monitoring method that includes the step of predicting the possibility of occurrence of a future fault in an information processing system.
Patent Document 1: JP-A-2006-244404
Patent Document 2: JP-A-2005-327261 (paragraph 0009)