Recently, organizations have introduced information processing systems that provide various services to end users. In the computer system of such an organization, a manager performs managing operations sequentially according to an instruction manual. The instruction manual is a detailed operation manual (document) generated for each individual operation work item, and includes a patch application instruction manual, a backup instruction manual, a monitor instruction manual, etc.
However, if the description of an instruction manual is unclear, the contents of the processes performed in a managing operation may depend on the operation manager, so that the quality standard of an operation becomes unstable and the contents and quality of the operation vary depending on the operator. Moreover, an unclearly written part is hard to detect. Although such a part is supposed to be detected by a user reviewing the descriptions of the instruction manual, the user often overlooks it.
In the operation management of an IT (information technology) architect, there is a regulation that an instruction manual without operation rules is not to be used.
FIGS. 1 and 2 are explanatory views of an example of the operation managing method of a system using an instruction manual.
In this example, the instruction manual is a detailed operation manual (operation instruction manual) generated for each individual operation work item, such as a monitor instruction manual, an online operation instruction manual, a backup instruction manual, etc. Generally, in the development phase, an infrastructure team (basic operation team) takes the lead in generating the documents, and in the preparation stage of the operation maintenance phase, a leader group for the maintenance operation inherits the documents.
Before the instruction manual is inherited, the documents mainly describe each individual operation. In practice, a managing operation group adds and checks the operation flow based on the operation rules, which cover the operation system used when an operation is performed, the operation approver, the operation schedule, the report format, a contingency method, the relation to other operation work, an operation quality index, etc. These items are described in upper-level documents of the operation instruction manual, and either a link is prepared from the operation flow to each instruction manual, or the operation flow etc. to be added to the instruction manual is described.
When a managing operation is performed using an instruction manual in which the operation rules are not described, the method of accepting an operation and the quality standard vary, and the reporting and notification rule applied when an operation ends is unclear, so that the contents and quality of the operation vary depending on the operator.
For example, as illustrated in FIG. 2, assume that, in a monitoring operation, there is an instruction manual describing “when a monitor node is normally added to a monitor tool, the next operation is performed . . . ”. In this case, since the criterion for confirming the normal termination of the node adding operation is unclear, the determination varies depending on the person in charge: whether it is sufficient that the node has been added on a GUI (graphical user interface), whether it is sufficient that no error is found when a log file is checked, or whether it is sufficient that an error message is normally transmitted and displayed.
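As an illustration only, the ambiguity in the example above disappears once the confirmation criterion is written down explicitly. The following minimal sketch assumes that an operation rule requires all three confirmations named above; the class name, field names, and the rule of searching the log for the word "ERROR" are hypothetical and do not come from any particular monitor tool.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NodeAddObservation:
    """What the person in charge observed after adding a monitor node (hypothetical fields)."""
    shown_on_gui: bool              # the node appears on the monitor tool GUI
    log_lines: Tuple[str, ...]      # log file lines written after the addition
    test_message_displayed: bool    # a test error message was transmitted and displayed


def log_has_no_error(log_lines):
    """Assumed log criterion: no line contains the word ERROR (case-insensitive)."""
    return not any("ERROR" in line.upper() for line in log_lines)


def node_added_normally(obs):
    """Assumed rule: all three confirmations are required before the next operation."""
    return (
        obs.shown_on_gui
        and log_has_no_error(obs.log_lines)
        and obs.test_message_displayed
    )


ok = NodeAddObservation(True, ("INFO node web01 added",), True)
gui_only = NodeAddObservation(True, ("error: agent unreachable",), False)
print(node_added_normally(ok))        # True
print(node_added_normally(gui_only))  # False
```

With such a rule, two persons in charge who observe the same state reach the same determination, whichever of the three checks they would personally have considered sufficient.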
In addition, the instruction manual may not describe who determines (or approves) the normal termination and passes control to the next operation, where the trail is recorded, to whom an abnormal state is to be escalated, etc. In some operation fields, the operation rules exist only in the memory of a specific operation manager, who answers any questions. A person with long field experience may hold the rules as implicit rules. However, this situation promotes dependency on individuals and may impair the transparency of the operation. Therefore, it is preferable that the operation rules are stipulated.
In the example above, suppose that the operation rules regulate the following: the checking criterion for confirming that an operation is normal (confirmation by two patterns such as log confirmation and input of a status command, etc.), the execution system of the operation (an operation that must be performed by two persons, etc.), the destination of the operation completion report and the termination criterion, and the format and storage of the operation trail, etc. Then a “managing operation instruction manual” whose operation flow includes the contents above can be completed. In addition, stipulating the operation flow based on common rules not only guarantees the operation quality, but also guarantees the transparency of the operations and allows information to be shared among the operations.
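The rule items listed above can be treated as a checklist against which an instruction manual is examined. The following is a minimal sketch under the assumption that each manual is represented as a dictionary of rule items; the key names and the sample manual are hypothetical and chosen only to mirror the items named in the text.

```python
# Rule items taken from the paragraph above; the key names are hypothetical.
REQUIRED_RULE_ITEMS = (
    "normality_check_criterion",   # e.g. log confirmation plus a status command
    "execution_system",            # e.g. must be performed by two persons
    "completion_report_destination",
    "termination_criterion",
    "trail_format",
    "trail_storage",
)


def missing_rule_items(manual):
    """Return the required rule items that are absent or left empty in a manual."""
    return [
        item for item in REQUIRED_RULE_ITEMS
        if not str(manual.get(item, "")).strip()
    ]


monitor_manual = {
    "normality_check_criterion": "log confirmation and status command",
    "execution_system": "two persons",
    "trail_format": "",  # left blank by the author
}
print(missing_rule_items(monitor_manual))
# ['completion_report_destination', 'termination_criterion', 'trail_format', 'trail_storage']
```

A manual for which the returned list is empty contains every item of the assumed checklist; whether each entry is itself clearly written still has to be judged separately.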
The basis for generating the operation rules described above is an operation management policy determined by the related organization. As corporate rules, the policy regulates common rules in the operation maintenance phase (for example, a person capable of changing a system and approving a release is limited to a person having the authority of a field manager or a person in a higher position), processes that must be performed, etc.
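A common rule of the kind given in parentheses above can be stated as a simple check. The sketch below is only an illustration: the role names other than the field manager and their ordering are assumptions, since the text specifies only that the field manager or a person in a higher position may approve.

```python
from enum import IntEnum


class Role(IntEnum):
    """Hypothetical positions, ordered from lower to higher."""
    OPERATOR = 1
    FIELD_MANAGER = 2
    DEPARTMENT_MANAGER = 3


def may_approve_release(role):
    """Policy from the text: field manager or a higher position may approve a release."""
    return role >= Role.FIELD_MANAGER


print(may_approve_release(Role.OPERATOR))            # False
print(may_approve_release(Role.FIELD_MANAGER))       # True
print(may_approve_release(Role.DEPARTMENT_MANAGER))  # True
```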
Then, operation instructions are generated based on the operation rules. Thus, the individual managing operation flow can be clearly regulated. If the operation management instructions are checked in the operation field based on the operation policy, the entire operation can be standardized.
Assume that a three-tier Web system (Web server+application server+DB server) provides a Web mail service, and that the system is operation-managed. In the operation management, when a fault occurs, the service stops or the performance is degraded (slow response), thereby inflicting a loss upon a client. Therefore, it is required to prevent faults and to recover promptly from a fault if one occurs.
A fault is often caused by a mistake of an operation manager, for example, an erroneous operation performed when applying a patch to a Web server etc. In practice, a patch application instruction is often prepared as an instruction manual, and an operator performs the operation while referring to the instruction manual. Although the instructions could ideally be described entirely by a script and automated, complete automation cannot be realized in practice. When there are errors in the instructions or the instruction manual, or when they are unclearly written, the operation manager may misunderstand them and perform an erroneous operation.
Assume that the instruction manual includes an instruction to temporarily stop the Web server, perform maintenance and inspection, and then restart the server. Also assume that an operation manager A shuts down the Web server, inspects it, and activates it. In contrast, an operation manager B changes the settings of a load balancer (temporarily deletes the current settings from the allocation list), shuts down the Web server, inspects it, activates it, and changes the settings of the load balancer again (adds new settings to the allocation list).
In this case, the operation manager A omits the setting change of the load balancer, and therefore requests continue to be allocated to the stopped Web server. As a result, a fault occurs in which the requests cannot be processed. The setting change of the load balancer is not described in the instruction manual because it is considered obvious that the change is made when the Web server is stopped. However, when a person not familiar with the system configuration follows the instruction, a fault occurs.
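The difference between the two operation managers can be reproduced with a small simulation. This is a minimal sketch, not a model of any real load balancer: the classes, the two-server configuration, and the assumption that one request is allocated to every server on the allocation list during the inspection are all hypothetical.

```python
class WebServer:
    def __init__(self, name):
        self.name = name
        self.running = True


class LoadBalancer:
    def __init__(self, servers):
        # Servers that currently receive requests.
        self.allocation_list = list(servers)


def failed_requests(lb):
    """Assume one request goes to each listed server; count those sent to a stopped server."""
    return sum(1 for server in lb.allocation_list if not server.running)


def maintain_as_manager_a(lb, server):
    """Stop, inspect, restart: the load balancer setting is left unchanged."""
    server.running = False
    failures = failed_requests(lb)   # requests arriving during the inspection
    server.running = True
    return failures


def maintain_as_manager_b(lb, server):
    """Remove from the allocation list, stop, inspect, restart, add back."""
    lb.allocation_list.remove(server)
    server.running = False
    failures = failed_requests(lb)   # requests arriving during the inspection
    server.running = True
    lb.allocation_list.append(server)
    return failures


web1, web2 = WebServer("web1"), WebServer("web2")
lb = LoadBalancer([web1, web2])
print(maintain_as_manager_a(lb, web1))  # 1
print(maintain_as_manager_b(lb, web1))  # 0
```

Under these assumptions, the procedure of the operation manager A loses the request allocated to the stopped server, whereas the procedure of the operation manager B loses none, which is why the load balancer steps belong in the written instruction rather than in the operator's background knowledge.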