1. Field of the Invention
An exemplary embodiment relates to an automated management system and method for fault events of a data center, and more particularly, to a system for automatically managing a fault event occurring in a data center and a method thereof.
2. Discussion of Related Art
With the development of cloud environments in recent years, the effective operation and management of data centers has emerged as an important issue. Most conventional methods of managing a data center depend on previous operation records, and faults of the data center are managed manually. In this case, it is impossible to predict the occurrence of fault events or to respond to them rapidly and automatically. In addition, it is difficult to predict, or take proactive measures against, new types of fault events that have not previously occurred.
Representative examples of existing fault management systems include the self-monitoring, analysis and reporting technology (S.M.A.R.T) suggested by IBM. S.M.A.R.T monitors abnormal operation of storage, tracks the cause of errors that have occurred, and provides predictions on faults that may occur in the future. For this purpose, predictive failure analysis (PFA) technology is used. PFA tracks abnormal operation of equipment (including monitoring of normal operation) and the relevant potential causes of errors that have occurred, by using machine learning and mathematical modeling based on past disk data.
S.M.A.R.T has the following drawbacks and limitations.
First, the subject of monitoring is limited to hard disk drives. Accordingly, the monitoring is restricted to attributes of the hard disks, for example, read error rate and reallocated sectors count.
Second, even though a prediction model is produced using machine learning and mathematical modeling schemes based on various types of data generated from the hard disk, the prediction model is applied only to erroneous operation patterns that have been internally generated up to the present. Accordingly, S.M.A.R.T is configured to operate based on a threshold derived from past performance record data. Meanwhile, it is reported that 50% of hard disk failures occur without an alarm notification on the system.
Third, S.M.A.R.T cannot predict or respond to a potential fault event, such as a system going down due to a conflict between versions of software (for example, system software and middleware applications).
Fourth, S.M.A.R.T provides only two types of notification messages during monitoring (‘Device is OK’ or ‘Drive is likely to fail soon’).
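For illustration only, the threshold-based monitoring described in the drawbacks above may be sketched as follows. This is a minimal sketch and not the actual S.M.A.R.T implementation: the names AttributeReading and evaluate_drive, the sample values, and the convention that a value at or below its threshold indicates a likely failure are all assumptions introduced for this example. The sketch shows how a check limited to per-attribute thresholds can yield only the two notification messages and cannot express faults that arise outside the monitored disk attributes.

```python
from dataclasses import dataclass

# The two notification types mentioned in the text.
OK_MESSAGE = "Device is OK"
FAIL_MESSAGE = "Drive is likely to fail soon"


@dataclass(frozen=True)
class AttributeReading:
    """One monitored disk attribute (hypothetical structure for this sketch)."""
    name: str
    value: int      # assumed convention: a higher value means a healthier drive
    threshold: int  # assumed limit derived from past performance records


def evaluate_drive(readings):
    """Return one of the two messages from per-attribute threshold checks."""
    for reading in readings:
        # Assumption: a value at or below its threshold signals a likely failure.
        if reading.value <= reading.threshold:
            return FAIL_MESSAGE
    return OK_MESSAGE


# Illustrative values only; they are not measurements from any real drive.
healthy = [
    AttributeReading("read_error_rate", 100, 50),
    AttributeReading("reallocated_sectors_count", 98, 36),
]
degraded = [
    AttributeReading("read_error_rate", 100, 50),
    AttributeReading("reallocated_sectors_count", 30, 36),
]

print(evaluate_drive(healthy))
print(evaluate_drive(degraded))
```

In this sketch, a fault that leaves every monitored attribute above its threshold, such as a software version conflict, still produces the ‘Device is OK’ message, which corresponds to the limitations noted above.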
FIG. 1 shows the entire configuration of PFA used in S.M.A.R.T.
The conventional technology described above monitors internally generated data on the basis of an analysis model, obtained from machine learning and mathematical modeling, and a threshold regarded as indicating normal operation. It therefore has difficulty predicting and responding to fault events, such as system errors, that are not anticipated internally. In addition, because the conventional technology monitors only a certain device (e.g., hard disks), it cannot offer response plans for the various types of fault events that may occur due to software installed and operated in a system. Furthermore, the conventional technology mainly performs monitoring, which leads to a great number of erroneous failure detections and burdens the system operator with unnecessary tasks.