It is well known in the art for computers to encounter faulty hardware and/or software during storage and retrieval of data. For example, an error may arise when the computer unexpectedly encounters a breakdown in hardware, e.g. in magnetic media (such as a hard disk) where the data is stored. In addition to faulty hardware, errors can also arise due to bugs in software, e.g. an application program may overwrite data of another application program, or an application program may improperly use an application programming interface (API) of the underlying operating system and thereby cause wrong data to be stored and/or retrieved. Such faults are called data corruptions. Thus, a fault can arise during normal operation in any component of a system. Examples of components are network interface circuitry, disks, operating system, application programs, cache, device driver, storage controller, etc.
Some application programs, such as database management systems (DBMS), may generate errors when data corruptions are detected, e.g. if a previously-stored checksum does not match a newly-calculated checksum. A single fault (also called a “root cause”) can result in multiple failures with different symptoms; moreover, a single symptom can correspond to multiple failures. Knowing a symptom or a root cause of a failure is sometimes not enough for a human to formulate one or more recommendations to repair the failed hardware, software or data.
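By way of illustration only, the checksum comparison mentioned above may be sketched as follows. The sketch assumes a CRC-32 checksum computed with Python's standard `zlib` module; the function names `store_block` and `verify_block` and the sample data are hypothetical and are not taken from any particular DBMS.

```python
import zlib


def store_block(data: bytes):
    """Return the data together with a checksum computed at write time."""
    return data, zlib.crc32(data)


def verify_block(data: bytes, stored_checksum: int) -> bool:
    """Recompute the checksum at read time and compare it with the stored one."""
    return zlib.crc32(data) == stored_checksum


# Hypothetical block contents, written once and read back twice.
payload, checksum = store_block(b"account=42;balance=100")
corrupted = b"account=42;balance=900"  # one byte altered after the write

intact_ok = verify_block(payload, checksum)       # checksums match
corrupt_ok = verify_block(corrupted, checksum)    # mismatch: an error would be raised
```

A mismatch of this kind reports only that the block is corrupted. It does not identify which fault caused the corruption or how the data should be repaired.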
Manually reviewing such errors (by a system administrator) and identifying one or more faults which caused them to be generated can become a complex and time-consuming task, depending on the type and number of errors and faults. Specifically, the task is complicated by the fact that some errors are not generated immediately when a fault occurs, e.g. a fault may cause corrupted data to be stored to disk and even backed up, with errors due to the fault being generated a long time later, when the data is read back from disk. Furthermore, errors due to a single fault do not necessarily appear successively, one after another. Sometimes errors due to multiple faults that occur concurrently are interspersed among one another, which increases the task's complexity. Also, information about some faults is interspersed among different types of information, such as error messages, alarms, trace files and dumps, failed health checks, etc. Evaluating and correlating such information is a difficult task that is commonly done manually in the prior art, which is error-prone and time-consuming.
Error correlation can be done automatically instead of manually. Systems for automatic error correlation are commonly referred to as “event correlation systems” (see an article entitled “A Survey of Event Correlation Techniques and Related Topics” by Michael Tiffany, published on 3 May 2002). However, such systems require a user to manually specify correlation rules that capture relationships between errors. Such rules, when applied to data storage systems that generate many types of errors under many different failure scenarios, can be very complex. They are also often based on a temporal ordering of errors that might not be correctly reported by a data storage system. This makes such systems prone to generating wrong results, i.e. false positives and false negatives. Moreover, any new error type added to the system, or any new failure scenario, requires reconsideration of the correlation rules, which makes the rules difficult to maintain and, therefore, even less reliable. Finally, an error correlation system is intended to find a “root cause” fault, which can differ from the data failure itself: the root cause does not indicate which data is corrupted or to what extent.
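By way of illustration only, a manually specified, time-ordered correlation rule of the kind discussed above may be sketched as follows. The sketch is not taken from any particular event correlation system; the rule table, the error kinds, the 60-second window and the names `ErrorEvent` and `correlate` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    timestamp: float  # seconds, as reported by the storage system
    kind: str


# Hypothetical rule: (first error, second error, window in seconds, inferred cause).
RULES = [("disk_io_error", "checksum_mismatch", 60.0, "failing_disk")]


def correlate(events):
    """Return the causes whose rule matches the reported order and timing."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    causes = []
    for first_kind, second_kind, window, cause in RULES:
        for i, first in enumerate(ordered):
            if first.kind != first_kind:
                continue
            for second in ordered[i + 1:]:
                if second.timestamp - first.timestamp > window:
                    break  # outside the rule's time window
                if second.kind == second_kind:
                    causes.append(cause)
                    break
    return causes


# Errors reported in the expected order and within the window.
in_order = [ErrorEvent(10.0, "disk_io_error"), ErrorEvent(25.0, "checksum_mismatch")]
# Same errors, but the mismatch carries an earlier (misreported) timestamp.
misordered = [ErrorEvent(10.0, "disk_io_error"), ErrorEvent(5.0, "checksum_mismatch")]
# Same errors, but the mismatch surfaces a day later, when the data is read back.
delayed = [ErrorEvent(10.0, "disk_io_error"), ErrorEvent(86410.0, "checksum_mismatch")]

matched = correlate(in_order)
missed_by_order = correlate(misordered)
missed_by_delay = correlate(delayed)
```

In this sketch the rule matches only the first sequence. The misreported timestamp and the delayed error each produce a false negative, and even a successful match names a cause without indicating which data is corrupted.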
Moreover, even after a fault has been identified correctly by a system administrator, repairing and/or recovering data manually requires a high degree of training and experience in using various complex tools that are specific to the application program. For example, a tool called “recovery manager” (RMAN) can be used by a database administrator to perform backup and recovery operations for the database management system Oracle 10g. Even though such tools are available, human users often do not have sufficient experience in using them because data faults do not occur often. Moreover, user manuals and training materials for such tools usually focus on one-at-a-time repair of each specific problem, although the user is typically faced with a number of such problems at once. Also, the user often pays a high penalty for making poor decisions as to which problem to address first and which tool to use, in terms of increased downtime of the application program and data loss. To sum up, fault identification and repair of data in the prior art can be one of the most daunting, stressful and error-prone tasks when performed manually.