When a storage device fails, the cause of the failure is often not easily understood. A failing storage device can record an error condition, including information regarding hardware errors, recoverable errors, and other environmental data. The storage device then notifies the system to which it is connected of the error, and the system logs the error in a general system log at the time the error occurs.
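As a rough illustration of this flow, the following Python sketch models an error record reported by a device and a system that appends it to a general log when the notification arrives. The field names (`device_id`, `category`, `detail`), the category values, and the `GeneralSystemLog` class are all hypothetical; the text does not prescribe a record or log format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class ErrorRecord:
    """Hypothetical error condition reported by a storage device."""
    device_id: str  # assumed identifier, e.g. "disk-07"
    category: str   # assumed values: "hardware", "recoverable", "environmental"
    detail: str


@dataclass
class GeneralSystemLog:
    """Hypothetical general log shared by every component of the system."""
    entries: List[str] = field(default_factory=list)

    def log(self, source: str, message: str) -> str:
        # The entry is stamped at the moment it is logged.
        stamp = datetime.now(timezone.utc).isoformat()
        entry = f"{stamp} {source}: {message}"
        self.entries.append(entry)
        return entry

    def on_device_error(self, record: ErrorRecord) -> str:
        # Called when a device notifies the system of an error condition.
        return self.log(record.device_id, f"[{record.category}] {record.detail}")
```

In this sketch, device errors and unrelated events such as logins pass through the same `log` method, so they end up interleaved in a single list of entries.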
A general system log is a file that contains a history of everything that happens on the system. The logging functionality runs in the background (i.e., it is always running) and is used by the operating system and the applications and services available on the system to record information. The log's location can be determined by a system administrator, but the log is generally stored in a location that is accessible by all of the components of the system, such as on a centrally located host.
A log entry is generated for each individual event, including system logins and failures reported by different hardware and software. Because the system log stores information about all components of the system, the log file can become large rather quickly. The problem with the general system log is that, by definition, it provides a history of everything that has happened in the system. The log is therefore not concise, and finding information related to a single failed disk, for example, can be difficult.
The general system log contains a large amount of information about events occurring throughout the system, not just about storage device-related errors. To determine why a storage device failed, the log needs to be reviewed to locate all of the information about the failed storage device. This review becomes more burdensome as the number of storage devices in the system increases, because the general system log becomes larger and the information relating to a single storage device is scattered throughout it.
For example, if a storage device generated errors periodically (as opposed to several errors at the same time), the log would have to be reviewed over a potentially long period of time to find all of the errors relating to that storage device. Furthermore, because different types of errors can be related to the failure of a single storage device, a person reviewing the log needs to know the storage device, how the storage device is connected to the storage system, and where in the log to look for all of the information relevant to the storage device. This manual process is time-consuming, and the reviewer may miss a piece of information that is important in analyzing why the storage device failed.
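The review described above can be sketched as a scan over a mixed log for entries naming one device. The log line layout (`<timestamp> <source>: <message>`), the sample entries, and the function name `entries_for_device` are assumptions made for illustration only, not a format taken from any particular system.

```python
from typing import Iterable, List, Tuple

# Hypothetical general-log lines in the form "<timestamp> <source>: <message>".
SAMPLE_LOG = [
    "2024-01-03T08:00:01 auth: user admin logged in",
    "2024-01-03T08:14:22 disk-07: [recoverable] read retry succeeded",
    "2024-01-09T02:30:10 net0: link down",
    "2024-01-17T11:02:45 disk-07: [environmental] temperature above threshold",
    "2024-01-17T11:05:00 disk-03: [recoverable] read retry succeeded",
    "2024-02-02T23:59:59 disk-07: [hardware] device not responding",
]


def entries_for_device(lines: Iterable[str], device_id: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs whose source field equals device_id."""
    matches = []
    for number, line in enumerate(lines, start=1):
        parts = line.split(" ", 2)
        # parts[1] is the source field with its trailing colon, e.g. "disk-07:".
        if len(parts) == 3 and parts[1] == f"{device_id}:":
            matches.append((number, line))
    return matches
```

Calling `entries_for_device(SAMPLE_LOG, "disk-07")` on this sample picks out three entries spread across a month of unrelated events. Even so, a filter like this only finds entries whose source field names the device; an error logged by another component on the device's behalf would be missed unless the reviewer already knows how the device is connected.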
If detailed information on the history of the storage device were available after the storage device stopped communicating with the system, that history could be examined to help determine why the storage device failed. The information could summarize the events leading to the failure and provide a conclusive reason why the storage device is currently inaccessible. For example, the storage device may have encountered a specific error that caused it to fail, or a series of errors over time may have indicated that the device would fail soon.
Existing approaches return pages of error messages and status messages, leaving it to a storage system administrator to determine the reason for the storage device failure. There is therefore a need to collect all of the information relevant to a storage device failure in one location, both to make the reason for the failure easier to analyze and to report this information to a storage system administrator or other user with appropriate privileges.