1. Field of the Invention
This invention relates to error event tracking and handling in a computer system and more particularly relates to an apparatus, system, and method for facilitating monitoring and responding to error events.
2. Description of the Related Art
Computer and information technology continues to progress and grow in its capabilities and complexity. In particular, data storage systems continue to evolve to meet the increasing demands for reliability, availability, and serviceability of the physical data storage system and its hardware, software, and various other components. Data storage systems often handle mission critical data. Consequently, data storage systems are expected to remain on-line and available according to a 24/7/365 schedule. Furthermore, data storage systems are expected to handle power and service outages, hardware and software failures, and even routine system maintenance without significantly compromising their reliability or their availability to handle data Input/Output (I/O) from hosts.
Generally, hardware components used in computer systems, such as storage systems, are inexpensive but prone to unpredictable and/or sporadic failures, each referred to as an error event. The error event may result in loss of system availability, data integrity, or both. To maintain high availability, even in view of hardware failures, modern computer systems include redundant hardware components that provide additional capabilities, redundant capabilities, or both. These hardware components are often referred to as Field Replaceable Units (FRUs).
Error recovery techniques have been developed to recover lost or corrupted data and/or restore the hardware experiencing the error event to a reliable operating state. As a single hardware component continues to experience error events, error recovery modules may mark the hardware for replacement during a subsequent service call for the computer system.
Consequently, in conjunction with error recovery functions, error recovery software has tracked error events using simple counters that are incremented each time an error event occurs. These simple counters provide some general information regarding the reliability of a hardware component. For example, a high value for the simple counter can indicate a general failure of the hardware component. Generally, these counters begin recording error events once the hardware component is powered and online. The counters continue to register error events until the hardware component is reset, power cycled, fenced (meaning the hardware component is taken offline either logically or physically), or replaced. When a counter satisfies a threshold, error recovery is initiated on that hardware component. The counters provide only a total error event count; they include no information regarding the frequency of error events during specific time periods between when the counter starts monitoring and when the counter is reset.
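By way of illustration only, the conventional counter behavior described above can be sketched as follows. The class name `SimpleErrorCounter`, its methods, and the threshold value are hypothetical and are not taken from any particular error recovery implementation; the sketch merely shows a running total compared against a single fixed threshold, with no record of when each error event occurred.

```python
class SimpleErrorCounter:
    """Hypothetical sketch of a conventional error event counter."""

    def __init__(self, threshold):
        # The threshold is fixed at construction, mirroring the hard-coded
        # thresholds of traditional counters (the value is an assumption).
        self.threshold = threshold
        self.count = 0

    def record_error(self):
        """Count one error event; return True if recovery should start."""
        self.count += 1
        return self.count >= self.threshold

    def reset(self):
        """Model a reset, power cycle, fence, or replacement of the component."""
        self.count = 0


counter = SimpleErrorCounter(threshold=3)
results = [counter.record_error() for _ in range(3)]
```

In this sketch the counter reports only that three error events have accumulated. Whether they occurred within one second or over several months cannot be determined from the stored state, and a reset discards the history entirely.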
Unfortunately, simple counters do not provide sufficient information to accurately determine cause and effect for error events. For example, a host adapter may signal an error event due to data corruption because of a faulty processor which signals its own error event. However, examining the error event counters for the processor and the host adapter may not indicate such a relationship. In addition, simple counters are generally stored in non-persistent memory such that power cycling the memory effectively resets the counters. Consequently, relationships between error events that span power cycles are undetectable.
In addition, simple counters are traditionally maintained as data within software code implementing error recovery. Consequently, adding or removing counters related to new error events requires modification of the error recovery code and vice versa. Furthermore, traditional counter thresholds are hard coded and not configurable by an end-user. Typically, this is due to hardware warranty concerns as well as a concern that the end-user does not have an appreciation of the proper level for a particular counter threshold.
The static nature of traditional counters, and their lack of relationships to each other or to specific points in time, severely limit the ability of traditional counters to indicate correlations between error events on different hardware components and/or on a single hardware component over time. Relating error events to time may permit reasonable explanations for counters exceeding thresholds. For example, maintenance may have been performed when the counter threshold was crossed. Consequently, the error event may be human-caused rather than the result of an actual hardware failure.
From the foregoing discussion, it should be apparent that a need exists for an apparatus, system, and method that facilitates monitoring and responding to error events. Beneficially, such an apparatus, system, and method would count error events in relation to time, count a plurality of error events in relation to a single hardware component, provide different threshold levels that trigger different degrees of error recovery, preserve certain counters between resets and system failures, permit end-users to adjust threshold levels indirectly based on predefined historical and empirical information, and permit counting of error events in relation to a sliding window of time.
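As a non-limiting illustration of two of the desired properties named above, counting error events in relation to a sliding window of time and providing different threshold levels that trigger different degrees of error recovery, one possible sketch follows. It is an assumption made for explanatory purposes and is not the claimed apparatus; the class name `SlidingWindowCounter`, the recovery level names, and the numeric values are all hypothetical.

```python
from collections import deque


class SlidingWindowCounter:
    """Hypothetical sketch: count error events within a sliding time window."""

    def __init__(self, window_seconds, thresholds):
        self.window_seconds = window_seconds
        # thresholds maps a minimum event count to a recovery level name;
        # both the counts and the names are illustrative assumptions.
        self.thresholds = sorted(thresholds.items())
        self.events = deque()  # timestamps of events still inside the window

    def record_error(self, timestamp):
        """Record an event at `timestamp` (seconds); return the recovery level."""
        self.events.append(timestamp)
        # Discard events that have aged out of the window.
        while self.events and self.events[0] <= timestamp - self.window_seconds:
            self.events.popleft()
        return self.recovery_level()

    def recovery_level(self):
        """Return the highest level whose threshold is met, or None."""
        level = None
        for minimum, name in self.thresholds:
            if len(self.events) >= minimum:
                level = name
        return level


window = SlidingWindowCounter(window_seconds=60, thresholds={2: "retry", 4: "fence"})
levels = [window.record_error(t) for t in (0, 10, 20, 30, 100)]
```

Here four error events within 60 seconds escalate from no action to the hypothetical "retry" level and then to "fence", whereas the isolated event at 100 seconds triggers nothing because the earlier events have left the window. Persisting the timestamps to non-volatile storage, which this sketch omits, would be one way to preserve the counter between resets.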