Host processor systems may store and retrieve data using storage devices (also referred to as storage arrays) containing a plurality of host interface units (host adapters), disk drives, and disk interface units (disk adapters). Such storage devices are provided, for example, by EMC Corporation of Hopkinton, Mass. The host systems access the storage device through a plurality of channels provided therewith. Host systems provide data and access control information through the channels of the storage device and the storage device provides data to the host systems also through the channels. The host systems do not address the disk drives of the storage device directly, but rather, access what appears to the host systems as a plurality of logical volumes. Different sections of the logical volumes may or may not correspond to the actual disk drives. The hosts, storage devices and/or other elements, such as switches and/or array components, may be provided as part of a storage area network (SAN).
Operating characteristics of the storage devices and/or other elements of the SAN may be monitored according to different performance statistics and measures. Operating characteristics may include, for example, performance data, capacity data, and/or discovery data, including configuration data and/or topology data, among other characteristics. As an example, operating characteristics of input/output (I/O) data paths among storage devices and components may be measured and may include I/O operations (e.g., measured in I/Os per second and Mb per second) initiated by a host that result in corresponding activity in SAN fabric links, storage array ports and adapters, and storage volumes. Such characteristics may be significant factors in managing storage system performance, for example, in analyzing the use of lower-performance disk drives versus more expensive, higher-performance disk drives in a SAN, or in expanding the number of SAN channels or the channel capacity. Users may balance performance, capacity and costs when considering how and whether to replace and/or modify one or more storage devices or components. Other characteristics may similarly be measured, including characteristics for types of distributed systems other than storage systems.
Known techniques and systems for performing root cause and impact analysis of events occurring in a system may provide automated processes for correlating the events with their root causes. Such automation techniques address issues of an outage causing a flood of alarms in a complex distributed system comprising many (e.g., thousands of) interconnected devices. Reference is made, for example, to: U.S. Pat. No. 7,529,181 to Yardeni et al., entitled “Method and Apparatus for Adaptive Monitoring and Management of Distributed Systems,” that discloses a system for providing adaptive monitoring of detected events in a distributed system; U.S. Pat. No. 7,003,433 to Yemini et al., entitled “Apparatus and Method for Event Correlation and Problem Reporting,” that discloses a system for determining the source of a problem in a complex system of managed components based upon symptoms; U.S. Pat. No. 6,965,845 to Ohsie et al., entitled “Method and Apparatus for System Management Using Codebook Correlation with Symptom Exclusion,” that discloses a system for correlating events in a system and provides a mapping between each of a plurality of groups of possible symptoms and one of a plurality of likely problems in the system; and U.S. Pat. No. 5,528,516 to Yemini et al., entitled “Apparatus and Method for Event Correlation and Problem Reporting,” that discloses a system for efficiently determining the source of problems in a complex system based on observable events, all of which are incorporated herein by reference. It is noted, however, that such known techniques and systems may, in some circumstances, involve maintaining a large hierarchical relationship structure of faults and alerts, which may cause undesirable performance bottlenecks and require increasingly complex computations as the network topology grows, and which may in turn result in system performance degradation.
Accordingly, it would be desirable to provide a system that may be advantageously and efficiently used to identify faults and determine alerts in a network or other system topology with improved specificity, particularly as the system topology grows in size and complexity.