A Storage Area Network (SAN) is a dedicated high-speed network connecting multiple storage servers (hosts) to multiple storage devices. The SAN model creates a pool of storage that can be shared by multiple consumers, consolidating heterogeneous storage resources across an enterprise. Communications within the SAN are typically optimized for carrying input/output (I/O) traffic between the storage servers and the storage devices and, possibly, among the storage devices themselves without intervention of the server. Application traffic is generally handled by a separate messaging network, such as a LAN or WAN.
Large SANs may include thousands of different inter-related logical and physical entities. When an application performance problem is detected and reported, either by the user of an application or by an automatic monitoring tool, the root cause of this performance problem can be anywhere in the system, including the SAN, LAN, storage server, database, application server, or client machine. Some currently-available management tools monitor the performance of individual components in the SAN and report deviations from normative behavior to the system manager; normative behavior is usually defined in terms of performance thresholds on the operational values of the components' performance metrics. For example, the IBM TotalStorage Productivity Center for Fabric (formerly known as the IBM Tivoli SAN Manager) provides functions such as automatic resource and topology discovery, monitoring and alerts, zone control, and link-level error prediction. The system administrator, however, is expected to determine the relationships between the reported deviations (which may be scattered throughout the system) and the performance problems detected at the application level, based on his or her knowledge of the system. Although this approach may be feasible for small SANs, it becomes intractable as SAN size grows.
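The per-component threshold monitoring described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited tools: the `Threshold` record, the `find_deviations` function, and the component and metric names are all hypothetical, and a simple upper limit per metric is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical upper limit for one metric of one SAN component."""
    component: str
    metric: str
    max_value: float


def find_deviations(samples, thresholds):
    """Return sorted (component, metric, value) tuples that exceed their limit.

    `samples` maps (component, metric) to the latest measured value.
    Samples with no configured threshold are ignored.
    """
    limits = {(t.component, t.metric): t.max_value for t in thresholds}
    deviations = []
    for (component, metric), value in samples.items():
        limit = limits.get((component, metric))
        if limit is not None and value > limit:
            deviations.append((component, metric, value))
    return sorted(deviations)
```

Each deviation is reported in isolation. Nothing in such a check relates a deviating switch port to the application that is slow, which is the correlation work left to the administrator.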
Preliminary attempts at developing automatic tools for finding the root cause of network performance problems have been described in the patent and technical literature. For example, U.S. Patent Application Publication US 2002/0083371 A1, whose disclosure is incorporated herein by reference, describes a method for monitoring performance of a network using topology information that identifies interconnections and interdependencies among network components. Based upon the topology information and various forms of mapping information, a user is able to navigate through a Web-based user interface to determine root causes of network problems.
U.S. Patent Application Publication US 2004/0103181 A1, whose disclosure is incorporated herein by reference, describes a performance manager and method based on a system model that includes measured entities representing the operational characteristics of the system components and relationships among the measured entities. The performance manager uses an interaction model to determine the most relevant entities in the system model affecting the system performance. An operator reviews the relevant entities and applies controls to selected entities to manage the overall system performance and to resolve problems affecting the components in the system.
Kochut et al. present a three-stage performance management algorithm in “Management Issues in Storage Area Networks: Detection and Isolation of Performance Problems,” IFIP/IEEE Ninth International Network Operations and Management Symposium (NOMS '04, Seoul, Korea, March 2004), pages 593-604, which is incorporated herein by reference. The authors extend the static dependency map of the SAN topology into the host server. The first stage of the algorithm establishes the baseline performance of the SAN as viewed from the logical volumes of the host. In the second stage, the system is monitored continuously, and the monitoring data are parsed for performance degradation at the logical volume. The final stage merges the identification of suspected volumes with the dependency map to isolate a subset of the SAN where contention may be occurring.
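The three stages summarized above can be sketched in Python as follows. This is an illustration under stated assumptions, not the algorithm as published by Kochut et al.: the function names, the mean-plus-k-standard-deviations degradation test, and the use of a set intersection over the dependency map are all choices made here for the example.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Stage 1: per-volume (mean, standard deviation) of response time.

    `history` maps a logical volume name to a list of past measurements;
    each list needs at least two values for the standard deviation.
    """
    return {vol: (mean(vals), stdev(vals)) for vol, vals in history.items()}


def degraded_volumes(baseline, current, k=3.0):
    """Stage 2: volumes whose current value exceeds mean + k * stdev.

    The factor `k` is an assumed tuning parameter.
    """
    suspects = set()
    for vol, value in current.items():
        if vol in baseline:
            avg, dev = baseline[vol]
            if value > avg + k * dev:
                suspects.add(vol)
    return suspects


def isolate_contention(dependency_map, suspects):
    """Stage 3: SAN components on which every suspected volume depends.

    `dependency_map` maps a volume to the components on its I/O path.
    """
    if not suspects:
        return set()
    paths = [set(dependency_map[vol]) for vol in suspects]
    return set.intersection(*paths)
```

For example, if two volumes degrade at the same time and their I/O paths share only one switch, the intersection narrows the search to that switch. A real implementation would have to handle volumes with several paths and suspects that share no component.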