In modern data centers, business data is often stored within networked storage systems. These storage systems, typically disk arrays, are usually connected to a fast, reliable and low latency Storage Area Network (SAN). Servers needing access to this data may also be connected to the SAN using a Host Bus Adapter (HBA), Network Interface Card (NIC), or other similar adapter or interface device (generally referred to herein as an Input/Output (I/O) controller). The disk arrays in the SAN can be presented as Small Computer System Interface (SCSI) disks to the Operating System (OS). The SCSI disks are, in turn, presented up to an application running in the server either as a File System or as a raw disk device. The OS and applications running on the server may access the SAN storage array as a disk connected to the server.
In today's increasingly data-driven and competitive business environment, fast, efficient, error-free storage and retrieval of data is often critical to business success. The use of SANs has become widespread as the ability to store and retrieve massive amounts of data from a large number of storage devices over a large geographic area is now becoming a business necessity. Not surprisingly, the ability to quickly identify and fix problems and bottlenecks in storing and retrieving data across a SAN is a goal of any such storage system.
However, SAN errors and bottlenecks are often difficult to diagnose, and can be caused by subtle interactions with seemingly unrelated devices. For example, in the vast majority of installations today, the SCSI protocol is layered on top of the network protocol. As a result, the OS issues SCSI commands to the storage array to access data and control the storage arrays. Two types of commands are issued to the storage array. Data commands (e.g., read, write, report Logical Unit Number (LUN or, more simply, LU), and the like) are issued by the OS to access data stored in the storage array. Task management commands (e.g., target reset, LUN reset, etc.) are issued to control the command queues of the storage system. The task management commands issued by one server to a storage array can affect data access commands from another server to the same storage array. This also means that the action of one server connected to a storage array can cause an error on another server connected to the same storage array.
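The cross-server effect described above can be illustrated with a minimal sketch in Python. It is a toy model rather than a SCSI implementation: the class name LogicalUnitQueue and its methods are hypothetical, and the sketch assumes, for illustration only, that a LUN reset clears every command queued at the logical unit regardless of which server submitted it.

```python
class LogicalUnitQueue:
    """Toy model of a command queue shared by all servers using one LU.

    Hypothetical illustration only; real storage arrays and SCSI stacks
    are far more involved.
    """

    def __init__(self):
        # Each entry is (server_name, command_name), in arrival order.
        self.pending = []

    def submit(self, server, command):
        """Queue a data command (e.g., read or write) from a server."""
        self.pending.append((server, command))

    def lun_reset(self, issued_by):
        """Model a LUN reset task management command.

        Assumption: the reset empties the whole queue. The return value
        lists the commands from OTHER servers that were lost as a side
        effect, which is where the unexpected errors would surface.
        """
        aborted = self.pending
        self.pending = []
        return [(server, cmd) for server, cmd in aborted if server != issued_by]


queue = LogicalUnitQueue()
queue.submit("serverA", "read")
queue.submit("serverB", "write")
collateral = queue.lun_reset(issued_by="serverA")
```

In this sketch, collateral holds the write command from serverB, which never issued a reset and would only observe a failed or timed-out command.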
As in all networks, discovery- and link-related events can occur in the SAN that lead to availability and/or performance problems. Due to the complexity of the SAN protocols involved and the size of the SANs in today's world-class data centers, it is essential to have tools to help quickly identify the root cause of any network or storage connectivity issues. However, the solutions that exist today, under the umbrella of SAN management software, do not provide the information required to determine root causes quickly, if at all.
When a network problem is detected, some existing solutions can allow a server configured as an initiator to enter a debug or diagnostics mode. In such a diagnostics mode, agents can be employed to collect massive amounts of counter information and protocol event data (SCSI events, Fibre Channel (FC) events, discovery events, and the like) related to the fabric and target at each HBA and store the collected data in a system log file. However, the counters only provide information about the performance of a particular HBA or switch port (e.g., the amount of data passing through an HBA, the number of data packets sent, received, etc.), but do not provide a “big picture” of what is happening in the overall network. Counters are good at showing trends but are not effective, and sometimes misleading, when attempting to determine root causes of SAN availability or performance issues. The high-level event data is also generally limited to information about events seen at a particular HBA or switch port (e.g., a notification that a network component was inserted or removed, etc.), but as with the counter information, it does not provide a “big picture” of what is happening in the overall network. High-level events are also problematic in helping determine root causes, because they can often be intentionally induced by the end-user, or simply a symptom of a problem created by a root cause existing elsewhere.
Furthermore, this type of intensive data collection represents an overhead burden that affects the performance of the system because the system is still operating while the massive amount of event data is being collected. In addition, the mere act of operating in a diagnostics mode can mask the problem. Moreover, the mere collection of data does not provide any insight into the problem. The system log file must be reviewed, and the data collected at the time the performance issues were occurring must be interpreted in an attempt to diagnose the problem.
Today's SAN Management tools rely on counter and event information because that is all that is available to them. Protocol information (e.g., network protocol and SCSI protocol information) is much more valuable for uncovering root causes, but this information is typically locked up in the network devices and never exposed.
Some existing network diagnostics tools do not require special hardware placed at various locations throughout the network. Such tools communicate with the fabric switches (each of which has a Simple Network Management Protocol (SNMP) agent running inside it) using SNMP, and gather high level counter data (e.g., how many bytes have been transmitted in the last hour, the number of read commands in the last hour, etc.). However, this data is generally uninteresting, because the fabric is usually able to move all Input/Output (I/O) commands being demanded of it. Furthermore, when events happen at the fabric or process level, an endpoint (initiator or target) may stop sending commands. The resulting lack of activity detected by the counters indicates there may be a problem, but the type of problem is unknown.
Other existing SAN diagnostics tools require special hardware (e.g. deep analyzers) to be placed at various locations around the network to collect data and generate reports. Often, because this hardware is expensive, a single (or a few) hardware analyzer(s) must be moved around from HBA to HBA to gather needed data. However, the data collected by such hardware solutions also cannot develop a big picture of the network.
Some network switches have an option where a port can be directed to send information to analyzer hardware within the switch. Additional hardware external to the switch then encapsulates the information into Ethernet frames that can be read with dedicated software. This type of hardware solution represents another hardware add-on that provides for the viewing of lower level protocol items. It does this by extracting portions of the packets that the switch may not normally extract for the purpose of collecting the information, and does so only on a single port at a time. After the initiator stack obtains this information from the target and fabric, the information can be interpreted. However, in response to this information, the initiator can only control its own operation (e.g., not send as much data, try another route, etc.). Moreover, the initiator does not keep a “scorecard” of this information for diagnosing network performance issues.
In addition to the SAN diagnostics tools mentioned above, current HBA management tools can also provide some diagnostics capabilities. For example, Emulex Corporation's HBAnyware™ management suite, in its current configuration, keeps track of how HBAs are performing and how they are configured, enables HBAs to be configured remotely, and allows reports to be sent to remote locations on the network. HBAnyware™ is disclosed in U.S. application Ser. No. 10/277,922, filed on Oct. 21, 2002, the contents of which are incorporated herein by reference. The functionality of HBAnyware™ resides in HBA device drivers, but remote user space agents in the HBAs are also needed to perform the management functions.
HBAnyware™ collects configuration information about the HBAs using agents in the remote servers (HBAs) and causes the HBAs to be configured for different sizes and behaviors. HBAnyware™ communicates with the remote servers both in-band and out-of-band. With HBAnyware™, the HBA drivers in the remote servers communicate with each other to allow centralized management of the SAN and configuration of HBA hardware at a central point. For example, if HBAnyware™-compatible hardware is located somewhere in the SAN, it can be discovered by the HBAnyware™ software. Messages can be sent to and received from the HBAnyware™-compatible hardware that cause the firmware in the hardware to be updated, enable the configuration of the LUNs in the network, etc. All of this can be done from a central location rather than requiring each server to separately configure its own HBA.
HBAnyware™ can also collect some types of diagnostics information. With HBAnyware™, the agents collect data from the stack, but only data local to the HBA (e.g., link up, link down) is collected. Counter data is collected from the HBAs, but it is generally uninteresting, and no lower level protocol events, latency data, or capacity information are collected. Moreover, HBAnyware™ does not integrate the collected information into a system view.
Therefore, there is a need to collect specific interesting negative event data, along with command latency and system capacity data, so that a picture of the operational health of the SAN can be determined and the root cause of SAN problems can be quickly identified.
Even in the absence of catastrophic SAN errors, SAN performance can be critical to business success. Therefore, reducing the time it takes to store and retrieve data across a SAN is always a goal of any such storage system.
FIG. 1 illustrates an exemplary conventional SAN 100 including a host computer 102, a fabric 104, a target 106 and one or more Logical Units (LUs) 108, which are actually logical drives partitioned from one or more physical disk drives controlled by the target's array controller. The host computer 102 includes an initiator 110 such as a Host Bus Adapter (HBA) or I/O controller for communicating over the SAN 100. A representative application 112 is shown running on the host computer 102. The fabric 104 may implement the Fibre Channel (FC) transport protocol for enabling communications between one or more initiators 110 and one or more targets 106. The target 106 acts as a front end for the LUs 108, and may be a target array (a single controller with one or more ports for managing, controlling access to and formatting of LUs), a Just a Bunch Of Disks (JBOD) (a collection of physical disks configured in a loop, where each disk is a single target and a LU), a Switched Bunch Of Disks (SBOD®), or the like. An example of a conventional target array is an EMC Symmetrix® storage system or an IBM Shark storage system.
In the example of FIG. 1, the application 112 may employ a file system protocol and may initiate read or write I/O commands 114 that are sent out of the host 102 through the initiator 110 and over the fabric 104 to target 106, where data may be read from or written to one or more of the LUs 108. When an I/O command 114 is transmitted, there is an expectation that the I/O command will be completed, and that it will be completed within a certain period of time. If the read or write operation is completed successfully, an I/O command completion notification 116 will be delivered back to the application 112. At other times, however, if a target 106 or LU 108 is overloaded or malfunctioning, the I/O command may not complete, and no I/O command completion notification 116 will be sent back to the application 112. In such a situation, the only feedback received by the application 112 may be an indication that the I/O command timed out, and a reason code providing a reason for the timeout.
To assist a SAN system administrator in identifying problem targets 106 or LUs 108 and maintaining an efficient SAN with a balanced and fair LU workload, it is desirable to know the average I/O command completion time for I/O commands sent to each LU 108 in a target 106. In particular, it would be desirable for a system administrator to receive, in a dynamic manner, continuously updated LU-specific average I/O command completion time information for each LU in each target the initiator has discovered. Such information would enable the system administrator to identify where latencies are being injected into the SAN or identify latencies that are worsening, and make adjustments accordingly. For example, if the average I/O command completion times for two different LUs 108 in the same target 106 are drastically different (e.g., greater than 25% difference) for a similar I/O pattern and RAID level, this may be an indication that the LUs are unbalanced and that there is some unfairness at the target, and that perhaps the LU loads need to be re-balanced to achieve a greater degree of fairness. On the other hand, if the average I/O command completion times for all LUs 108 at a target 106 are rising over time and becoming too high, this may be an indication that the target is receiving too many I/O requests and that more storage needs to be added so that some data can be shifted to the new target. In other words, it is desirable for the application to detect unfairness among LUs and/or overloaded conditions at a particular target.
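By way of illustration only, the per-LU averaging and the fairness comparison described above might be sketched as follows in Python. The class LuLatencyTracker, its method names, and the sample values are hypothetical. The sketch also assumes that a "25% difference" means the slower LU's average exceeds the faster LU's average by more than 25% of the faster value, which is only one possible reading.

```python
from collections import defaultdict
from itertools import combinations


class LuLatencyTracker:
    """Running average I/O command completion time per (target, LU)."""

    def __init__(self):
        self._count = defaultdict(int)
        self._mean = defaultdict(float)

    def record(self, target, lu, completion_ms):
        """Fold one completed command into the running average."""
        key = (target, lu)
        self._count[key] += 1
        # Incremental mean: avoids storing every individual sample.
        self._mean[key] += (completion_ms - self._mean[key]) / self._count[key]

    def average(self, target, lu):
        """Average completion time in ms, or None if nothing was recorded."""
        key = (target, lu)
        return self._mean[key] if key in self._count else None

    def unbalanced_pairs(self, target, threshold=0.25):
        """Return LU pairs on one target whose averages differ too much.

        Assumption: the relative difference is (slow - fast) / fast.
        """
        lus = sorted(lu for (t, lu) in self._count if t == target)
        flagged = []
        for lu_a, lu_b in combinations(lus, 2):
            a = self._mean[(target, lu_a)]
            b = self._mean[(target, lu_b)]
            fast, slow = min(a, b), max(a, b)
            if fast > 0 and (slow - fast) / fast > threshold:
                flagged.append((lu_a, lu_b))
        return flagged


tracker = LuLatencyTracker()
for lu, samples in {0: [10, 12], 1: [20, 20], 2: [12, 12]}.items():
    for ms in samples:
        tracker.record("target106", lu, ms)

suspect = tracker.unbalanced_pairs("target106")
```

With these sample values, LU 1 averages 20 ms while LUs 0 and 2 average 11 ms and 12 ms, so the pairs involving LU 1 are flagged and the pair (0, 2) is not. A fuller version would also need to confirm that the compared LUs see a similar I/O pattern and RAID level, which this sketch does not attempt.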
However, conventional fabric-attached storage solutions do not provide average I/O command completion time information for an initiator 110 and target 106 in a SAN 100, or for multiple initiators and targets in a SAN. Conventional systems either do nothing, or wait for an initial I/O command failure to occur before taking corrective action such as limiting the outstanding I/O count. The problem with this approach is that by the time the storage device provides an indication that a problem exists, it may be too late to influence the storage device or it may become very expensive to react from an application point of view.
It should be noted that for directly attached and controlled storage such as conventional parallel Small Computer System Interface (SCSI) systems where the storage is directly connected to the host without an intervening target array, tools do exist for calculating the I/O command completion time for a particular I/O command and an average I/O command completion time, such as iostat-v, sysstat version 5.0.5, © Sebastien Godard, the contents of which are incorporated by reference herein. In such systems, a statistics counter in the SCSI layer keeps track of I/O command completion times, and monitoring tools within the operating system display this parameter. However, the average I/O command completion time is merely an information-only health indicator, because directly-attached storage systems by their very nature cannot make use of this information to adjust storage allocations and improve the response times of I/O commands.
Therefore, there is also a need to compute average I/O command completion times on a per-LU, per-target basis within a fabric-attached storage system to enable a driver within a host, or a system administrator, to make adjustments to improve the efficiency of the SAN.
One of the causes of increased latency in the execution of I/O commands in SANs is the oversubscription of resources. The responsiveness of devices such as a disk array is a function of the queue depths of queues in their associated production servers and the handling capacity of their storage array ports. Therefore, reducing problems associated with the oversubscription of resources across a SAN is always a goal of any storage system.
In today's datacenters, queue depth is one of the “knobs” available to the storage administrator to balance the system. When managing queue depths, a SAN can be thought of in the same terms as many other queuing problems. The SAN has a fixed I/O handling capacity, and that capacity needs to be shared by all the applications that are demanding I/O.
Today's SAN Management solutions treat capacity as an issue of the fabric itself or of disk capacity at the array. For example, Storage Resource Management (SRM) captures and reports, separately, SAN Management data (link utilization, for example) for switches and Storage Management data (primarily storage capacity) for arrays. However, the fabric is rarely the I/O capacity bottleneck. More often, the bottleneck is either at the server or at the storage controller. At the server, I/O handling capacity depends on a number of factors, including memory availability, kernel architecture, and Central Processing Unit (CPU) power. At the storage controller, I/O handling is also dictated by a number of factors, including the system architecture, the controller front-end, the amount and speed of cache, the controller back-end, and the actual disks themselves. When there are performance issues that need to be managed with queue depths, administrators are forced to use a completely manual process today.
Managing performance issues requires an understanding of the current mapping of initiators to target ports and backend devices. In addition, it is highly desirable to understand the queue depth demand of every initiator, the I/O handling capability of the storage controllers, and the actual queue demand placed on the system by every initiator. All of this information must be put together to help understand where the performance issue is, and what areas can be leveraged to mitigate or eliminate the performance issue. Putting together this information is becoming more difficult in today's data centers. With virtual server technology, more queuing demand is placed on storage controllers by fewer initiators and servers. Further, the mapping of all the queue demand to the storage controllers is more difficult to discern and aggregate.
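As a minimal sketch of the aggregation just described, the following Python fragment sums the configured queue depth of every initiator mapped to each target port and compares the total with that port's handling capacity. The function name oversubscribed_ports, the port and host names, and all numeric values are hypothetical, and a single per-port capacity figure is an assumed simplification of the many controller factors noted above.

```python
from collections import defaultdict


def oversubscribed_ports(initiator_queue_depths, port_capacity):
    """Find target ports whose aggregate queue demand exceeds capacity.

    initiator_queue_depths: {(initiator, target_port): queue_depth}
    port_capacity:          {target_port: max outstanding commands}
    Returns {target_port: (total_demand, capacity)} for oversubscribed ports.
    """
    demand = defaultdict(int)
    for (_initiator, port), depth in initiator_queue_depths.items():
        demand[port] += depth

    return {
        port: (demand.get(port, 0), capacity)
        for port, capacity in port_capacity.items()
        if demand.get(port, 0) > capacity
    }


# Hypothetical mapping of initiators to target ports with queue depths.
mapping = {
    ("hostA", "port0"): 32,
    ("hostB", "port0"): 32,
    ("hostC", "port0"): 64,
    ("hostC", "port1"): 64,
}
capacity = {"port0": 100, "port1": 256}

report = oversubscribed_ports(mapping, capacity)
```

Here port0 carries a total demand of 128 against an assumed capacity of 100 and is reported, while port1 is not. Performing this by hand for every initiator and port is the manual process referred to above, and virtual servers sharing one initiator make the input mapping harder to assemble.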
Therefore, there is also a need to quickly and easily obtain capacity information for resources in the SAN to determine when oversubscription is becoming a problem and to initiate fixes to alleviate the oversubscription.