The introduction of storage area networks (SANs) has dramatically increased the complexity of path management components. The number of devices that can be connected to a host has increased by an order of two. Similarly, the number of paths to any given storage element has gone up by a factor of two or so. The larger number of paths and devices, spanning a much larger area, has increased the probability of failure of the hardware components in the path from the application to the storage.
The kinds of path errors encountered in a storage area network environment are rather different from those found in, for example, direct attached storage environments. Many storage area networks incorporate a dynamic multi-pathing type routing arrangement in which traffic is shared between all available paths between endpoints. One path error which can be particularly difficult to address in a storage area network environment is the intermittent hardware failure. Such a failure leads to repeated invocation of the error handling procedures of a dynamic multi-pathing system, which in turn degrades performance. Performance degrades in proportion to the frequency with which the hardware component(s) switch between failed and healthy states. Sometimes the degradation can become so severe as to make the entire system completely unusable.
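The proportional relationship described above can be illustrated with a simple arithmetic model. This is a minimal sketch only: it assumes that each failed/healthy transition costs a fixed amount of error-handling time during which no I/O completes. The function name, the parameters, and the figures used are hypothetical and are not taken from any particular multi-pathing product.

```python
def effective_throughput(nominal_iops, flaps_per_second, handling_cost_seconds):
    """Estimate throughput when a path flaps between failed and healthy.

    Assumption (illustrative only): every state transition stalls I/O for
    handling_cost_seconds, so the fraction of time lost grows linearly
    with the flap frequency until the path is stalled all of the time.
    """
    fraction_lost = min(1.0, flaps_per_second * handling_cost_seconds)
    return nominal_iops * (1.0 - fraction_lost)


# Hypothetical figures: 10,000 IOPS nominal, 50 ms of error handling per flap.
stable = effective_throughput(10_000, 0, 0.05)      # no flapping
degraded = effective_throughput(10_000, 2, 0.05)    # 2 flaps/s, about 10% lost
unusable = effective_throughput(10_000, 20, 0.05)   # path stalled continuously
```

Under these assumptions the loss scales linearly with the flap rate, and once the product of flap rate and handling cost reaches one, no useful I/O completes at all, which corresponds to the unusable condition noted above.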
As SANs have become the de facto configuration of the storage industry, a large number of different hardware vendors have provided products suitable for use in such an environment. Whilst this provides significant consumer choice, it can lead to very high heterogeneity within a given SAN environment. Disk arrays from a variety of vendors, coupled with SAN switches from different vendors, increase the heterogeneity of the network. Further, the heterogeneous hardware usually does not comply with a common standard, which increases the interdependency and the complexity of the entire configuration.
Given this situation, the host software has very limited knowledge of the complex network and therefore performs poorly when the configuration is destabilized, even transiently, for example by an intermittent hardware failure or by reconfiguration of the SAN topology. Such sporadic destabilization reduces application throughput because detection of the problem is delayed and the subsequent recovery is delayed in turn.
The throttling mechanisms that exist in the I/O subsystems of devices typically connected within a SAN are inactive type mechanisms. The throttling is directly dependent on the queue maintained by the device, and therefore the device driver will not throttle until the need for throttling is reported by the device. Also, the throttling begins only when an error is reported by the SAN because packets have been lost. A typical conventional I/O subsystem relies directly on the device queue maintained by the target device. When the device is swamped with I/O requests, the queue fills and the device drops further requests. The device driver observes an I/O timeout and throttles the requests to the device. This has been the approach in the standard SCSI (small computer system interface) driver for a long time, but the technique is inefficient because the I/O subsystem is flooded until the queue-full condition is reported by the device. Thus it is clear that such inactive techniques cannot prevent the flooding of I/O subsystems or the SAN, and therefore these known techniques cannot ensure quick recovery following destabilization of the SAN.
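The conventional behaviour described above can be sketched as a small simulation. This is an illustrative model under stated assumptions, not the logic of any actual SCSI driver: the class names, the halving back-off policy, and all numeric parameters are hypothetical, and a dropped request is treated as being noticed by the driver within the same tick rather than after a real I/O timeout.

```python
from collections import deque


class TargetDevice:
    """Hypothetical target device with a bounded request queue."""

    def __init__(self, queue_depth, service_per_tick):
        self.queue = deque()
        self.queue_depth = queue_depth
        self.service_per_tick = service_per_tick
        self.dropped = 0

    def submit(self, request_id):
        # When the queue is full the device drops the request.
        if len(self.queue) >= self.queue_depth:
            self.dropped += 1
            return False
        self.queue.append(request_id)
        return True

    def tick(self):
        # Complete up to service_per_tick queued requests.
        for _ in range(min(self.service_per_tick, len(self.queue))):
            self.queue.popleft()


class ReactiveDriver:
    """Driver that throttles only after the device has dropped requests."""

    def __init__(self, device, issue_rate):
        self.device = device
        self.issue_rate = issue_rate
        self.throttle_events = 0

    def tick(self, next_id):
        issued = self.issue_rate
        saw_drop = False
        for offset in range(issued):
            if not self.device.submit(next_id + offset):
                saw_drop = True
        if saw_drop:
            # Assumed policy: halve the issue rate once a drop is observed.
            self.issue_rate = max(1, self.issue_rate // 2)
            self.throttle_events += 1
        return issued


def simulate(ticks=10, queue_depth=8, service_per_tick=2, issue_rate=6):
    device = TargetDevice(queue_depth, service_per_tick)
    driver = ReactiveDriver(device, issue_rate)
    next_id = 0
    for _ in range(ticks):
        next_id += driver.tick(next_id)
        device.tick()
    return device.dropped, driver.throttle_events, driver.issue_rate
```

In this model the driver keeps issuing requests at its full rate until the queue has already overflowed, so requests are lost before any throttling takes place. This is the flooding that the conventional approach cannot prevent.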
Certain techniques that relate to network performance issues have been presented in the following publications.
“Creating Performance-based SAN SLAs Using Finisar's NetWisdom”, a corporate whitepaper published in 2006 by Finisar Corporation of Sunnyvale, Calif., describes a system whereby a service level agreement for a SAN is based on performance as well as uptime/availability statistics. The document suggests using metrics such as exchange completion time and queue depth to measure performance in the context of creating and assessing a service level agreement.
“NetWare Cluster Services: The Gory details of Heartbeats, Split Brains and Poison Pills”, document ID 10053882, dated 18 Feb. 2003, by Novell, Inc. of Waltham, Mass. sets out information related to Novell NetWare Cluster Services clusters. The document suggests using LAN (local area network) driver and protocol stack statistics to determine whether a bad NIC (network interface controller) is intermittently dropping packets and thereby causing a split brain condition in a clustered NetWare environment.
“Defending Against Distributed Denial-of-Service Attacks with Max-Min Fair Service-Centric Router Throttles”, Yau, D. K. Y. et al, IEEE/ACM Trans. On Networking, Vol. 13, No. 1, February 2005, pages 29-42, describes a mechanism for throttling packets at a router by monitoring the incoming traffic rate and identifying the IP address of the sender/receiver. This proposed technique for defending IP based networks against distributed denial of service (DDoS) attacks causes throttling to be triggered when the router is swamped with packets.
“Scalability of Reliable Group Communication Using Overlays”, Baccelli, F. et al, presented at IEEE Infocom, Hong Kong, 7-11 Mar. 2004, describes a throttling mechanism related to IP based networks.
The present invention has been made, at least in part, in consideration of drawbacks and limitations of conventional systems.