Modern computer data storage systems, such as storage area networks (SANs) in enterprise environments, often use Fibre Channel (FC) network technology to provide high-speed (e.g., 2 to 16 gigabits per second) data transfers. A Fibre Channel network comprises a number of ports that are connected together, where a port is any entity that actively communicates over the network (over either optical fiber or copper) and is usually implemented in a device such as disk storage or a Fibre Channel switch. The Fibre Channel protocol transports SCSI commands over Fibre Channel networks, and network topologies include point-to-point, arbitrated loop (devices in a ring), and switched fabric (devices/loops connected through switches).
As the amount of data in enterprise applications increases, the use of high availability (HA) storage networks utilizing FC devices has increased. A high availability storage framework allows transparent storage of the same data across several physically separated machines connected within a SAN or other TCP/IP network. For such systems, it is important that storage devices and data centers are able to recover efficiently and quickly from failure conditions or even routine maintenance. In a high availability storage system, two nodes provide storage services via Fibre Channel connections. When an HA active node is in a panic condition, the surviving node is expected to take over the Fibre Channel connections from the panicking node as soon as possible. The purpose is to keep providing service to the clients (initiators) at the other end of the FC network, e.g., in a VTL (Virtual Tape Library) service. Failing over services from a failing node to the surviving node must occur as quickly as possible, as this is one of the most important goals of HA systems. However, it is challenging to fail over FC connections to the surviving node.
At present, there are no standard solutions to this problem, though different brute-force methods are typically used. One approach to failing over an FC connection is to wait for the OS panic to finish, and then reboot or shut down the system. When the system is powered off, the FC devices on the system are also powered off. However, this typically takes a very long time (e.g., on the order of minutes or even hours in a server with a large kernel memory size) because the OS disk dump process during a panic may take that long, and there are no known ways to make the disk dump process any quicker. It is also possible to force a system reset and interrupt the OS disk dump process. In this case, however, the user will no longer have disk dumps, which is undesirable because the disk dump provides a great deal of information about the system memory and software execution state during the panic period. It is also one of the most important tools for helping users analyze and fix the problem that caused the panic. Another possible approach is to let firmware reset the device. However, this approach requires firmware changes and adds complexity to the firmware. Compared to standard software changes, firmware updates are generally hard to develop and maintain. Also, firmware is not an entirely reliable way to solve the problem, considering that the usage is in a high availability scenario. If only certain specific devices need to be reset, then the appropriate device drivers could be updated to support reset. However, this requires changes to each driver and also requires that the hardware support device-specific reset functionality.
With regard to the client (SCSI initiator) side and the overall topology, it is possible to configure multipath FC connections between clients and the two HA nodes. When one HA node fails, some of the clients may experience FC I/O failures, and the I/O path will then be switched to another path leading to the other HA node. However, there are no industry standards for multipath solutions in an FC VTL initiator. Furthermore, it is quite difficult to implement multipath solutions for all of the initiators, considering the many different types of operating systems and network configurations.
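The initiator-side path switching described above can be illustrated with a minimal sketch. It is a simplified model only, not an actual multipath driver: the names Path, MultipathInitiator, and PathFailedError are hypothetical, and a real implementation would sit in the initiator's operating system storage stack rather than in application code.

```python
class PathFailedError(Exception):
    """Raised when I/O on a path cannot complete (hypothetical error type)."""


class Path:
    """One FC path from an initiator to one HA node (illustrative model only)."""

    def __init__(self, node_name, healthy=True):
        self.node_name = node_name
        self.healthy = healthy

    def submit(self, command):
        # A failed HA node is modeled simply as a path that rejects all I/O.
        if not self.healthy:
            raise PathFailedError(f"I/O to {self.node_name} failed")
        return f"{command} completed via {self.node_name}"


class MultipathInitiator:
    """Sends I/O on the active path and switches paths when I/O fails."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.active = 0  # index of the path currently in use

    def submit(self, command):
        # Try each configured path at most once, starting with the active one.
        for _ in range(len(self.paths)):
            path = self.paths[self.active]
            try:
                return path.submit(command)
            except PathFailedError:
                # Switch the I/O path to the next path (the other HA node).
                self.active = (self.active + 1) % len(self.paths)
        raise PathFailedError("all paths failed")
```

In this sketch, an initiator constructed with one path per HA node keeps using the first node until an I/O fails, and then retries the same command on the second node. Logic of this kind would have to be present in every initiator, which is the difficulty noted above given the variety of operating systems and network configurations.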
What is needed, therefore, is a method of resetting FC host bus adapter devices when an operating system panic condition occurs, so that FC connections fail over as soon as possible, while still providing a disk dump process so that users continue to have crash dumps for panic analysis.
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions. EMC, Data Domain, Data Domain Restorer, and Data Domain Boost are trademarks of EMC Corporation.