1. Field of the Invention
This invention relates generally to computer storage, and more particularly to failover techniques for multi-path storage systems.
2. Description of the Related Art
Computer storage systems, such as disk drive systems, have grown enormously in both size and sophistication in recent years. These systems typically include many large storage units controlled by a complex multi-tasking controller. Large scale computer storage systems generally can receive commands from a large number of host computers and can control a large number of mass storage elements, each capable of storing in excess of several gigabytes of data.
FIG. 1 is an illustration showing a prior art computer storage system 100. The prior art computer storage system 100 includes computer systems 102, 104, and 106, and workstations 108 and 110, all coupled to a local area network 112. The computer systems 102, 104, and 106 are also in communication with storage devices 114 via a storage area network 116. Generally, the computer systems 102, 104, and 106 can be any computers operated by users, such as PCs, Macintosh computers, or Sun workstations. The storage devices 114 can be any devices capable of providing mass electronic storage, such as disk drives, tape libraries, CD drives, or RAID systems.
Often, the storage area network 116 is an Arbitrated Loop; however, the storage area network 116 can be any storage area network capable of providing communication between the computer systems 102, 104, and 106 and the computer storage devices 114. Another typical storage area network is a Fabric/switched storage area network, in which the storage area network 116 comprises several nodes, each capable of forwarding data packets to a requested destination.
In use, the computer systems 102, 104, and 106 transmit data to the storage devices 114 via the storage area network 116. The storage devices 114 then record the transmitted data on a recording medium using whatever apparatus is appropriate for the particular medium being used. Generally, the conventional computer storage system 100 operates satisfactorily until a failure occurs, and such a failure often results in data loss that can have catastrophic side effects.
It is more than an inconvenience to the user when the computer storage system 100 goes "down" or off-line, even when the problem can be corrected relatively quickly, such as within hours. The resulting lost time adversely affects not only system throughput performance but also user application performance. Further, the user is often not concerned with whether a physical disk drive or its controller has failed; it is the inconvenience and failure of the system as a whole that cause user difficulties.
As these systems grow in complexity, interrupting failures at either the device level or the controller level become increasingly undesirable. As a result, efforts have been made to make systems more reliable and to increase the mean time between failures. For example, redundancy at various levels has become a popular method of increasing reliability, and it has been applied to storage devices, power supplies, servers, and host controllers.
A problem with incorporating redundancy into a computer system is that redundancy often causes additional problems with system performance and usability. For example, if redundancy in the form of multiple drive paths to a single device is used in an attempt to increase the reliability of a conventional system, the operating system is often confused into believing two separate physical drives are available to receive storage data, when only one physical drive is actually available.
In view of the foregoing, there is a need for a method that can continue to provide access to I/O devices when a data path to an I/O device experiences a failure. The method should be able to detect the failure automatically and address it in a manner that is transparent to the user. The method should also increase system reliability without interfering with user productivity.
Broadly speaking, the present invention fills these needs by providing an intelligent failover method, which automatically detects failure and recovers by rerouting I/O requests via an alternate data path. In one embodiment, a method for intelligent failover in a multi-path computer system is disclosed. Initially, a plurality of data paths to a computer input/output (I/O) device is provided. However, instead of the user viewing multiple logical devices for the single I/O device, embodiments of the present invention represent the plurality of data paths to the computer I/O device as a single logical computer I/O device. Then, during operation, an I/O request to access the computer I/O device is intercepted. A data path from the plurality of data paths to the computer I/O device is selected, and the computer I/O device is accessed using the selected data path.
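The method steps above (intercept a request, select a data path, access the device, and reroute on failure) can be pictured with a minimal Python sketch. It assumes each data path exposes a callable that either returns the device's response or raises an error; the names `DataPath`, `LogicalDevice`, `PathFailure`, and `submit` are hypothetical and are not taken from the invention's driver, which would run inside the operating system's I/O stack rather than as user-level Python.

```python
from dataclasses import dataclass
from typing import Callable, List


class PathFailure(Exception):
    """Raised when a data path cannot complete a request (hypothetical error type)."""


@dataclass
class DataPath:
    name: str
    # Hypothetical transport: sends one request over this path and returns the
    # device's response, or raises PathFailure if the path is down.
    transport: Callable[[str], str]
    healthy: bool = True


@dataclass
class LogicalDevice:
    """Presents several data paths to one physical I/O device as a single device."""
    paths: List[DataPath]

    def select_path(self) -> DataPath:
        # Simplest assumed policy: the first path not yet marked as failed.
        for path in self.paths:
            if path.healthy:
                return path
        raise PathFailure("no healthy data path to the device")

    def submit(self, request: str) -> str:
        # Entry point for an intercepted I/O request. On a path failure, mark
        # the path unhealthy and retry on the next healthy path; select_path
        # raises once every path has failed, so the loop always terminates.
        while True:
            path = self.select_path()
            try:
                return path.transport(request)
            except PathFailure:
                path.healthy = False


def broken_link(request: str) -> str:
    raise PathFailure("link down")


def working_link(request: str) -> str:
    return f"ok:{request}"


device = LogicalDevice([DataPath("hba0", broken_link), DataPath("hba1", working_link)])
print(device.submit("read block 0"))  # ok:read block 0
```

Here the first path always fails, so the request is transparently rerouted through `hba1`, and `hba0` stays marked as failed for later requests. A real driver would also need a way to return repaired paths to service, which this sketch omits.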
In another embodiment, a system for intelligent failover in a multi-path computer system is disclosed. The system includes a processor and a computer I/O device placed in communication with the processor via a plurality of data paths. In addition, a user interface module is included that is in communication with the plurality of data paths. The user interface module is used to represent the plurality of data paths to a user as a single logical computer I/O device. The user interface module can also be used to configure the failover system to fit a particular use or hardware configuration. The system also includes a failover filter driver that is in communication with the plurality of data paths. In operation, the failover filter driver selects a particular data path from the plurality of data paths to access the computer I/O device for intercepted I/O requests.
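One way to picture how a user interface module could present several data paths as one logical device is to group discovered paths by an identifier that the physical device itself reports, such as a serial number. The sketch below assumes every path reports such an identifier and that it is unique per physical device; the path names and serial numbers are made up for illustration, and `DiscoveredPath` and `group_paths` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class DiscoveredPath(NamedTuple):
    path_id: str    # hypothetical path name, e.g. controller/target/LUN
    device_id: str  # identifier reported by the device, assumed unique per physical device


def group_paths(discovered: List[DiscoveredPath]) -> Dict[str, List[str]]:
    """Collapse paths that reach the same physical device into one logical device entry."""
    logical: Dict[str, List[str]] = defaultdict(list)
    for p in discovered:
        logical[p.device_id].append(p.path_id)
    return dict(logical)


paths = [
    DiscoveredPath("c0t1d0", "SN-1234"),
    DiscoveredPath("c1t1d0", "SN-1234"),
    DiscoveredPath("c0t2d0", "SN-5678"),
]
print(group_paths(paths))
# {'SN-1234': ['c0t1d0', 'c1t1d0'], 'SN-5678': ['c0t2d0']}
```

Without this grouping, the operating system would see three drives where only two physical devices exist, which is the confusion described for conventional multi-path systems.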
A failover filter driver for providing intelligent failover in a multi-path computer system is disclosed in another embodiment of the present invention. Included in the failover filter driver is an intercept code module that intercepts I/O requests to a computer I/O device from an operating system. In addition, a manual-select code module is included that selects a data path from a plurality of data paths to the computer I/O device based on data path information provided by a requesting computer application. The failover filter driver further includes an auto-select code module that selects a data path based on characteristics of each data path in the plurality of data paths to the computer I/O device.
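The split between the manual-select and auto-select code modules can be sketched as two selection functions behind one dispatcher. The source does not specify which path characteristics auto-select uses; this sketch assumes health status and a count of outstanding I/O requests, and the names `PathInfo`, `manual_select`, `auto_select`, and `select_path` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PathInfo:
    name: str
    healthy: bool
    outstanding_ios: int  # assumed characteristic used by auto-select


def manual_select(paths: List[PathInfo], requested: str) -> PathInfo:
    """Use the path named by the requesting application, if it is usable."""
    for p in paths:
        if p.name == requested and p.healthy:
            return p
    raise LookupError(f"requested path {requested!r} is unavailable")


def auto_select(paths: List[PathInfo]) -> PathInfo:
    """Pick the healthy path with the fewest outstanding I/Os (an assumed policy)."""
    candidates = [p for p in paths if p.healthy]
    if not candidates:
        raise LookupError("no healthy data path")
    return min(candidates, key=lambda p: p.outstanding_ios)


def select_path(paths: List[PathInfo], requested: Optional[str] = None) -> PathInfo:
    # Manual selection when the application supplied path information,
    # otherwise automatic selection based on path characteristics.
    if requested is not None:
        return manual_select(paths, requested)
    return auto_select(paths)


paths = [
    PathInfo("hba0", healthy=True, outstanding_ios=5),
    PathInfo("hba1", healthy=True, outstanding_ios=2),
    PathInfo("hba2", healthy=False, outstanding_ios=0),
]
print(select_path(paths).name)          # hba1
print(select_path(paths, "hba0").name)  # hba0
```

Choosing the least-loaded healthy path is only one plausible policy; an auto-select module could equally weigh bandwidth, latency, or error history.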
Advantageously, the embodiments of the present invention provide intelligent failover in multi-path computer systems. Because data paths can fail, whether from a failed connection, a failed controller, or any other cause, the ability to detect failures automatically and reroute data to alternate paths greatly increases system reliability. Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.