1. Field of the Invention
The present invention relates to computer-based information storage systems. More particularly, the present invention relates to systems and methods for permitting a failed host bus adapter (HBA) to be repaired and replaced online, i.e., without having to shut down the host computer in which the HBA resides.
2. Background of the Invention
The increased importance of information technology in business processes has fostered increased demands for data storage systems that combine the features of high storage capacity, high reliability, efficient scalability, and cost-effectiveness. Early computer systems relied heavily on direct-attached storage (DAS) systems consisting of one or more disk drives coupled to a system bus. DAS systems were not well adapted to satisfy these demands. More recently, storage area network (SAN) technologies have been implemented. SAN architectures permit organizations to uncouple application servers from data servers to provide storage systems with greater capacity, higher reliability, and higher availability.
In operation, users access a storage system through a file system implemented in a storage system computer, typically referred to as a host computer. The term file system refers to the logical structures and software routines, usually closely tied to the operating system software, that are used to control access to storage in the system. A host computer receives requests from external devices for information stored in the storage system, processes the requests, retrieves the desired information from the storage devices, and transmits the information to the external devices. Many SANs implement a high-speed connection, e.g., a Fibre Channel (FC) connection, between the host computer and the storage devices. This connection is enabled by a Host Bus Adapter (HBA), which provides a communication connection between the host bus (typically a PCI bus) and the FC connection.
SAN systems implement redundancy to enhance the reliability of the system. For example, RAID (Redundant Arrays of Inexpensive Disks) techniques are used to enhance data storage reliability. In addition, in many SAN systems data storage devices (e.g., disk drives) are connected to redundant disk controllers by at least one high-speed data communication link, e.g., a Fibre Channel Arbitrated Loop (FCAL), to provide a network of interconnected storage devices. Further, SAN systems may implement redundant components such as power supplies, cooling modules, disk devices, temperature sensors, audible and/or visible alarms, and RAID and other controllers to increase system reliability. If a component fails, then the redundant component assumes the functions of the failed component so the storage system can continue operating while the failed component is repaired or replaced.
Host computers may include two or more HBAs for providing redundant connections between a host computer and storage devices in the SAN. If one of the HBAs fails, then the host computer's operating system redirects communications with the storage devices through an active HBA. The failed HBA may then be replaced or repaired. SANs are often implemented in computing environments that must meet stringent availability requirements. To meet these requirements, it is desirable to keep host computers operating continuously. Accordingly, it is desirable to provide systems and methods for enabling replacement of failed HBAs while the host computer remains online, i.e., operational.
The present invention addresses these and other problems by providing a storage system architecture and operating method that permits a failed host bus adapter (HBA) to be repaired and/or replaced online, i.e., without shutting down the host computer system. The present invention may be implemented in a host computer that uses a Plug-and-Play capable operating system, such as the Microsoft Windows® brand operating system, that supports the Windows Driver Model (WDM) architecture.
In one aspect, the present invention uses one or more host bus adapter (HBA) specific filter drivers and a storage device SCSI class driver to provide multi-path functionality. The filter driver intercepts responses to Plug-and-Play requests from the underlying HBA driver. These responses are modified to prevent standard Microsoft operating system SCSI class device drivers from being loaded for devices attached to the HBA. Instead, the modified responses cause a multi-path SCSI class device driver to be loaded. The filter driver also monitors the status of the paths to a device and, upon request, provides status information to the multi-path SCSI class device driver. The multi-path SCSI class driver may use this status information to decide whether to make a particular path a primary path. The actions required to make a path a primary path may be performed by the filter driver, e.g., through a function call to the filter driver initiated by the multi-path SCSI driver.
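By way of illustration only, the interception step described above may be modeled in user space as follows. This is a minimal sketch in Python, not driver code: the identifier strings, the lookup table, and the function names (hba_driver_query_ids, filter_query_ids, select_class_driver) are hypothetical and are not taken from the present disclosure or from any Microsoft interface. The sketch assumes that the operating system selects a class driver by matching the identifiers reported for a device.

```python
# Hypothetical identifier strings, chosen for illustration only.
STANDARD_DISK_ID = "GenDisk"
MULTIPATH_DISK_ID = "MPathDisk"

# Hypothetical table: which class driver the OS would load for a given ID.
DRIVER_FOR_ID = {
    STANDARD_DISK_ID: "standard SCSI disk class driver",
    MULTIPATH_DISK_ID: "multi-path SCSI class driver",
}


def hba_driver_query_ids():
    """Stand-in for the underlying HBA driver's response to a PnP ID query."""
    return ["SCSI\\DiskExampleVendor", STANDARD_DISK_ID]


def filter_query_ids(lower_response):
    """Filter step: rewrite the response on its way back up the stack.

    Every occurrence of the standard ID is replaced so that the standard
    class driver no longer matches and the multi-path driver does.
    """
    return [
        MULTIPATH_DISK_ID if dev_id == STANDARD_DISK_ID else dev_id
        for dev_id in lower_response
    ]


def select_class_driver(ids):
    """Return the first class driver whose ID appears in the response."""
    for dev_id in ids:
        if dev_id in DRIVER_FOR_ID:
            return DRIVER_FOR_ID[dev_id]
    return None


unfiltered = hba_driver_query_ids()
filtered = filter_query_ids(unfiltered)
print(select_class_driver(unfiltered))
print(select_class_driver(filtered))
```

In this model the filter never blocks the request itself; it alters only the response, which is why the underlying HBA driver requires no modification.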
The software architecture of the multi-path SCSI class driver permits the replacement of a failed redundant HBA. In addition, the multi-path SCSI class driver serves several purposes. First, it provides the storage device specific functionality required by the operating system, i.e., functionality equivalent to the corresponding Microsoft SCSI class device driver. These device specific driver requirements and interfaces are well documented in the Microsoft Windows Device Driver Development Kit (DDK). Second, the multi-path SCSI class driver implements two layers of device objects to enable multi-path functionality. The upper layer consists of a single “master” device object for each device. Beneath the master device object, at the lower layer, a “component” device object is created for each path that exists to a device. One or more component device objects are linked to a master device object. The master device object acts as a switch to route I/O to the component device object that represents an active or available path. The master device object contains logic to reroute I/O to one of the redundant paths in the event of a failure.
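The two-layer arrangement described above may be sketched as follows. This is a simplified user-space model in Python under stated assumptions, not the driver implementation: the class names (MasterDevice, ComponentDevice, PathFailed), the healthy flag, and the retry policy of trying the primary path first and then the remaining paths in order are all hypothetical.

```python
class PathFailed(Exception):
    """Raised when I/O cannot be completed on a given path."""


class ComponentDevice:
    """Lower layer: one object per path (i.e., per HBA) to a device."""

    def __init__(self, name):
        self.name = name
        self.healthy = True  # hypothetical stand-in for path status

    def send(self, request):
        if not self.healthy:
            raise PathFailed(self.name)
        return f"{request} via {self.name}"


class MasterDevice:
    """Upper layer: a single object per device that switches I/O between paths."""

    def __init__(self, components):
        self.components = list(components)
        self.primary = self.components[0]

    def send(self, request):
        # Try the primary path first, then each redundant path in order.
        candidates = [self.primary] + [
            c for c in self.components if c is not self.primary
        ]
        for component in candidates:
            try:
                result = component.send(request)
            except PathFailed:
                continue
            self.primary = component  # the path that worked becomes primary
            return result
        raise PathFailed("no available path")


path_a = ComponentDevice("hba0")
path_b = ComponentDevice("hba1")
disk = MasterDevice([path_a, path_b])
print(disk.send("read"))
path_a.healthy = False  # simulate failure of the first HBA
print(disk.send("read"))
```

Callers address only the master object, so a path failure is visible to them only if every path has failed.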
The master device object is not placed in the PnP device stack of an HBA. This allows the device stack associated with any path to be removed from the component device object down, while maintaining a persistently present device (i.e., the master device object) to upper levels of the operating system.
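The effect of keeping the master device object outside any HBA device stack may be illustrated with the following sketch. It is a hypothetical Python model, not driver code: the names MasterDevice, ComponentDevice, add_path, and remove_path are assumptions introduced here, and the removal of a device stack is modeled simply as deleting an entry from a dictionary.

```python
class ComponentDevice:
    """Represents one path; it belongs to the device stack of a single HBA."""

    def __init__(self, hba):
        self.hba = hba

    def send(self, request):
        return f"{request} via {self.hba}"


class MasterDevice:
    """Persistent object seen by upper layers; it is in no HBA's stack."""

    def __init__(self, name):
        self.name = name
        self.components = {}  # HBA name -> ComponentDevice
        self.active = None

    def add_path(self, hba):
        self.components[hba] = ComponentDevice(hba)
        if self.active is None:
            self.active = hba

    def remove_path(self, hba):
        # Tear down the path from the component object down; the master stays.
        del self.components[hba]
        if self.active == hba:
            self.active = next(iter(self.components), None)

    def send(self, request):
        if self.active is None:
            raise RuntimeError("no path available")
        return self.components[self.active].send(request)


disk = MasterDevice("disk0")
disk.add_path("hba0")
disk.add_path("hba1")
disk.remove_path("hba0")             # failed HBA is removed online
print(disk.send("write"))            # I/O continues on the remaining path
disk.add_path("hba0-replacement")    # replacement HBA is added online
print(sorted(disk.components))
```

Throughout the sequence the same master object remains in place, which corresponds to upper levels of the operating system continuing to see a persistently present device.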