Attached as Appendix A to provisional application '659 is a document entitled “Multiplexed I/O (MPXIO)”, which gives implementation details for an embodiment of the invention. Also attached to provisional application '659, as Appendices B and C, are manual pages (man pages) that would be suitable for a UNIX (or other OS) implementation of the new MPXIO architecture. The U.S. Provisional Application No. 60/257,210, with its Appendices, is incorporated herein by reference.
This invention relates to a new system architecture providing multiple input/output (I/O) paths to client devices, such as storage devices, in a processor-based system or network.
As more systems use storage area networks (SANs), environments are created wherein multiple hosts are communicating with a given storage device. In both uniprocessor and multiprocessor settings, multiple paths are formed to the same storage device. These multiple paths can provide greater bandwidth, load balancing, and high availability (HA).
In I/O architectures currently in use, such multiple paths to storage devices may be provided as illustrated in the storage area network of FIG. 1. In this figure, a host system 10 is of conventional design, using a processor 20, memory 30, and other standard components of a computer system (such as display, user input devices, and so on). The system 10 also typically includes one or several host bus adapters (HBAs) such as HBAs 40 and 50, which communicate via switches 60 and 70 with storage devices 80 and 90, respectively. Alternatively, the storage devices may be multiported, in which case the switches may not be used.
Software layers 100 are used by the host 10, and as shown in FIG. 3, in systems currently in use a common architecture layer 110 may be provided above the HBA layer, such as applicant Sun Microsystems, Inc.'s “SCSA” (Sun Common SCSI Architecture). Above this layer are device drivers (such as applicant's “SSDs”, i.e. Sun Microsystems, Inc.'s SCSI disk drivers) 120 and 130. More specifically, these drivers 120 and 130 are in this example different instances of the same device driver.
Above the device driver layer is a metadriver (MD) 140. When the host 10 sends an I/O request to, e.g., storage device 80 (storage 90 being omitted from FIG. 3 for simplicity), the request is sent through the metadriver 140 to the drivers 120 and 130. If one of the paths to a storage device fails (e.g. path 82 or 84 to storage 80, or path 92 or 94 to storage 90), then it will be necessary to execute the I/O request via a path that has not failed.
In the case of symmetric storage devices, the paths may easily be load balanced, and failover for an I/O request is accomplished simply by using the non-failing path. For asymmetric devices, the system must be informed that the first path has failed. For instance, in FIG. 2 if a write command is sent via the metadriver 140 through driver 120 and SCSA layer 110 to HBA 40, and it turns out that path 82 to storage 80 fails, then this is communicated back up to the driver 120, which will typically execute additional tries. Each try may be very time-consuming, taking up to several minutes to execute. If path 82 has failed, this is wasted time; eventually, the driver 120 stops retrying, and the metadriver 140 will try the other path. Assuming path 84 is operational, the I/O attempt via driver 130 and HBA 50 will succeed.
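The cost of retrying along a failed path can be illustrated with a minimal sketch. It is not taken from any actual driver: the `Path` class, the `layered_io` function, the retry count, and the 60-second cost per failed attempt are all hypothetical values chosen for illustration. The sketch models a per-path driver that exhausts its own retries before the metadriver is allowed to try the next path.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    healthy: bool
    retry_cost_s: float  # assumed time consumed by one failed attempt


def layered_io(paths, max_retries):
    """Try each path in order; the per-path driver retries before giving up.

    Returns (name of the path that succeeded, or None; seconds spent on
    failed attempts).
    """
    wasted = 0.0
    for path in paths:
        # One initial attempt plus max_retries retries by the underlying driver.
        for _ in range(1 + max_retries):
            if path.healthy:
                return path.name, wasted
            wasted += path.retry_cost_s
        # Only at this point does the metadriver move on to the next path.
    return None, wasted


# Hypothetical scenario: path 82 has failed, path 84 is operational.
paths = [Path("path 82", False, 60.0), Path("path 84", True, 60.0)]
used, wasted = layered_io(paths, max_retries=3)
print(used, wasted)
```

Under these assumed numbers the request eventually succeeds on path 84, but only after four failed attempts on path 82 have consumed 240 seconds.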
Such a system suffers from a number of inefficiencies, chief among them the time wasted retrying the I/O request along a failed path. A system is needed that eliminates such inefficiencies, and in particular that allows I/O requests to be retried more quickly along a working path.
Issues with Using Multiple Driver Instances
An issue that arises in connection with multipath devices is the structure of the Solaris (or other OS) device tree and the device autoconfiguration process. The OS device tree enumerates physical connections to devices; that is, a device instance is identified by its connection to its physical parent. This is in part due to the bottom-up device autoconfiguration process as well as the lack of self-enumeration support in the I/O controllers available at the time this framework was initially designed.
The presence of multiple device instances for a single device can lead to various issues. One of these is the waste of system resources: because each path to a device is assigned a unique device instance and name, every additional path consumes system namespace and other resources. Thus, as the number of HCIs (host controller interfaces) to common pools of devices increases, the number of devices that can be hosted decreases. The minor number space available today for “sd” (SCSI disk) and “ssd” (which refers, e.g., to fibre channel SCSI disk device drivers) devices limits the Solaris OS to 32K single-pathed drives. Each additional path to a pool of devices decreases this by a factor of 2.
Each duplicate instance wastes kernel memory in the form of multiple data structures and driver soft states. Inodes in the root file system are also wasted on the duplicated /devices and /dev entries.
Another issue is that system administrators, as well as applications, face challenges when attempting to understand and manage multipath configurations in the OS. Such challenges include:
1. prtconf(1m): Since prtconf displays the structure of the OS device tree, it lists each instance of a multipath device. There is currently no way for a system administrator to quickly determine which devices in the output are in fact the same device. Also lacking is the identity of the layered driver that is “covering” this device and providing failover and/or load balancing services.
2. Lack of integration with DR (dynamic reconfiguration): DR has no way of knowing whether a device is attached to multiple parent devices; it is left up to the system administrator to identify and offline all paths to a given device. Some of the layered products (e.g., DMP, or dynamic multipathing, products) actually prevent DR from occurring, as they hold the underlying devices open and do not participate in the DR and RCM (reconfiguration coordination manager) framework.
3. Multiple names and namespaces in /dev: Each instance of a multipath disk device appears in /dev with a distinct logical controller name; the system administrator needs to be aware that a given device has multiple names, which can lead to errors during configuration or diagnosis. In addition, layered products define additional product-specific namespaces under /dev to represent their particular multipath device, e.g. /dev/ap/{r}dsk/*, /dev/dmp/{r}dsk/*, /dev/osa/{r}dsk/*, etc. Both administrators and applications need to be aware of these additional namespaces, and need to know that the multi-instance names in /dev may be under the control of a layered driver.
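The first challenge above can be made concrete with a short sketch. It assumes, purely for illustration, that each listed instance could be paired with a unique device identifier; the conventional device tree described above exposes no such pairing, and the instance names, the `device_id` values, and the `group_by_device` function are all hypothetical.

```python
from collections import defaultdict

# (instance name, hypothetical unique device identifier)
instances = [
    ("ssd0", "disk-A"),
    ("ssd1", "disk-B"),
    ("ssd2", "disk-A"),  # second path to the same device as ssd0
    ("ssd3", "disk-B"),  # second path to the same device as ssd1
]


def group_by_device(instances):
    """Collect the instance names that refer to the same underlying device."""
    groups = defaultdict(list)
    for name, device_id in instances:
        groups[device_id].append(name)
    return dict(groups)


groups = group_by_device(instances)
print(groups)
```

Without an identifier of this kind, the four instances appear as four unrelated devices, and the grouping must be worked out by hand.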
Another issue that arises from the use of layered drivers has to do with their statefulness. The layered driver approach becomes significantly more difficult to implement once stateful drivers such as tape drivers are deployed in multipath configurations. Driver state (such as tape position) needs to be shared between the multiple instances via some protocol with the upper layered driver. This exposes an additional deficiency of using layered drivers for multipath solutions: a separate layered driver is needed for each class of driver or device that needs to be supported in these configurations.
Issues with Failover Operations
Yet another issue is that of failover/error management. Layered drivers communicate with the underlying drivers via the buf(9s) structure. The format of this structure limits the amount of error status information that can be returned by the underlying driver and thus limits the information available to the layered driver to make proper failover decisions.
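The effect of limited error status on failover decisions can be sketched as follows. This is a simplified model rather than the buf(9s) interface itself: the error status is represented as a bare error number, and the `Cause` enumeration, the `DetailedStatus` class, and both decision functions are hypothetical names introduced only for illustration.

```python
import errno
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    PATH_FAILURE = "path"
    DEVICE_FAILURE = "device"


@dataclass
class DetailedStatus:
    """Hypothetical richer status that also records where the failure occurred."""
    error: int
    cause: Cause


def decide_from_errno(error):
    # With only an error number, a path failure and a device failure look
    # alike, so the layered driver can do no better than try another path.
    return "ok" if error == 0 else "retry_other_path"


def decide_from_status(status):
    if status.error == 0:
        return "ok"
    if status.cause is Cause.PATH_FAILURE:
        return "retry_other_path"
    # The device itself failed: another path to it would fail as well.
    return "fail_request"


print(decide_from_errno(errno.EIO))
print(decide_from_status(DetailedStatus(errno.EIO, Cause.DEVICE_FAILURE)))
```

In this model the same error number leads to a retry on another path in the first case, even when the device itself has failed, while the richer status allows the request to be failed immediately.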
In addition, the handling of failover operations by a system such as that shown in FIG. 1 can present other challenges. Switches 60 and 70 are multiport switches, providing redundant paths to storage 80 (paths 82 and 84) and storage 90 (paths 92 and 94). If path 86 to switch 60 fails, the system needs to activate path 96. The activation will be a different operation for storage device 80 than for storage device 90, since these will in general be different types of storage devices.
An efficient way of activating paths common to different storage devices, such as when a failover operation is executed, is thus needed.