Many modern commercial computing applications rely upon large amounts of data storage. This is used, for example, to maintain customer records, invoices, warehouse inventory details, and various other forms of information. Accordingly, it is important that the storage facilities of such computer systems operate quickly and reliably.
FIG. 1 is a schematic block diagram of a typical known computer system 5 with a storage facility. The system 5 includes a server 10 and two host bus adapters (HBAs) 12A and 12B, which are linked to the server by host bus 11A and host bus 11B respectively. Various disk storage units are then attached to the two HBAs. Thus disk units 15A and 15B are attached to HBA 12A by links 14A and 14B respectively, and disk units 15C, 15D and 15E are attached to HBA 12B by links 14C, 14D and 14E respectively. Links 14A, B, C, D, and E typically use fiber channel or small computer system interface (SCSI) connections. Note that one of the HBAs (for example, HBA 12A) may utilize fiber channel, while another HBA (HBA 12B) may utilize a different protocol, such as SCSI.
It will be appreciated that there is a very wide range of possible configurations whereby storage units can be attached to server 10, and that FIG. 1 illustrates only a single such possibility. For example, there may be fewer or more HBAs for a given server, and the number of disk units attached to any given HBA is likewise variable. Furthermore, some systems may utilize tape storage in addition to, or instead of, disk storage, in which case these tape units are also connected to server 10 via an appropriate HBA.
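By way of illustration only, the topology of FIG. 1 can be captured in a small data model. The following is a minimal sketch in Python; the class names, field names and the helper build_fig1_system are hypothetical and are used here purely to make the server/HBA/disk relationships explicit. The protocol assigned to each HBA follows the example given above, in which HBA 12A utilizes fiber channel and HBA 12B utilizes SCSI.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DiskUnit:
    label: str  # reference numeral of the disk unit, e.g. "15A"
    link: str   # reference numeral of the link to its HBA, e.g. "14A"


@dataclass
class HostBusAdapter:
    label: str     # reference numeral of the HBA, e.g. "12A"
    host_bus: str  # host bus joining the HBA to the server, e.g. "11A"
    protocol: str  # link protocol (illustrative, per the example in the text)
    disks: List[DiskUnit] = field(default_factory=list)


@dataclass
class Server:
    label: str
    hbas: List[HostBusAdapter] = field(default_factory=list)

    def disks_by_hba(self) -> Dict[str, List[str]]:
        """Map each HBA label to the labels of the disk units attached to it."""
        return {hba.label: [d.label for d in hba.disks] for hba in self.hbas}


def build_fig1_system() -> Server:
    """Build the single example configuration shown in FIG. 1."""
    hba_a = HostBusAdapter(
        "12A", "11A", "fiber channel",
        [DiskUnit("15A", "14A"), DiskUnit("15B", "14B")],
    )
    hba_b = HostBusAdapter(
        "12B", "11B", "SCSI",
        [DiskUnit("15C", "14C"), DiskUnit("15D", "14D"), DiskUnit("15E", "14E")],
    )
    return Server("10", [hba_a, hba_b])
```

Calling build_fig1_system().disks_by_hba() yields a mapping of HBA 12A to disk units 15A and 15B, and of HBA 12B to disk units 15C, 15D and 15E.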
FIG. 2 is an illustration of the sort of software typically utilized in controlling the storage hardware configuration shown in FIG. 1. This will be described in the context of the Solaris operating system (a form of Unix environment), available from Sun Microsystems Inc. Note that further details about this operating system can be found in: “Solaris Internals: Core Kernel Architecture” by Jim Mauro and Richard McDougall, Sun Microsystems Press/Prentice Hall, 2001, ISBN 0-13-022496-0, which is hereby incorporated by reference. Again, it will be appreciated that the software components shown in FIG. 2 are illustrative only, and that the skilled person will be aware of many other possible configurations and implementations.
The software structure depicted in FIG. 2 can be regarded as a form of hierarchy or protocol stack, at the top of which sits a user application 101 (which may be one of many application programs running on the server 10). In order to access the storage units 15, the application program 101 makes the appropriate system calls. These system calls are received by the operating system (OS) 110 running on the server 10, and passed down a storage protocol stack. The stack includes a target driver 112, which is configured to understand storage units at a high level (e.g. whether they are a disk or tape unit). Next is a SCSA layer 114, which implements the Sun common SCSI architecture (note that this protocol can actually be utilized over a fiber channel link as well as SCSI). The intent of SCSA 114 is to provide a generic interface for the higher level layers to send and receive SCSI commands without having to worry about details of the particular storage configuration in any given installation.
Underneath the SCSA 114 layer is the HBA device driver 116. As its name implies, this layer comprises code that allows the server 10 to interact with the HBA cards 12A,B. More especially, the HBA device driver 116 interacts with code 120 running on the HBA card itself. Code 120 is then able to interact with the actual disk units 15 in order to perform the desired storage access operation.
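The layering described above can be pictured with a toy model in which each layer simply hands a command to the layer beneath it. The following Python sketch is illustrative only: the classes PassThroughLayer and HbaCardCode, the submit method and the string-valued commands are hypothetical, and do not represent the actual Solaris or SCSA interfaces.

```python
from typing import List


class HbaCardCode:
    """Toy stand-in for code 120 running on the HBA card (bottom of the stack)."""

    def submit(self, command: str, trace: List[str]) -> str:
        trace.append("HBA card code 120")
        # A real card would now access the disk units; here we just acknowledge.
        return "done:" + command


class PassThroughLayer:
    """Toy kernel layer that records its name and forwards the command downward."""

    def __init__(self, name: str, below) -> None:
        self.name = name
        self.below = below

    def submit(self, command: str, trace: List[str]) -> str:
        trace.append(self.name)
        return self.below.submit(command, trace)


def build_stack() -> PassThroughLayer:
    """Assemble the stack in the order described for FIG. 2, top layer returned."""
    card = HbaCardCode()
    hba_driver = PassThroughLayer("HBA device driver 116", card)
    scsa = PassThroughLayer("SCSA 114", hba_driver)
    return PassThroughLayer("target driver 112", scsa)
```

In this sketch, a command submitted to the object returned by build_stack() visits the target driver 112, the SCSA 114, the HBA device driver 116 and the HBA card code 120 in that order, and the response is returned back up through the same layers.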
The hierarchy of FIG. 2 can be split into three main regions, namely a user level 130, a kernel level 140 and a device level 150. The user level 130 and the kernel level 140 both represent code running on server 10, whereas the device level code 150 is running separately on HBA card 12. The main piece of software at kernel level 140 is the operating system 110, which incorporates the target driver 112, the SCSA 114, and the HBA device driver 116. The operating system 110 is in effect a trusted piece of software, and accordingly it has many associated privileges in kernel mode to allow it to perform its processing. In contrast, the functionality that can be performed (directly) by an application 101 in user space 130 is much more limited than the functionality that can be performed by the operating system 110 within kernel mode 140. This helps to ensure that poorly behaved applications do not cause the entire system to crash.
An important concern at both the software and hardware level is to allow developers and engineers to obtain a good understanding of how a system operates. One reason for this, for example, is to be able to enhance or optimize code in order to improve performance (speed, reliability, and so on). Another motivation is to be able to diagnose and remedy any errors that are experienced.
One particular problem in a multi-component server system including various adapter cards and disk units is to be able to pinpoint the location of any known or suspected fault. To this end, it is desirable to isolate (in effect) the individual portions of the system. This then allows their behavior to be properly understood, independent of the state of the rest of the system.
Unfortunately however, in systems such as shown in FIGS. 1 and 2, the processing results (and any errors included in them) may be dependent upon the particular hardware configuration used. Moreover, given that such systems are usually designed to support a very wide range of possible equipment and configurations, a developer is frequently limited to being able to physically recreate only a very limited subset of such configurations.
For example, a particular customer may be experiencing a software problem with their installation. It is frequently impracticable to analyze these faults in detail on the customer machine itself, which may well be in a production environment and/or at some remote location. Accordingly, a support provider typically tries to reproduce these errors on a dedicated local machine for further investigation and diagnosis. However, the support center may well not have the appropriate hardware to allow it to duplicate the precise configuration of the customer's system. Moreover, it may be difficult for cost or other reasons to acquire this hardware for testing purposes, especially if a wide range of units from different suppliers is involved, such as a server from one manufacturer, a first set of disk drives from another manufacturer, and a second set of disk drives from yet another.
One known approach to try to circumvent this problem is through the use of a simulator or emulator, such as shown in FIG. 3. This Figure matches FIG. 2, except that the HBA device driver 116 from FIG. 2 has now been replaced by a simulator 115, also known in this context as an HBA emulator. As its name suggests, the simulator 115 mimics the presence of a particular configuration of storage units (HBAs and disk drives) without these hardware devices actually needing to be present.
In order to achieve this, the simulator 115 receives storage commands from the application passed through the target driver 112 and the SCSA 114. The simulator then generates an appropriate response, which is passed back up to the application 101, where it appears as if it has come from a bona fide storage unit. This approach therefore allows the protocol stack (less the HBA device driver) within operating system 110 to be tested, but without needing the actual hardware storage units to be present (i.e. to be physically attached to the server 10).
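The request/response behavior described above can be sketched as follows. This Python example is a simplified illustration, not the simulator 115 itself: the class HbaSimulator, its submit method, the operation names and the status strings are all hypothetical, and real SCSI commands and status codes are considerably richer. The sketch shows only the essential idea, namely that commands are answered from memory for a configured set of emulated disk units, with no hardware attached.

```python
from typing import Dict, Iterable, Optional, Tuple


class HbaSimulator:
    """Toy emulator: answers storage commands for a configured set of disks."""

    def __init__(self, disk_labels: Iterable[str], block_size: int = 512) -> None:
        self.block_size = block_size
        # Each emulated disk is a sparse map from block number to block data.
        self.disks: Dict[str, Dict[int, bytes]] = {
            label: {} for label in disk_labels
        }

    def submit(
        self, op: str, disk: str, block: int = 0, data: Optional[bytes] = None
    ) -> Tuple[str, bytes]:
        """Return a (status, payload) pair as an emulated storage unit might."""
        if disk not in self.disks:
            return ("no-such-device", b"")
        if op == "write":
            if data is None or len(data) != self.block_size:
                return ("bad-request", b"")
            self.disks[disk][block] = data
            return ("ok", b"")
        if op == "read":
            # Blocks that were never written read back as zeros.
            blank = bytes(self.block_size)
            return ("ok", self.disks[disk].get(block, blank))
        return ("unsupported", b"")
```

A different storage configuration is emulated simply by constructing the object with a different list of disk labels, which mirrors the way a simulator allows configurations to be tested that are not physically available.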
The arrangement of FIG. 3 permits testing of a much broader range of configurations than would otherwise normally be possible. This testing can be used for investigating errors reported in existing installations, as well as for confirming that particular combinations of hardware and software (potentially from different suppliers) properly support one another, e.g. that they conform to the appropriate interfaces. The simulator can also be used in testing the development of new systems, including predicting the performance of large systems that have many disk units. This latter case may, for example, involve deciding how to allocate the disk units to available HBAs for maximum efficiency.
Nevertheless, the simulator configuration of FIG. 3 does suffer from certain drawbacks. In particular, since the simulator 115 is part of the operating system 110, any modification in the simulator behavior, such as perhaps to emulate a different configuration, generally requires the entire system to be rebooted. Furthermore, because the simulator 115 runs in kernel mode (since it is part of the operating system 110), any error that does occur will typically bring down the entire system. Given that the whole purpose of testing is often to push the system into error, it will be appreciated that a significant proportion of test runs will therefore result in a system crash, and so need a system reboot to proceed further.
It will be appreciated that in the above circumstances it is difficult to assess how a system typically behaves after a storage error (e.g. how robust the remainder of the processing is), given that a simulator error typically causes the whole system to crash. Furthermore, testing becomes rather time-consuming and expensive, since a system reboot will normally be needed between successive tests, either because test parameters are being changed, and/or because there was a system crash during the previous test run due to an error. (Note that such an “error” may be intentional, in other words deliberately generated as part of the testing, or accidental, in that the test procedure itself for this particular investigation is still under development). Consequently, the use of an emulator such as shown in FIG. 3 in order to test the behavior of a storage system can be a cumbersome and time-consuming activity.