A storage server is a computer system, and a form of storage controller, that stores and retrieves data on behalf of one or more clients on a network. It stores and manages that data in a set of mass storage devices, such as magnetic or optical storage-based disks or tapes. A storage server may be configured to service file-level requests from clients, as in the case of file servers used in a Network Attached Storage (NAS) environment. Alternatively, a storage server may be configured to service block-level requests from clients, as is done by storage servers used in a Storage Area Network (SAN) environment. Further, some storage servers are capable of servicing both file-level and block-level requests, as is done by certain storage servers made by NetApp®, Inc. of Sunnyvale, Calif.
In a SAN environment, storage services provided by a storage server can be integrated into a client system's operating system in such a way that, to the user applications running on the client system, the remote storage server and its storage capacity appear to be locally attached. In a NAS environment, the client system is aware that the storage server is remote, and can use file-based protocols such as Network File System (NFS) to access the files stored on the remote storage server. The client system's operating system can also provide an abstraction layer that hides the details of the NAS storage from the user applications executing on the client system. During execution, user applications running on the client system can then access the storage services provided by the remote storage server in much the same way as they access the client system's local storage devices.
The local and/or remote storage services available to a client system can be disrupted by hardware failures or software errors that occur at the client system, at the remote storage server, or in the network in between. Data stored on the client system's local storage devices or on the network storage server can also become corrupted for various reasons. Once stored data becomes unavailable or corrupted during a user application's execution, the operating system and/or the file system often raises exceptions (warnings or error signals) to the user application whenever it attempts to access the stored data. Upon receiving the exceptions, the user application usually aborts its normal operation and tries to process the exceptions instead. If the exceptions are not properly handled, the whole application can terminate abruptly. Even with sophisticated exception-handling logic, the user application is often unable to recover from the service disruption or continue its normal operation. In this case, the user application can at best perform a graceful termination.
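The exception path described above can be illustrated with a minimal Python sketch. It assumes the storage failure surfaces to the application as an `OSError` when a file is opened or read; the function names `read_record` and `main` are hypothetical and chosen only for illustration. The sketch shows the best case noted above: the application cannot recover, so it reports the failure and terminates gracefully.

```python
import errno
import sys


def read_record(path):
    """Return the file's bytes, or None if the storage layer reports a failure."""
    try:
        with open(path, "rb") as handle:
            return handle.read()
    except OSError as exc:
        # The operating system surfaces storage faults as OSError; errno hints
        # at the cause (e.g. ENOENT for a missing path, EIO for an I/O error).
        name = errno.errorcode.get(exc.errno, "unknown")
        print(f"storage access failed ({name}): {exc}", file=sys.stderr)
        return None


def main(path):
    """Hypothetical application entry point; returns a process exit status."""
    data = read_record(path)
    if data is None:
        # No recovery path is available here: stop normal operation and
        # report a non-zero status instead of crashing on an unhandled error.
        return 1
    print(f"read {len(data)} bytes")
    return 0
```

An application might call `sys.exit(main(path))` with a path on a local disk or on a mounted network share; in this sketch both cases go through the same file API, so a disruption on either kind of storage reaches the application in the same way.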
To restore the disrupted storage services, the failed hardware can be replaced, the faulty software can be repaired or reconfigured, and the corrupted data can be restored from previous backups. Still, in many situations, these restorations may alter the original storage configuration. For example, during restoration, a corrupted storage Logical Unit Number (LUN) may need to be disconnected from the operating system and its allocated space released. The LUN's original storage partitions and/or configurations may be deleted and recreated, and the drive letter originally assigned to the LUN may no longer be available. Thus, the restoration of the disrupted storage services may affect even those user applications that were not accessing the storage services during the disruption. These user applications may have to be shut down and relaunched to reconnect to the restored LUN. Otherwise, accessing storage services that no longer exist or have been altered may raise further exceptions.
Storage service disruption imposes additional burdens on large-scale mission-critical applications, such as Enterprise Resource Planning (ERP) applications, which are often required to provide continuous availability with minimal downtime. A large-scale application is often implemented with application components distributed across multiple systems and environments. When local or remote storage services are disrupted at one of the systems or environments, the impact can propagate to application components that are running on other systems or that are unrelated to storage access. Further, large-scale applications often take a long time to shut down and restart. Thus, even though physical data might be preserved via data redundancy at the device or storage server level, the disruption of services and the relaunching process can severely undermine the continuous availability of these applications.