The present disclosure relates generally to computing and data storage devices. More particularly, the present disclosure relates to handling input/output (I/O) errors for applications in multiple computing device environments, such as in server cluster environments.
Common types of computing devices are desktop computers and server systems, with server systems frequently comprising high availability (HA) clusters. Such computers and servers may have both locally connected data storage devices and remotely connected data storage devices. For data storage, an increasingly common technology is storage area networking, often referred to simply as a storage area network (SAN). SAN technology comprises connecting remote computer storage devices, such as disk arrays and optical storage arrays, to servers and other computing devices in such a way that the storage devices appear to the computing devices, and to the operating systems that share the storage devices, as locally attached storage.
Certain aspects of technology for server and cluster infrastructures are well established. High availability clusters may be configured to monitor applications for failures and perform various types of recovery actions for the applications. Typically, a set of distributed daemons monitors cluster servers and associated network connections in order to coordinate the recovery actions when failures or errors are detected. The cluster infrastructure may monitor for a variety of failures that affect cluster resources. In response to a failure, the infrastructure may initiate various corrective actions to restore functionality of affected cluster resources, which may involve repairing a system resource, increasing or changing a capacity of a system resource, and restarting the application.
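By way of illustration only, the monitoring and recovery coordination described above may be sketched as follows. The sketch assumes a single polling pass over per-server probes and a fixed mapping from failure type to corrective action. The names Failure, RECOVERY_ACTIONS, ClusterMonitor, and poll_once are hypothetical and do not correspond to any particular cluster product.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Failure(Enum):
    """Hypothetical failure categories a cluster daemon might report."""
    RESOURCE_FAULT = "resource_fault"
    CAPACITY_EXHAUSTED = "capacity_exhausted"
    APPLICATION_DOWN = "application_down"


# Assumed mapping from failure type to the corrective actions named in
# the text: repair a resource, change its capacity, or restart the
# application.
RECOVERY_ACTIONS: Dict[Failure, str] = {
    Failure.RESOURCE_FAULT: "repair_resource",
    Failure.CAPACITY_EXHAUSTED: "change_capacity",
    Failure.APPLICATION_DOWN: "restart_application",
}


@dataclass
class ClusterMonitor:
    """Polls one probe per server and records the chosen recovery action."""
    probes: Dict[str, Callable[[], List[Failure]]]
    log: List[Tuple[str, Failure, str]] = field(default_factory=list)

    def poll_once(self) -> List[Tuple[str, Failure, str]]:
        # Visit every server; each probe returns the failures it detected.
        for server, probe in self.probes.items():
            for failure in probe():
                action = RECOVERY_ACTIONS[failure]
                self.log.append((server, failure, action))
        return self.log


monitor = ClusterMonitor(
    probes={
        "server-a": lambda: [],                          # healthy server
        "server-b": lambda: [Failure.APPLICATION_DOWN],  # failed application
    }
)
monitor.poll_once()
```

In this sketch the actions are only recorded rather than executed, which keeps the example self-contained; an actual infrastructure would dispatch each action to the affected server.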
For some known systems, restarting an affected application may involve starting a backup copy of the application on a standby or takeover server. Restarting such applications often requires reconfiguring system resources on the takeover server and running various recovery operations, such as performing a file system check (fsck). Further, restarting applications generally requires that the applications perform necessary initialization routines. Even further, when the data stores are not left in consistent states, restarting the applications often requires additional application recovery operations, such as replaying a journal log.
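By way of illustration only, the restart operations described above may be sketched as an ordered list of steps. The ordering shown is an assumption, since the operations are described above without a fixed sequence, and the function name takeover_restart_steps and the step labels are hypothetical.

```python
from typing import List


def takeover_restart_steps(data_store_consistent: bool) -> List[str]:
    """Return the assumed sequence of steps to restart an application
    on a takeover server."""
    steps = [
        "reconfigure_system_resources",    # prepare the takeover server
        "run_file_system_check",           # e.g., fsck on the shared storage
        "run_application_initialization",  # application start-up routines
    ]
    # Additional application recovery is needed only when the data
    # stores were not left in a consistent state.
    if not data_store_consistent:
        steps.append("replay_journal_log")
    return steps


clean_restart = takeover_restart_steps(data_store_consistent=True)
dirty_restart = takeover_restart_steps(data_store_consistent=False)
```

The sketch makes visible that every restart incurs the reconfiguration, file system check, and initialization steps, while journal replay is added only for inconsistent data stores.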