The invention relates generally to the field of remote storage replication. More particularly, the invention relates to a method and apparatus for increasing application availability during a disaster fail-back.
Application downtime following a server failure is a growing problem, owing to the widespread use of computerized applications and the ever-expanding economy driven by electronic commerce. To increase data availability and reduce application downtime, customers typically build a disaster recovery site that can take over during a disaster or server failure. A significant amount of time and planning goes into ensuring that, following a disaster, fail-over occurs as rapidly as possible. To this end, many vendors provide methods to reduce this downtime.
Remote storage replication is a technique used primarily for disaster protection. Vendors optimize the replication processes to expedite fail-over from a primary site to a secondary, or disaster recovery, site. A problem that receives less attention, however, is the time required to return operations to the primary site once the problems that caused the fail-over have been resolved.
Customer Service Level Agreements (SLAs) define the level of service provided to customers and typically state the amount of downtime that is acceptable when a disaster strikes. However, customer SLAs rarely specify requirements for returning operations to the original site once the problem has been resolved. With this in mind, most companies offer solutions that minimize the duration of the fail-over process but pay less attention to the requirements of restoring the state of the primary site (fail-back). In addition, conventional fail-back processes are rarely, if ever, tested under real, live conditions. During customer testing, the fail-over process is likely to be well documented. Scripts are generally written and seem to work as efficiently as possible. However, at the conclusion of fail-over testing, operations continue on the primary site without the data replicated to the secondary site ever having to be restored.
Conventional fail-back processes are lengthy and often require the entire data set to be copied from the secondary site back to the primary site. These processes, if ever required, can be time-consuming and involve a considerable amount of application unavailability, or downtime. Because conventional fail-back often requires a complete resynchronization of the data to the primary server, its duration grows with the size of the data, and for large data sets this can be a very long process that adds to the application downtime.
The present invention provides a method and apparatus for increasing availability of an application during fail-back from a secondary site to a primary site following a failure at the primary site. The method includes copying data from active storage volumes to secondary storage volumes on the secondary site while the application runs on the secondary site and updates the active storage volumes. Once the secondary storage volumes of the secondary site are updated, the data is resynchronized from the secondary storage volumes on the secondary site to the primary storage volumes of the primary site. The steps of copying the data and resynchronizing the data are repeated for data updated by the application during the resynchronization, until the time required to complete the resynchronization step for the updated data is within an acceptable downtime for the application. Once this condition is met, the application is failed back to the primary site by bringing up the application at the primary site. Application availability is therefore increased by limiting the application downtime to the acceptable downtime.
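The repeated copy-and-resynchronize loop described above can be illustrated with a minimal sketch. It is a simplified model rather than the claimed implementation, and it rests on assumptions not stated in the text: the application writes to the active volumes at a constant rate, resynchronization to the primary site proceeds at a constant rate, and the local copy from active to secondary storage volumes is treated as instantaneous. The names `plan_failback`, `FailbackPlan`, and all parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailbackPlan:
    passes: int              # copy + resynchronize passes run while the application stays up
    final_delta_mb: float    # data still to be resynchronized once the application is stopped
    final_downtime_s: float  # estimated downtime for that final resynchronization


def plan_failback(initial_delta_mb, write_rate_mb_s, resync_rate_mb_s,
                  acceptable_downtime_s, max_passes=50):
    """Estimate how many passes are needed before fail-back fits the downtime budget.

    Assumed model (hypothetical, for illustration only): constant write and
    resynchronization rates, and an instantaneous local copy step.
    """
    if resync_rate_mb_s <= 0:
        raise ValueError("resynchronization rate must be positive")
    if write_rate_mb_s >= resync_rate_mb_s:
        # Each pass would leave at least as much new data as it transferred.
        raise ValueError("updates accumulate as fast as they are resynchronized")

    delta_mb = float(initial_delta_mb)
    passes = 0
    # Repeat until resynchronizing the outstanding updates fits the acceptable downtime.
    while delta_mb / resync_rate_mb_s > acceptable_downtime_s:
        if passes >= max_passes:
            raise RuntimeError("downtime target not reached within max_passes")
        pass_time_s = delta_mb / resync_rate_mb_s
        # Data updated by the application during this pass becomes the next delta.
        delta_mb = write_rate_mb_s * pass_time_s
        passes += 1

    return FailbackPlan(passes, delta_mb, delta_mb / resync_rate_mb_s)


if __name__ == "__main__":
    # Illustrative numbers only: 1,000,000 MB to restore, 10 MB/s of new writes,
    # 100 MB/s resynchronization, 60 s of acceptable downtime.
    print(plan_failback(1_000_000, 10, 100, 60))
```

With these illustrative figures, a single full resynchronization with the application stopped would take 10,000 seconds. Under the assumed model, three passes with the application still running shrink the outstanding data to 1,000 MB, so the final resynchronization, and with it the application downtime, takes about 10 seconds.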