Field of the Invention
This invention relates to a method for processing a disaster recovery setup using a policy-based automation engine controlling at least two sites of a computing environment, a computing environment for the disaster recovery setup and a computer program product containing code portions to process the disaster recovery setup.
Description of Background
Within enterprise computing centers dedicated to supporting an IT infrastructure, human operators are employed to keep diverse applications up and running. In order to achieve high levels of availability, software programs, typically called ‘automation products’, are used to support the operators. An IT infrastructure consists of systems hosting applications and of direct access storage devices for saving persistent data required by the applications.
An IT infrastructure which is prepared for disaster scenarios, such as a complete power outage of the building hosting the infrastructure, has typically been set up with the following points in mind:
a) A backup site has been defined and set up. Production usually runs on a production site (site 1) and is moved to the backup site (site 2) only when the production site is no longer available;
b) Systems are available on site 2 to host the production applications. Applications are installed and configured ready-to-run on site 2;
c) Data which is required by those applications is available and current on site 2, thus allowing the applications to restart on site 2 without losing the state of operations they were in on site 1.
To be prepared for point c) of this described setup, replication techniques have been established to ensure that data written to a storage device on site 1 is almost instantly copied over (also called “replicated”) to site 2.
In these Data Replication (DR) enabled setups it is crucial for an application that (i) the required data on the storage device is accessible on the site where the application is running, (ii) replication is enabled and working, and (iii) replication is directed to the opposite site. Conditions (ii) and (iii) are mandatory if the applications are required to be DR enabled at all times.
Data replication can be implemented by different technologies. Some storage devices offer synchronous replication to another storage device of the same type as a built-in service. This kind of data replication is usually identified as “storage-based replication”. Other storage devices do not implement this kind of service. For this situation, software solutions exist that implement the data replication, usually on the device driver layer of the operating system to which the storage device is attached.
Typically, the replication direction has to be configured before the replication task itself is started. Whenever it is required to change the replication direction, the following steps are executed:
1. Stop the data replication;
2. Reconfigure replication direction;
3. Start the data replication.
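By way of illustration only, the three steps above can be sketched in Python. The `ReplicationSession` class and its methods are hypothetical stand-ins and do not correspond to the interface of any particular storage device or replication product; the sketch merely assumes that the direction cannot be changed while replication is running.

```python
from dataclasses import dataclass


@dataclass
class ReplicationSession:
    """Hypothetical stand-in for a replication service (not a real product API)."""
    source: str
    target: str
    running: bool = False

    def stop(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

    def set_direction(self, source: str, target: str) -> None:
        # Assumption: the direction may only be changed while replication is stopped.
        if self.running:
            raise RuntimeError("stop replication before changing its direction")
        self.source, self.target = source, target


def reverse_direction(session: ReplicationSession) -> None:
    session.stop()                                         # 1. stop the data replication
    session.set_direction(session.target, session.source)  # 2. reconfigure the direction
    session.start()                                        # 3. start the data replication


session = ReplicationSession(source="site1", target="site2", running=True)
reverse_direction(session)
print(session.source, "->", session.target, "running:", session.running)
```

Calling `set_direction` on a running session raises an error in this sketch, which mirrors the requirement that the direction be configured before the replication task is started.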
From a functional point of view, an automation product often handles different scenarios where an application and the IT resources must be, for example, stopped, moved or restarted, either in planned scenarios for maintenance purposes or in unplanned scenarios when failures occur. The automation products in use are typically script-based or policy-based. Scripts are often written by a system application programmer or by a system administrator to implement the desired automation support. It is also possible that automation products are policy-based, i.e. they use an abstract configuration description of the application and the IT resources needed to run the application.
As mentioned above, scripts are often written by a system application programmer or by system administrator staff to implement the desired automation support. The drawback of the script-based approach is that any change in hardware, operating system, middleware, data replication technique or application setup results in labor-intensive updates and tests of the automation scripts. Software vendors sell automation products, which typically have to be customized before they can be used to automate IT resources. These vendor automation products are also often script-based. This means that the system administrator staff must write script plugins to implement the desired automation support. Here, the drawbacks are identical to the ones described above.
Other vendor automation products are policy-based. In this context an ‘automation policy’ is an abstract configuration description of the application and the IT resources needed to run the application. A prior art automation policy typically consists of ‘grouping concepts’ and of relationships. In comparison to other approaches, the policy-based approach has benefits. It is easy to adapt to changes in hardware, software, operating system, middleware or application setup, because only a few changes in the automation policy definition are needed to reflect a new configuration.
Policy-based automation products typically support the following entities:
a) A definition of resources with a defined availability state. These resources typically express hardware or software entities.
b) A grouping concept to aggregate resources for an intuitive, single point of control. Groups also generally have a defined availability state.
c) A concept for relationships between defined resources and/or groups. Relationships define how the availability state defined for multiple resources will be reached by the automation product.
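As a minimal sketch, and not the data model of any actual automation product, the three entity types could be represented in Python as follows. The class names, the state values ‘online’ and ‘offline’, and the rule that a group is online only when all of its members are online are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    state: str = "offline"  # availability state, assumed to be "online" or "offline"


@dataclass
class Group:
    name: str
    members: list = field(default_factory=list)

    @property
    def state(self) -> str:
        # Assumed aggregation rule: a group is online only if every member is online.
        all_online = bool(self.members) and all(m.state == "online" for m in self.members)
        return "online" if all_online else "offline"


@dataclass
class Relationship:
    kind: str    # name of the relationship type, e.g. "StartAfter"
    source: str  # resource or group the relationship starts from
    target: str  # resource or group the relationship points to


application = Resource("application")
storage = Resource("storage", state="online")
stack = Group("site1-stack", members=[application, storage])
policy = [Relationship("StartAfter", source="application", target="storage")]

print(stack.state)  # offline, because the application is not yet online
application.state = "online"
print(stack.state)  # online
```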
Relationships are constraints on the automation behavior. Examples of relationships include ‘StartAfter’, ‘StopAfter’ and ‘ForcedDownBy’. The automation manager respects relationships as part of the policy, so they influence the automation behavior. For example, if a resource that has a StartAfter relationship to another resource is given the desired state ‘online’, the latter resource is started before the former.
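The effect of a StartAfter relationship on the start sequence can be illustrated with a short Python sketch. The policy dictionary and the `start_order` helper are hypothetical, and the sketch assumes that the relationships contain no cycles.

```python
# Hypothetical policy: each resource maps to the resources it has a
# StartAfter relationship to, i.e. the resources that must be started first.
START_AFTER = {
    "application": ["storage", "server"],
    "server": [],
    "storage": [],
}


def start_order(resource, start_after, started=None):
    """Return the order in which resources are started to bring `resource` online.

    Assumes the relationships are acyclic; no cycle detection is performed.
    """
    if started is None:
        started = []
    for dependency in start_after.get(resource, []):
        if dependency not in started:
            start_order(dependency, start_after, started)
    if resource not in started:
        started.append(resource)
    return started


print(start_order("application", START_AFTER))
# ['storage', 'server', 'application']
```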
Furthermore, automation products can be goal driven or command driven. Goal driven automation means that the automation software knows the automation goal for each resource it manages. Automation goals are typically called requests. There may be multiple competing and possibly conflicting requests on a single resource. Requests have a priority and the request with the highest priority wins and determines the so-called ‘desired state’ of the resource. Possible desired state values for a resource are for example ‘online’ or ‘offline’. The automation software pursues the winning request of a resource by trying to keep the resource in its desired state. In a command driven automation product, the last issued command against a resource, i.e. start or stop, always wins. This means that there cannot be multiple or competing commands for a resource at a time. The automation product of the present invention is goal driven.
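The resolution of competing requests into a desired state can be sketched as follows. The `Request` class is hypothetical; the sketch assumes that a larger number denotes a higher priority and that a resource without any request defaults to ‘offline’, neither of which is prescribed by the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    desired_state: str  # "online" or "offline"
    priority: int       # assumption: a larger number means a higher priority
    source: str         # who submitted the request (illustrative only)


def desired_state(requests, default="offline"):
    """Return the desired state dictated by the winning (highest priority) request."""
    if not requests:
        return default  # assumption: no request means the resource stays offline
    return max(requests, key=lambda r: r.priority).desired_state


requests = [
    Request("online", priority=1, source="operator"),
    Request("offline", priority=5, source="maintenance"),
]
print(desired_state(requests))  # offline: the maintenance request wins

requests.pop()                  # the maintenance request is removed
print(desired_state(requests))  # online: the operator request now wins
```

Unlike a command driven product, the lower priority request is not lost in this sketch; it takes effect again as soon as the winning request is removed.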
In an event-driven automation product, the automation engine subscribes for events of the managed resources. The managed resources have the obligation to inform the subscribers in case of any status change. Thus, a new automation cycle is triggered either by events being received and/or requests being submitted or removed. Event-driven system automation has the advantage that a permanent re-evaluation is not required, which thus saves valuable computational resources.
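A minimal Python sketch of this subscription mechanism is given below. The classes `ManagedResource` and `AutomationEngine` are hypothetical, and the engine merely records each triggered automation cycle where a real engine would re-evaluate its policy.

```python
class ManagedResource:
    """Hypothetical managed resource that informs subscribers of status changes."""

    def __init__(self, name, status="offline"):
        self.name = name
        self.status = status
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def set_status(self, status):
        if status == self.status:
            return  # no status change, so no event is published
        self.status = status
        for callback in self._subscribers:
            callback(self.name, status)


class AutomationEngine:
    """Hypothetical engine that starts an automation cycle for each received event."""

    def __init__(self):
        self.cycles = []

    def on_event(self, resource_name, status):
        # A real engine would re-evaluate requests and relationships here.
        self.cycles.append((resource_name, status))


engine = AutomationEngine()
storage = ManagedResource("storage")
storage.subscribe(engine.on_event)

storage.set_status("online")
storage.set_status("online")   # unchanged status: no new cycle is triggered
storage.set_status("offline")
print(engine.cycles)  # [('storage', 'online'), ('storage', 'offline')]
```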
The term automation engine or automation product is used here for software which automates operator tasks for the purpose of continuously or highly available applications, even within the scope of a multiple site disaster recovery setup including automated site switches. Applications and their required data, also called resources in this context, are kept highly available and correctly directed by the automation product.
Functions or services delivered by resources as defined above are typically not seen as entities of policy-based automation products. That means generally only the availability state is monitored and ensured by state of the art cluster high availability products. Usually it is not possible to express functional side aspects of such resources.
Looking at the example of data replication, an automation product is normally only able to ensure that a data device is working (available); it does not check whether the device is also currently providing a specific service configuration, such as replicating data to another specific site.
With reference to FIG. 1, a software application resource 101 on site 1 is hosted by server 102 and is dependent on the availability of the storage device 103, since the data of resource 101 is written to the storage device 103 via the I/O path 107. On site 2, there is an identical setup with a stopped software application resource 104 that is the backup of software application resource 101, is hosted by server 105 and, if it is running, is dependent on the availability of the storage device 106.
The requirement for resource 101 is that it must be prepared to restart after a disaster and to continue working in the state it was in at the moment the disaster happened. Therefore, resource 101 has the requirement that data written to the storage device 103 is replicated to site 2. The storage device 103 is configured to replicate all data which is being stored on it to the storage device 106. This replication is set up in one direction only.
In a site failover situation, also called a site switch, the application resource 104 can be started on site 2 and can take over the work of resource 101 based on the data it loads from storage device 106.
As can be seen from this scenario, it is crucial that the application 101 or 104, respectively, is only started when:
a) The data is available and up-to-date on the same site where the application is going to be started;
b) The replication is targeted to the other site. This makes the data on the storage system accessible at the same site where the application is going to be started.
All other runtime situations would cause the application not to be prepared for a disaster scenario, which would be a violation of the requirements against it.
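The two start conditions can be expressed as a simple check, shown here as a Python sketch. The `SiteStorage` record and its fields are hypothetical, and the sketch assumes exactly two sites named ‘site1’ and ‘site2’.

```python
from dataclasses import dataclass


@dataclass
class SiteStorage:
    site: str                # site on which the storage device is located
    data_current: bool       # data is available and up to date on this site
    replication_target: str  # site that replication is currently directed to


def may_start_application(site: str, storage: SiteStorage) -> bool:
    """Return True only if conditions a) and b) hold for starting on `site`."""
    other_site = "site2" if site == "site1" else "site1"  # assumption: two sites only
    return (
        storage.site == site
        and storage.data_current                      # condition a)
        and storage.replication_target == other_site  # condition b)
    )


normal = SiteStorage(site="site1", data_current=True, replication_target="site2")
wrong_direction = SiteStorage(site="site2", data_current=True, replication_target="site2")

print(may_start_application("site1", normal))           # True
print(may_start_application("site2", wrong_direction))  # False
```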
FIG. 2 shows the setup in a disaster case. The software application resource 205 is now running on server 206 and is accessing the data on the storage device 208 via the I/O path 207. The server 202 is broken, so the software application resource 201 is no longer running and the I/O path 203 is not established. However, the storage device 204 is still available so data can be replicated from storage device 208 to 204.
In order to transition from the state described in FIG. 1 to the state in FIG. 2, a number of manual steps have to be performed. FIG. 3 shows these manual steps. If the application on site 1 is still running, it has to be stopped (301). The application might not be running if it has crashed and cannot be restarted, or if the server has crashed. After that, the replication has to be stopped (302) and the direction has to be changed to Site2-Site1. If the data volumes on site 1 are still available, the replication can be started again (304). Finally, the application is started on site 2.
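For illustration only, the manual sequence can be written down as a Python function operating on a plain dictionary. The dictionary keys and the wording of the recorded steps are hypothetical and merely mirror the order of the steps described above.

```python
def site_switch(state):
    """Apply the manual site switch steps to `state` and return the steps taken."""
    steps = []
    if state["app_site1_running"]:
        state["app_site1_running"] = False
        steps.append("stop application on site 1")
    state["replication_running"] = False
    steps.append("stop replication")
    state["replication_direction"] = ("site2", "site1")
    steps.append("change replication direction to site 2 -> site 1")
    if state["site1_volumes_available"]:
        state["replication_running"] = True
        steps.append("start replication")
    state["app_site2_running"] = True
    steps.append("start application on site 2")
    return steps


# Disaster case: the server on site 1 has crashed, but its volumes survive.
state = {
    "app_site1_running": False,
    "app_site2_running": False,
    "replication_running": True,
    "replication_direction": ("site1", "site2"),
    "site1_volumes_available": True,
}
for step in site_switch(state):
    print(step)
```

Because the application on site 1 is already down in this example, the first step is skipped and four steps are printed.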