Software applications can include a number of processes that may run in a distributed architecture. Methods of managing software applications, however, are often disjointed and require administrators to carry out a number of tasks, such as running commands or scripts, on a number of devices, often in a coordinated sequence. Additionally, the manual management of processes can be cumbersome and time- or cost-prohibitive, requiring an individual or team to monitor, coordinate and maintain the applications on a near-constant basis. As a result, and due to a variety of other factors, systems often become unavailable and errors are made.
One known environment uses a monitoring system for notification and relies on a combination of operational documentation, specialized scripts for specific remediation steps, and a variety of trained personnel for application management. However, this approach often takes a significant amount of time to set up, requires near-constant monitoring, and is prone to errors.
Still other known environments utilize clustered solutions to automate failover of a specific component of an application, for example a database. These environments typically support a specific process across two or more closely coupled servers, but they do not include advanced business rules for managing the multitude of processes that make up an application as a whole. These solutions are also often overly complex and cost-prohibitive, and are limited to one operating system at a single physical location.
Another problem commonly faced in computing environments is disaster recovery. Disaster recovery, or the recovery of data and systems following a complete failure or extended outage, is typically expensive to set up and may require a number of technicians to implement and support various complex elements. For example, operational support often involves a combination of standard monitoring systems and detailed failover and test plans. The failover steps themselves are often manual or leverage basic automation for specific functions. However, performing failover or, alternatively, failback can be time-consuming and may introduce risk, since routine testing of these procedures is challenging.