Computer networks have become backbones of companies throughout the world. Even if a company does not provide products or services over the Internet, computer networks within the company improve employee productivity by providing employees with instantaneous access to millions of bytes of data. In fact, many companies are unable to function when the company's computer network fails. Thus, it is imperative that companies have reliable computer networks with 99.999% uptime.
Conventionally, a computer network may be provided with additional resiliency to failures by having a disaster recovery (DR) plan. That is, when a failure in the computer network occurs, a plan is available to quickly bring the computer network back to functional status. Disaster recovery plans may include actions taken by one or more actors. For example, a recovery plan may include switching to backup systems at the location of the failure. More drastic disasters may call for switching to backup systems at a location remote from the site of the failure.
To improve response to disaster events within the computer network, control over servers and partitions within a computer network may be automated. This may be referred to as Server Control Automation (SCA). SCA may be executed on one or several control systems and conventionally uses secure connections to communicate with the many separate systems associated with a DR plan. Each connection involves initial login and status inquiries to determine the current state of the separate systems and storage. Once the state is determined, however, many changes may occur that cause the DR plan to malfunction when it is needed.
For example, the connected system may have its security parameters changed. Security groups within many organizations may change system and networking parameters to prevent hacking, for example as best practices evolve over time. This security tightening can impair the capability of control systems to perform SCA, and the resulting failure may not be noticed until a DR plan takes effect.
As another example, the profile used by a control system to perform SCA by logging on to each monitored system may have been disabled, may have had its password changed, or may have been affected in other ways. Without a correctly logged-on user, the control system is not able to detect failures or orchestrate DR. In addition, because the control systems log on and poll each monitored system for its current state, they depend on utilities and components (such as an underlying Software Development Kit (SDK) or Secure Shell (SSH) server) that may have been uninstalled, set to a disabled state, or become nonfunctional for many other reasons. Furthermore, a monitored system may hang in such a way that the “frozen” state of the system cannot be detected.
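To make these failure modes concrete, the following is a minimal sketch, in Python, of how a control system might distinguish some of them before a DR plan is needed. It is an illustration under stated assumptions and not a method described in this document: the names `ProbeResult`, `classify_banner`, and `probe_ssh` are hypothetical, and the check relies only on the fact that an SSH server announces itself with an identification line beginning with `SSH-`. It does not attempt a login, so a disabled profile or a changed password would still go unnoticed by this probe.

```python
import socket
from enum import Enum


class ProbeResult(Enum):
    """Hypothetical outcome categories for one connectivity probe."""
    OK = "ok"                    # an SSH identification line was received
    UNREACHABLE = "unreachable"  # connection refused, reset, or timed out
    NO_BANNER = "no_banner"      # connected, but nothing was sent (possible hang)
    NOT_SSH = "not_ssh"          # something answered, but not an SSH server


def classify_banner(banner: bytes) -> ProbeResult:
    """Classify the first bytes read from the monitored system."""
    if not banner:
        return ProbeResult.NO_BANNER
    # A server may send other lines before its identification line,
    # so every received line is checked for the "SSH-" prefix.
    if any(line.startswith(b"SSH-") for line in banner.splitlines()):
        return ProbeResult.OK
    return ProbeResult.NOT_SSH


def probe_ssh(host: str, port: int = 22, timeout: float = 3.0) -> ProbeResult:
    """Open a TCP connection and wait up to `timeout` seconds for a banner."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            try:
                banner = sock.recv(256)
            except socket.timeout:
                # The port accepted the connection but stayed silent.
                banner = b""
    except OSError:
        return ProbeResult.UNREACHABLE
    return classify_banner(banner)
```

Running such a probe on a schedule, rather than only at initial login, is one way the silent changes described above could surface early; the `NO_BANNER` case is only a hint of a hung system, since a slow or heavily loaded server can produce the same result.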