One of the major challenges in computer operations is how to maintain high availability of applications to user entities. As more mission-critical services become remotely accessible, and as more businesses become intertwined in mesh-like networks, the need for better ways to ensure high availability has grown more pressing.
Past solutions for maintaining high availability have predominantly focused on increasing hardware and software reliability. However, once a program stopped working or crashed, the common solution offered was a local reboot of the affected platform. When the affected platform is remote yet can only be restarted manually, availability may be lost for hours at a time. Even systems with a local administrator can be down long enough to impact operations, particularly if the administrator is not immediately notified or is not available to tend to the outage.
Another factor affecting the ability to deliver high availability is the serviceability of a system's components. Many software applications need constant upgrades or patches. Frequently, an application program must be restarted after the changes have been made in order for the program to work in its modified state. If the application is on a remote platform, the inability to automatically restart the application may mean that necessary changes have to be deferred until qualified technicians can visit the remote site. Even with local systems, downtime may be prolonged if technicians are not able to stand by and monitor for when an upgrade or patch is complete and the application is ready for restart.
One solution to these problems for systems needing high availability is the use of duplicate or mirrored platforms, sometimes running constantly in a “hot-swappable” configuration. While this does address many of the problems noted above, the implementation can be complex and cost-prohibitive. Thus, this is not a viable option for most systems.
Another approach, which provides a limited remote start functionality, can be found in Borland's VisiBroker® object start/deferred start capability. This feature operates in a CORBA Object Request Broker (ORB) runtime environment via an object activation daemon (OAD). The OAD is an implementation of the CORBA Implementation Repository, providing a runtime repository of information about the classes a server supports, the objects that are instantiated, and their IDs. The OAD may also be used to automatically activate an implementation when a client references an object registered with the OAD. This latter feature reduces overhead by allowing servers that implement objects for client applications to be started on demand, rather than running continuously. However, because of its ORB architecture, this activation functionality will not work across the internet. In addition, each object implementation must be a child process of the OAD process, with all environment variables passed into the OAD.
JMX, or Java Management Extensions, offers yet another approach to remotely activating components. JMX operates by instantiating a management agent within a JVM (Java virtual machine). This agent has an MBean server instance, an adapter, and a set of services. The agent can effectively change the state of a component (e.g., to start or stop it) by controlling the MBean server to pass messages based on start or stop requests. However, this is a Java-specific implementation, and a key weakness is its reliance on the agent running within a JVM environment. If the JMX agent or the JVM is down, there is no way to restart the adapter (agent) or dependent services.
Thus, while these two programs have been designed with the ability to remotely start or stop other registered objects, they are limited to control of child processes (activated via an ORB OAD) or to control of clients via an agent server instance. Neither provides or suggests an automated approach for restarting agents or remote applications that have lost connectivity. Other solutions, such as manual intervention or hot-swappable mirror sites, are too complex, expensive, and/or time-consuming to be widely adopted. Thus, there remains a need for a better way to increase the availability and serviceability of networked applications.