1. Field of the Invention
The present invention generally relates to operation of network-connected computers running similar or diverse applications and, more particularly, to error and failure detection and automatic failure and error recovery in a transparent manner together with improved communication routing from any network-connected computer to any other.
2. Description of the Prior Art
In recent years, it has become much more common for large endeavors having generally large but highly variable requirements for data processing power to utilize, for a variety of reasons, a potentially large number of network-connected data processors rather than a single large so-called mainframe data processor. Such network-connected data processors may be widely distributed geographically and may or may not collectively operate as a larger system. For example, the human resources required for the endeavor may be widely distributed geographically with many resources available for substantial periods of time which would otherwise be largely unused. Moreover, much of the resulting data may be principally useful only locally with periodic summaries and status reports being sufficient for oversight of the progress of the overall endeavor, thereby reducing communication overhead as compared with utilization of a central mainframe data processor. It is also common at the present time to configure computers with comparatively greater computing power as a plurality of independent data processors.
In this regard, networked data processors are often connected in so-called clusters (e.g. comprising a plurality of independent systems which may be actual independent hardware or logical systems made from partitions across one or more larger systems, such as mainframe systems, and which are networked together through TCP/IP over Ethernet™ or some other physical network connection) to improve efficiency of utilization of communication resources as well as to improve availability of information to persons with a need for it.
However, some diversity of applications among network-connected computers is inevitable when it is sought to utilize the data processing capability of existing resources. Such diversity carries issues of compatibility and some complications resulting from possible failure of particular network-connected data processors, as well as from recovery from errors and failures in communications, in particular data processors, or both.
As a result, applications must currently be coded to work properly as part of some existing cluster design and on a specific platform if they are to have adequate failover recovery/avoidance built into the applications. Failover has generally connoted an arrangement where so-called hot standby machines are provided in a redundant manner for each or most active processors. When failure occurs or is detected, the hot standby machine assumes the processing load and network connections of the failed processor and, in effect, substitutes itself for the failed processor in the network. However, the change-over processing to do so generally requires a period of five to ten minutes or more subsequent to the detection of a failure, and such a back-up system of redundant processors virtually doubles the hardware requirements of the networked system. For applications which have not been specifically coded as part of a cluster, system and application failure is handled through monitoring which triggers the failover/processor substitution process and notification for manual or partially automated recovery. However, at the present state of the art, automation of recovery is limited to implementation as a primary/back-up configuration where data processing results are mirrored on another closely associated or dedicated data processor, such as the failover arrangement described above. It should be recognized that failover arrangements using a hot standby machine are also subject to complications arising when the active machine and the hot standby machine are differently configured, and that the only way to be certain of good failover performance (which nevertheless requires excessive time) is to induce a failure and correct any problems then encountered in the failover processing.
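By way of illustration only, the primary/hot-standby substitution described above may be sketched as follows. This is a minimal model under stated assumptions and does not represent any particular prior art product: the names `Node` and `FailoverMonitor`, the heartbeat-based detection, and the five-second timeout are all hypothetical and are introduced solely for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A processor in a primary/hot-standby pair (hypothetical model)."""
    name: str
    connections: list = field(default_factory=list)
    active: bool = False


class FailoverMonitor:
    """Watches the primary's heartbeats and substitutes the standby on timeout."""

    def __init__(self, primary, standby, timeout_s=5.0):
        self.primary = primary
        self.standby = standby
        self.timeout_s = timeout_s  # assumed detection threshold, in seconds
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        """Record that the primary reported in at time `now` (seconds)."""
        self.last_heartbeat = now

    def check(self, now):
        """Return the node currently serving; fail over if heartbeats stopped."""
        if self.primary.active and now - self.last_heartbeat > self.timeout_s:
            # The standby assumes the failed processor's network connections
            # and, in effect, substitutes itself for the failed processor.
            self.standby.connections = self.primary.connections
            self.primary.connections = []
            self.primary.active = False
            self.standby.active = True
        return self.primary if self.primary.active else self.standby


primary = Node("primary", connections=["client-a", "client-b"], active=True)
standby = Node("standby")
monitor = FailoverMonitor(primary, standby, timeout_s=5.0)

monitor.heartbeat(now=0.0)
serving_before = monitor.check(now=3.0).name  # within timeout: primary serves
serving_after = monitor.check(now=9.0).name   # heartbeat missed: standby serves
```

The sketch models only detection and substitution. It does not capture the change-over delay, the doubled hardware requirement, or the configuration mismatch between active and standby machines noted above, which are the principal shortcomings of such arrangements.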
Therefore, monitoring for error detection often requires separate communication facilities and thus cannot adequately monitor the communication facilities themselves or directly provide error or failure recovery of communication facilities, while restoration of operation from backed-up data and state information on a different processor often requires an extended period of time (e.g. several seconds to several minutes). During such time, there is often no alternative for the remainder of the system beyond waiting for the recovery of the failed resource(s). Thus, at the present state of the art, while error or failure recovery techniques for specific resources may be relatively well-developed, monitoring for errors and failures of resources requires substantial hardware and processing overhead. Moreover, actual error and failure recovery may require protracted periods of time, involves utilization of other fully functional back-up resources only on an as-needed basis, and is far from transparent to users.