The deployment of networked computer services, such as Internet search engines, email, messaging and other communications platforms, continues to expand and proliferate. Those and other services require Internet service providers (ISPs) and others to deploy increasingly capable back-end infrastructure to support the range and responsiveness of service expected by consumers, businesses and others. Installation of those resources, such as server farms, high-volume databases and others, in turn leads to demands for increased platform connectivity, and results in greater dependency on all service components cooperating effectively to deliver the search or other services.
However, in an extensive installation such as a server farm arranged to support an Internet search engine or other application, the interdependency of numerous machines, connections and software may lead to faults or degradation in user-side performance when any one or more of the component resources crash or otherwise become inoperable. For example, a collection of servers which access travel, hotel and other remote data sources may hang or crash when executing a search on “Hawaii” or other terms when connections to one or more remote databases break or degrade. A user viewing a search page may therefore be presented with a blank screen, 404 error or other interruption or failure notification. This may occur even when other components, connections or data sources are still functioning and could perhaps return data to be presented to the user.
In network service installations, to address that type of service interruption, some operators may choose to install network monitoring packages which generate alerts to systems administrators, to advise them for example that processor utilization has become dangerously high on one group of servers, or that a backbone connection to a data source has broken down. This may permit the systems administrator or other personnel to step in and manually adjust communications links, activate redundant servers or take other actions. However, such arrangements still require the intervention and judgment of a human operator to sense and balance network performance in the presence of faults and other conditions. This, among other things, may lead to errors in judgment or to a response time which is not acceptable or optimal during urgent network outages or other conditions. Moreover, human operators may only have the ability to monitor and act on a fairly limited number of connections or other resources for emergency override purposes. Other problems in the management of networked computer services and the reliable operation of those services exist.