1. Field of the Invention
The present invention relates, in general, to computing methods for remotely diagnosing faults, errors, and conditions within a grid-based computing system. More particularly, the present invention relates to rule based processes and computing environments for probabilistic identification of potential root causes of faults, errors and other conditions within a grid-based computing system.
2. Relevant Background
Grid-based computing utilizes system software, middleware, and networking technologies to combine independent computers and subsystems into a logically unified system. Grid-based computing systems are composed of computer systems and subsystems that are interconnected by standard technology such as networking, I/O, or web interfaces. While comprised of many individual computing resources, a grid-based computing system is managed as a single computing system. Computing resources within a grid-based system each can be configured, managed and used as part of the grid network, as independent systems, or as a sub-network within the grid. The individual subsystems and resources of the grid-based system are not fixed in the grid, and the overall configuration of the grid-based system may change over time. Grid-based computing system resources can be added or removed from the grid-based computing system, moved to different physical locations within the system, or assigned to different groupings or farms at any time. Such changes can be regularly scheduled events, the results of long-term planning, or virtually random occurrences. Examples of devices in a grid system include, but are not limited to, load balancers, firewalls, servers, network attached storage (NAS), and Ethernet ports, and other resources of such a system include, but are not limited to, disks, VLANs, subnets, and IP Addresses.
Grid-based computing systems and networking have enabled and popularized utility computing practices, otherwise known as on-demand computing. If one group of computer users is working with bandwidth-heavy applications, bandwidth can be allocated specifically to them using a grid system and diverted away from users who do not need the bandwidth at that moment. Typically, however, a user will need only a fraction of their peak resources or bandwidth requirements most of the time. Third party utility computing providers outsource computer resources, such as server farms, that are able to provide the extra boost of resources on-demand of clients for a pre-set fee amount. Generally, the operator of such a utility computing facility must track “chargeable” events. These chargeable events are primarily intended for use by the grid-based computing system for billing their end users at a usage-based rate. In particular, this is how the provider of a utility computing server farm obtains income for the use of its hardware.
Additionally, grid-based systems must monitor events that represent failures in the grid-based computing system for users. For example, most grid-based systems are redundant or “self-healing” such that when a device fails it is replaced automatically by another device to meet the requirements for the end user. While the end user may not experience any negative impact upon computing effectiveness, it is nevertheless necessary for remote service engineers (“RSE”) of the grid system to examine a device that has exhibited failure symptoms. In particular, a RSE may need to diagnose and identify the root cause of the failure in the device (so as to prevent future problems), to fix the device remotely and to return the device back to the grid-based computing system's resource pool.
In conventional operation of a grid-based computing system, upon an indication of failure, a failed device in the resource pool is replaced with another available device. Therefore, computing bandwidth is almost always available. Advantages associated with grid-based computing systems include increased utilization of computing resources, cost-sharing (splitting resources in an on-demand manner across multiple users), and improved management of system subsystems and resources.
Management of grid-based systems, due to their complexity, however can be complicated. The devices and resources of a grid-based system can be geographically distributed within a single large building, or alternatively distributed among several facilities spread nationwide or globally. Thus, the act of accumulating failure data with which to diagnose and address fault problems in and of itself is not a simple task.
Failure management is further complicated by the fact that not all of the information and data concerning a failure is typically saved. Computing devices that have agents running on them, such as servers, can readily generate and export failure report data for review by a RSE. Many network devices, such as firewalls and load balancers, for example, may not have agents and thus other mechanisms are necessary for obtaining failure information.
Further, the layout and configuration of the various network resources, elements and subsystems forming a grid-based system typically are constantly evolving and changing, and network services engineers can be in charge of monitoring and repairing multiple grid-based systems. Thus, it is difficult for a network services engineer to obtain an accurate grasp of the physical and logical configuration, layout, and dependencies of a grid-based-system and its devices when a problem arises. In addition, different RSEs, due to their different experience and training levels, may utilize different diagnostic approaches and techniques to isolate the cause of the same fault, thereby introducing variability into the diagnostic process.
In this regard, conventional mechanisms for identifying, diagnosing and remedying faults in a grid-based system suffer from a variety of problems or deficiencies that make it difficult to diagnose problems when they occur within the grid-based computing system. Many hours can be consumed just by a RSE trying to understand the configuration of the grid-based system alone. Oftentimes one or more service persons are needed to go “on-site” to the location of the malfunctioning computing subsystem or resource in order to diagnose the problem. Diagnosing problems therefore is often time consuming and expensive, and can result in extended system downtime.
When a service engineer of a computing system needs to discover and control diagnostic events and catastrophic situations for a data center, conventionally a control loop is followed to constantly monitor the system and look for events to handle. The control loop is a logical system by which events can be detected and dealt with, and can be conceptualized as involving four general steps: monitoring, analyzing, deducing and executing. In particular, the system or engineer first looks for the events that are detected by the sensors possibly from different sources (e.g., a log file, remote telemetry data or an in-memory process). The system and engineer use the previously established knowledge base to understand a specific event it is investigating. Next, when an event occurs, it is analyzed in light of a knowledge base of information based on historically gathered facts in order to determine what to do about it. After the event is detected and analyzed, one must deduce a cause and determine an appropriate course of action using the knowledge base. For example, there could be an established policy that determines the action to take. Finally, when an action plan has been formulated, it's the executor (human or computer) that actually executes the action.
This control loop process, while intuitive, is nonetheless difficult as it is greatly complicated by the sheer size and complicated natures of grid based computing systems.
Currently, inferential and deductive reasoning algorithms are being explored for use in various applications where root causes need to be inferred based on observed evidence. One such application is fault diagnosis of complex systems based on symptoms reported by various diagnostic monitors employing a variety of techniques to gather and normalize data for inferring the cause of the problem. It would also be useful if such algorithms could be used for performing the reverse analysis. For a known cause, one might want to find a possible resultant condition that can be construed as a “faulted” condition. Environments such as customer problem resolution in the field can benefit from “inter-causal” analysis, so effects of one cause can be determined on one or more other causes or the final result.
Unfortunately, inferential algorithms have not been adequately adapted for use in fault analysis of grid based computing systems. Thus, there remains a need for improved computing methods for assisting service engineers in remotely diagnosing faults, errors, and conditions within a grid-based computing that effectively takes advantage of inferential and deductive reasoning algorithms.