1. Field of the Invention
The present invention relates, in general, to data analysis and representation methods and related computing operations for remotely diagnosing faults, errors, and conditions within a computing system containing various devices and network resources. More particularly, the present invention relates to a data processing and visualization methodology that assists administrators of such computing systems in diagnosing and addressing faults, errors and other conditions within the computing system.
2. Relevant Background
Modern computing utilizes system software, middleware, and networking technologies to combine independent computers and subsystems into a logically unified system. Contemporary complex computing systems are composed of computing devices, resources, and subsystems that are interconnected by standard networking technology. While comprised of many individual computing resources, a networked computing system may be utilized as a single computing system. Computing resources within a system each can be configured, managed and used as part of the larger network, as independent systems, or as a sub-network. Typically, individual subsystems and resources of the network system, and commercial scale systems in particular, are not fixed as the overall configuration of the system may change over time. Computing system resources can be added or removed from the computing system, moved to different physical locations within the system, or assigned to different groupings or farms at any time. Such changes can be regularly scheduled events, the results of long-term planning, or virtually random occurrences. Examples of devices in a network system may include, but are not limited to, load balancers, firewalls, servers, network attached storage, and Ethernet ports, and other resources of such a system include, but are not limited to, disks, VLANs, subnets, and IP Addresses.
Computing systems and networking have made possible on-demand computing practices whereby one group of computer users of a network working with bandwidth-heavy applications may be allocated bandwidth while bandwidth is likewise diverted away from other users of the network who do not need the bandwidth at that moment. Third-party utility computing providers outsource computing resources in on-demand fashion (such as external server farms) to provide the extra boost of resources on-demand to clients for a pre-set fee. Generally, the operator of such a utility computing facility must track certain events (e.g., usage, etc.) to determine fees. These types of events are primarily intended for use by the computing system in billing its end users at the usage-based rate. In particular, this is how the provider of a utility computing server farm obtains income for the use of its hardware. Advantages associated with on-demand computing systems include increased utilization of computing resources, cost-sharing (splitting resources in an on-demand manner across multiple users), and improved management of system subsystems and resources.
Additionally, such complex contemporary computing systems also must monitor events that represent failures in the computing system for users. For example, most complex computing systems are redundant or “self-healing” such that when a device fails it is replaced automatically by another device to meet the requirements for the end user. Therefore, computing bandwidth is almost always available. While the end user may not experience any negative impact upon computing effectiveness, it is nevertheless necessary for service engineers of the computing system to examine a device that has exhibited failure symptoms. In particular, a service engineer may need to diagnose and identify the root cause of the failure in the device (so as to prevent future problems), to fix the device remotely and to return the device back to the computing system's resource pool.
Management of such complex computing systems is not an easy task. The devices and resources of a system can be geographically distributed within a single large building, or alternatively distributed among several facilities spread nationwide or globally. Typically, service engineers in an operations center spend a large portion of their time fixing problems associated with the events as opposed to considering and diagnosing the systematic issues that may cause the problem. Thus, the act of accumulating failure data with which to diagnose and address fault problems in and of itself is not a simple task.
Current network and systems management tools typically represent event data, including failure or error events, chargeable events, and other monitored events, to network service engineers in table form or encode those events by the color of an icon on a map, which map in turn depicts the various devices and resources in the network and the physical connectivity between devices and resources. The network service engineers then use the information provided by these management tools to identify and/or diagnose problems within their network and to direct the efforts of on-site service personnel in repairing suspected problems creating the events. Some of these management tools have correlation engines that apply logic trees to assist the network service engineers in diagnosing root causes of the events, but such engines function in a deterministic manner whereby all underlying relationships between the devices must be known. The correlation engines provide service engineers with an automatically generated table or list of one or more events with associated potential diagnoses of the root causes of particular events.
Network service engineers can be in charge of monitoring and repairing multiple systems, making network management tools with correlation engines attractive. Unfortunately, correlation engines, while intriguing in theory, often fall short in reality because the layout and configuration of the various network resources, elements and subsystems forming a complex network system typically are constantly evolving and changing. In addition to detailed event data, such correlation engines must be provided with detailed configuration data describing accurately the physical and logical configuration, layout, and dependencies of the network system and its devices at the time of the event in order to operate properly. To keep the correlation engine in working order, engineers must monitor changes within elements of a network not only to make certain that the remedial actions won't cause problems with the operation of the network, but also to ensure that a failure to provide updated configuration information to the system management tool will not “break” the correlation engine. Unfortunately, it is difficult to maintain the input data because many engineers can make various changes to the network without the knowledge or permission of those engineers maintaining the input and configuration data. In such cases, correlation algorithms that are supposed to simplify the jobs of network service engineers can actually consume comparatively more of the service engineers' time.
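By way of illustration only, the dependence of a deterministic correlation engine on accurate configuration data can be sketched as follows. The sketch is not taken from any particular management tool: the function `root_causes`, the device names, and the dependency maps are hypothetical, and only direct (one-level) dependencies are considered for brevity.

```python
from typing import Dict, List, Set


def root_causes(failed: Set[str], depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return the failed devices that have no failed direct dependency.

    Hypothetical rule: a failed device whose upstream device also failed is
    treated as a symptom, not a root cause.
    """
    roots = set()
    for device in failed:
        upstream = depends_on.get(device, [])
        if not any(u in failed for u in upstream):
            roots.add(device)
    return roots


# Accurate configuration: both servers sit behind load balancer lb1.
current_config = {"web1": ["lb1"], "web2": ["lb1"], "lb1": []}

# Stale configuration: web2 was moved behind lb1, but the map still
# records its old dependency on lb2 (assumed scenario).
stale_config = {"web1": ["lb1"], "web2": ["lb2"], "lb1": [], "lb2": []}

failed_devices = {"lb1", "web1", "web2"}

accurate_diagnosis = root_causes(failed_devices, current_config)
stale_diagnosis = root_causes(failed_devices, stale_config)
```

With the accurate map, only `lb1` is reported as a root cause. With the stale map, `web2` is also reported, because the engine no longer knows that `web2` depends on the failed load balancer.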
Failure management also is often complicated by the fact that not all of the information and data concerning a failure may be saved. Computing devices that have agents running on them, such as servers, can readily generate and export failure report data for review by a management system or service. Many network devices, such as firewalls and load balancers, for example, may not have agents and thus other mechanisms are necessary for obtaining failure information. Additionally, such failure information, even when it is supposed to be tracked, may be lost in conjunction with a failure.
Event displays currently provided by most contemporary network management tools typically utilize a table format that can be sorted by different fields. In essence, the table comprises an extensive list of all events within the network, and includes information such as an event ID, time, type, severity, associated devices or subsystems, and the like. The sheer number of events represented in these tables makes them unsuitable for aiding network service engineers in determining root causes for events or otherwise identifying common underlying factors or problems. In essence, the tables by themselves provide too much information to be useful to an unaided human in identifying and diagnosing systematic problems within the network.
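A minimal sketch of such a sortable event table is given below for illustration. The record layout mirrors the fields named above, but the class `Event`, the helper `sort_events`, the severity scale, and the sample rows are all hypothetical and are not drawn from any actual management tool.

```python
from dataclasses import dataclass
from operator import attrgetter


@dataclass(frozen=True)
class Event:
    event_id: int
    time: str        # ISO 8601 timestamp, so string order matches time order
    event_type: str
    severity: int    # assumed scale: a higher number is more severe
    device: str


# Made-up sample rows; a real table would hold thousands of entries.
events = [
    Event(1, "2024-01-01T10:00:00", "link_down", 3, "fw1"),
    Event(2, "2024-01-01T10:00:05", "timeout", 2, "web1"),
    Event(3, "2024-01-01T09:59:58", "disk_error", 3, "nas1"),
]


def sort_events(rows, field, descending=False):
    """Return the rows ordered by one named field, as a table view would."""
    return sorted(rows, key=attrgetter(field), reverse=descending)


by_time = sort_events(events, "time")
by_severity = sort_events(events, "severity", descending=True)
```

Sorting reorders the rows but does not relate them to one another: nothing in either view indicates whether the three events share a common cause, which is the limitation described above.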
Conversely, some network management tools utilize event displays in which graphical topology maps are used to encode event severity. The nested nature commonly employed in such maps hides the underlying causes of the events, making the root cause unclear. Graphical maps that represent services greatly reduce this issue, but there again intimate knowledge of the underlying relationships of devices and network elements to services is needed for the approach to be effective.
In this regard, conventional mechanisms for tracking, reporting, identifying, diagnosing and remedying faults in a complex computing system suffer from a variety of problems or deficiencies that make it difficult to diagnose problems when they occur within the computing system. Current network and systems management tools merely relay event information in table format or encode event information within an object in a topology map, providing little to no insight regarding possible device and resource problems causing the events. Correlation algorithms in theory may be used to help predict likely root causes, which are then also displayed to a network service engineer. However, many hours can be consumed by a service engineer merely trying to understand or keep track of the configuration of the system just to maintain the correlation engine in working order. Furthermore, in any event, once problems are detected it is oftentimes necessary for one or more service persons to go “on-site” to the location of the malfunctioning computing subsystem or resource in order to further diagnose and/or remedy the problem. Diagnosing fault events and other like problems therefore is often time consuming and expensive, and can result in extended system downtime.
Thus, there remains a need for improved computing methods for remotely reporting and diagnosing faults, errors, and other event conditions within a complex computing system.