1. Field of the Invention
The present invention generally relates to diagnostic tools for distributed computer networks and, more particularly, to a diagnostic system that uses advanced statistical methodology to build a predictive model of future behavior of a battery of probes that aids in detecting an upcoming network problem and in detecting the presence of longer term trends in the data.
2. Background Description
In the course of normal operation, a distributed computer network produces a large amount of data related to various nodes, such as servers and routers. In addition, a series of probes is deployed to submit pre-assigned tasks to various nodes of the network, and the response times of these tasks are recorded.
Managing the delivery of information services in distributed computing environments continues to be a tremendous challenge, even after more than a decade of experience.
Nevertheless, understanding of Information Technology (IT) service delivery has improved. Initially, only isolated measurements of infrastructure components, such as servers, hubs and routers, were available. Incrementally, component measurement methods have increased in sophistication, and in general meet the reporting and diagnostic requirements of their respective support groups.
However, user experience can be at variance with reported component reliability and availability. A technology was developed (in U.S. Pat. No. 6,070,190 to Reps et al., assigned to the present assignee) to answer the question: What is the customer experience? This is achieved by a client based application program monitor, using application probe software residing at the client, which records information related to the performance for the client of application services residing on a server in a distributed environment.
When a distributed environment is relatively simple, these data and analyses are sufficient to manage the environment. However, in an increasing number of cases found particularly among leaders in the use of electronic business platforms, additional management sophistication is required to gain practical understanding of the operation of the system.
Complexity arises from many sources: rapidly changing technologies; new software applications; changes in business models; and geographically dispersed employees. For example, business applications can depend on widely dispersed servers, back-end systems, and wide and local area networks. Physical separation increases the dependency on Wide Area Network (WAN) technology. WAN traffic is growing exponentially and is handled with a large variety of transport methods, which must synchronize perfectly in order to receive bits at one point and deliver them reliably to another. Economics can drive provider organizations to place more than one application on a given server, each application having its own traffic rhythms and demands on the CPU, memory and I/O subsystems.
Global corporations are becoming ever more dependent upon a continuously available, well behaved suite of applications which underlie electronic commerce (e-commerce). Existing management systems can look backward to show what has happened, or look at the real-time environment, to show what is happening now. The new management capability required will need to examine selected information from multiple sources and vantage points, perform the necessary analyses and, based on these indicators and in conjunction with historical data, predict (in probabilistic terms) when the system is likely to experience performance degradation. Prediction is a new key capability, for it provides time to react and in the best case avoid any user-visible impact.
In the prior art, the state of the distributed system is evaluated based on its response to a stream of work generated by a battery of end-to-end transactions or probes. Such transactions can, for example, open a remote file or database or send an e-mail message. The present invention refers to the system responsible for system probing as EPP (End-to-end Probe Platform). Every time a transaction is executed, it returns two numbers: the time it took to complete the job and the return code (used to record the presence of exception conditions). The key task of system monitoring is to detect unfavorable changes in the state of the system. In the setting of the present invention, it would be desirable to predict the future behavior of the EPP transactions, which represent end-user experience. If the predicted behavior is unsatisfactory, the monitoring system would then issue an alarm.
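By way of illustration only, the two numbers returned by a probe transaction and the alarm decision might be sketched as follows. This is a minimal sketch and not the EPP implementation: the names ProbeResult, run_probe and should_alarm, the convention that a return code of 0 means success, and the simple fixed-threshold comparison are all assumptions introduced here for clarity.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    """The two numbers a probe transaction returns (names are hypothetical)."""
    elapsed_seconds: float  # time taken to complete the job
    return_code: int        # assumed convention: 0 = success, nonzero = exception condition


def run_probe(transaction: Callable[[], object]) -> ProbeResult:
    """Execute one end-to-end transaction and record its time and return code."""
    start = time.perf_counter()
    try:
        transaction()
        code = 0
    except Exception:
        # Any exception condition is recorded as a nonzero return code (assumption).
        code = 1
    elapsed = time.perf_counter() - start
    return ProbeResult(elapsed_seconds=elapsed, return_code=code)


def should_alarm(predicted_seconds: float, threshold_seconds: float) -> bool:
    """Issue an alarm when the predicted response time exceeds the threshold.

    How the prediction and the threshold are obtained is outside this sketch.
    """
    return predicted_seconds > threshold_seconds
```

A caller would wrap a transaction such as opening a remote file in a function, pass it to run_probe, and feed a separately computed prediction to should_alarm.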
There are a number of tools used to monitor network performance that will collect data from various parts of the network and present the data on charts in real time. U.S. Pat. No. 6,359,976 “System and method for monitoring service quality in a communications network” to Kallyanpur et al. describes a system that assigns individual monitors to messages in a network, compiles the results related to individual calls into an historic report of service quality, and issues an alert based on this report. No probing or predictive statistical modeling is implemented.
The concept of probes has also been implemented in a number of settings. For example, the product Dynamic Access Network Performance Manager marketed by 3Com Corp. enables one to obtain real time summaries of the state of various servers, web sites and applications. Another product, Netuitive 5.0, uses the concept of Adaptive Correlation Engine (ACE, as described in U.S. Pat. No. 5,835,902 to Jannarone) to dynamically and continuously correlate the impacts of demand metrics (e.g. hits per second on a web page) and utilization metrics (e.g. percent of CPU utilization) with response time. U.S. Pat. No. 6,446,123 “Tool for monitoring health of networks” to Ballantine et al. describes a system that predicts fail times of network components and issues alarms if a predicted behavior of a component is unsatisfactory. The system does not implement active probing and the alerts it issues are not based on predicted behavior of probes that reflect user experience. U.S. Pat. No. 6,327,620 “Methods and apparatus for collecting, storing, processing and using network traffic data” to Tams et al. uses probes planted to monitor network traffic and introduces table formats to record such data. No state of stations assessment or predictive modeling is used. U.S. Pat. No. 6,363,056 “Low overhead continuous monitoring of network performance” to Beigi et al. describes a system for monitoring a communication network based on selecting every N-th packet of transmitted information, copying it and using it as a probe. Monitoring is implemented based on comparing the number of probes received with the number of probes sent.
A number of other approaches have been described in the prior art. U.S. Pat. No. 6,061,722 to Lipa et al. focuses on communications between a set of clients and a set of servers. The servers send “pings” to clients and use the response time to assess whether the communication paths between clients and servers are clogged. Servers also obtain information on whether a client is running any client-based process not on a list of permitted processes. U.S. Pat. No. 5,822,543 to Dunn et al. also focuses on performance monitoring of communication networks by attaching a “timing script” file to messages sent to network nodes that trigger return messages. The times that the messages take to move from one node to another are used as a basis to judge the network performance. U.S. Pat. No. 6,502,132 “Network monitoring system, monitoring device and monitored device” to Kumano et al. describes a system in which a monitoring device is connected to a plurality of monitored devices that send their status summaries to the monitoring device at the request of the latter. No probing or predictive modeling is implemented.
U.S. Pat. No. 6,574,149 “Network monitoring device” to Kanamaru et al. describes a system that contains nodes that send packets of information to neighboring nodes so as to detect whether some nodes have broken away from the system. Here the emphasis is mostly on integrity of the network, not on performance. U.S. Pat. No. 6,470,385 “Network monitoring system, monitored controller and monitoring controller” to Nakashima et al. describes a system in which network devices are connected to a plurality of monitoring stations. This point-to-multipoint connection passes through a broadcast unit that serves as a branching point of multiple connections and is responsible for transmitting information on the status of individual stations toward a plurality of monitoring stations in the system. No probing or predictive statistical modeling is used. U.S. Pat. No. 6,560,611 “Method, apparatus and article of manufacture for a network monitoring system” to Nine et al. describes a system in which tasks are sent to various nodes in the network to establish whether a problem exists, and a service ticket is automatically opened against a node with a problem. This system does not use the internal information related to operation of nodes and does not involve predictive modeling. U.S. Pat. No. 6,055,493 “Performance measurement and service quality monitoring system and process for an information system” to Ries et al. discusses a monitoring system that is based on reports that are issued periodically based on indicators obtained via the proposed process of data homogenization. There is no active probing; instead, the proposed system uses polling that retrieves status information from network nodes. There is no predictive modeling or automated signal triggering mechanism.
U.S. Pat. No. 6,278,694 “Collecting and reporting monitoring data from remote probes” to Wolf et al. is focused on monitoring network traffic. Here probes are defined as nodes in the network that collect measurements of the traffic flowing through them. U.S. Pat. No. 6,076,113 “Method and system for evaluating user-perceived network performance” to Ramanathan et al. is based on measuring user experience with respect to network throughput; it does not involve computing nodes or predictive modeling. U.S. Pat. No. 5,987,442 “Method and apparatus for learning network behavior to predict future behavior of communications networks” to Lewis et al. involves modeling a network as a state transition graph. The nodes of this graph represent network states and are categorized as “good” or “bad”. The system can issue an alert if a state of the graph is predicted as “bad”. It does not involve active probing. U.S. Pat. No. 6,587,878 “System, method, and program for measuring performance of network system” to Merriam et al. describes a system that obtains network data and uses this information to estimate performance time of a hypothetical device at a network address.
Prior art focusing on communication (i.e. not computer) networks includes U.S. Pat. No. 5,049,873 to Robins et al. and U.S. Pat. No. 6,026,442 to Lewis et al. U.S. Pat. No. 6,070,190 to Reps et al. describes a system of probes for response monitoring and reporting in distributed computer networks; however, the state of the network is assessed based exclusively on probe responses. U.S. Pat. No. 5,542,047 to Armstrong et al. introduces a monitoring system based on circulating a status table. The states of the nodes are recorded in this table and assessed individually. No probing is implemented. U.S. Pat. No. 5,627,766 to Beaven introduces a scheme by which every network node sends test messages to every neighboring node. Injected test messages propagate on their own in the network, and the collection of recorded times is used to determine bottlenecks. U.S. Pat. No. 5,968,124 to Takahashi et al. introduces a system for collection of information produced by sub-networks and, using this information, the current state of the whole network is assessed. Another patent of this type is U.S. Pat. No. 5,432,715 to Shigematsu et al., which discloses a computer system and monitoring method for monitoring a plurality of computers interconnected within a network. Each computer in the network has a self-monitoring unit for monitoring its own computer and acquiring a monitor message, and a transmitting unit for transmitting the monitor message to a monitoring computer. U.S. Pat. No. 6,556,540 “System and method for non-intrusive measurement of quality in a communication network” to Mawhinney et al. is based on transmitting patterns of data from transmitting stations and analyzing them at receiving stations to detect distortions in the patterns. U.S. Pat. No. 5,974,237 “Communications network monitoring” to Shumer et al. describes a system for monitoring based on data for individual network nodes. No probing or predictive statistical modeling is implemented.
Other prior art focuses on graphical systems for displaying the network status. U.S. Pat. No. 5,768,614 to Takegi et al. discloses a monitored state display unit for a monitoring system which comprises event state information processing means that requests a collecting device to collect event information and gives instructions as to the screen display method according to the state on receipt of the response notification from the collection request or the state change information from the collecting device. U.S. Pat. No. 5,463,775 to DeWitt et al. discloses a graphical resource monitor which depicts, in real time, a data processing system's internal resource utilization. This patent also focuses on reducing the impact of data collection activity on the performance of the system. U.S. Pat. No. 5,742,819 to Caccavale introduces a system for adjusting server parameters based on workload to improve its performance. Another system of this type is described in U.S. Pat. No. 5,793,753 to Hershey et al., which discloses a system in which programmable probes are sent from a system manager to individual workstations on the network. The probes contain programs that run on the target workstation to establish its state and change its configuration, if needed.
However, none of these prior art approaches produces a predictive model using routinely provided network data responsive to a battery of probes, the battery being designed to reflect user experience, where the object is to predict the future behavior of this battery of probes, with thresholds set in such a manner that false alarms can be limited to a predictably low rate, to anticipate upcoming degradation in the network as measured by future performance of the battery of probes.