The invention relates generally to monitoring and managing complex data network environments, and is particularly suitable for analyzing and diagnosing problems in an E-business system.
Component integration and other design issues have received a significant amount of attention in network settings, especially those used for electronic business (E-business). In the global communications network referred to as the Internet, portal sites have been created for enabling business-to-business transactions, business-to-consumer transactions, and consumer-to-consumer transactions. Much of the effort has been in the area of integration, so that a single system includes the backend databases used in the ordering procedure, the order fulfillment capability, and the payment processing capability. Another area that has received considerable attention is load balancing to ensure that one component does not act as a "bottleneck" for activity.
FIG. 1 illustrates one possible embodiment of an E-business system. To ensure redundancy, the system uses multiple Internet Service Providers (ISPs) 10, 12, and 14 to connect to the Internet. An access router 16 manages the connectivity to the ISPs. At least one load balancer 18 is responsible for receiving user requests via the ISPs and directing the requests to one of the available web servers 20, 22 and 24 used by the system. The web servers forward the incoming requests to the appropriate E-business applications. The E-business applications execute on middleware platforms commonly referred to as application servers 26 and 28. A firewall 30 is used to provide security.
The application servers 26 and 28 enable a number of features from which different applications can benefit. These features include optimization of connections to database servers 32, 34 and 36, caching of results from database queries, and management of user sessions. Data that is indicative of user information, a catalog of goods, pricing information, and other relevant information for the E-business system is stored in the database servers and is available for access by the application components. To process payments for goods or services by users, the system maintains connections to at least one remote payment system 38. Links to shipping agencies 40 are also provided, so as to enable the E-business system to forward the goods for shipping as soon as an order is satisfied.
Also shown in FIG. 1 are a Domain Name Service (DNS) server 42, a Wireless Application Protocol (WAP) server 44, and a Lightweight Directory Access Protocol (LDAP) server 45. As is known in the art, the DNS server is accessed to provide users with the Internet Protocol (IP) address that corresponds to a requested domain name. The WAP server may be used for front-ending applications accessed via wireless devices such as mobile phones and Personal Digital Assistants (PDAs), while the LDAP server is used for storing and retrieving information in a directory format.
As compared to the emphasis on design issues of the E-business system, monitoring and managing issues for such systems have received significantly less attention. Many systems are managed using ad-hoc methods and conventional server and network monitoring systems, which are not specifically designed for an E-business environment. As a result, the monitoring capabilities are limited.
Since the business applications of a system rely on application servers for their operation, the application servers 26 and 28 are in a strategic position to be able to collect a variety of statistics regarding the health of the E-business system. The application servers can collect and report statistics relating to the system's health. Some of the known application servers also maintain user profiles, so that dynamic content (e.g., advertisements) generated by the system can be tailored to the user's preferences, as determined by past activity. However, to effectively manage the system, monitoring merely at the application servers is not sufficient. All the other components of the system need to be monitored and an integrated view of the system should be available, so that problems encountered while running the system (e.g., a slowdown of a database server or a sudden malfunction of one of the application server processes) can be detected at the outset of the problem. This allows corrective action to be initiated and the system to be brought back to normal operation.
FIG. 2 illustrates monitoring components as used with the E-business system of FIG. 1. The core components for monitoring include a manager 46, internal agents 48, 50 and 52, and one or more external agents 54. The manager of the monitoring system is a monitoring server that receives information from the agents. The manager can provide long-term storage for measurement results collected from the agents. Users can access the measurement results via a workstation 56. For example, the workstation may be used to execute a web-based graphical user interface.
As is known in the art, the agents 48, 50, 52 and 54 are typically software components deployed at various points in the E-business system. In FIG. 2, the internal agents are contained within each of the web servers 20, 22 and 24, the application servers 26 and 28, and the LDAP server 45. By running pseudo-periodic tests on the system, the agents collect information about various aspects of the system. The test results are referred to as "measurements." The measurements may provide information, such as the availability of a web server, the response time experienced by requests to the web server, the utilization of a specific disk partition on the server, and the utilization of the central processing unit of a host. Alternatively, tests can be executed from locations external to the servers and network components. Agents that make such tests are referred to as external agents. The external agent 54 is shown as executing on the same system as the manager 46. As previously stated, the manager is a special monitoring server that is installed in the system for the purpose of monitoring the system. The external agent 54 on the server can invoke a number of tests. One such test can emulate a user accessing a particular website. Such a test can provide measurements of the availability of the website and the performance (e.g., in terms of response time) experienced by users of the website. Since this test does not rely upon any special instrumentation contained within the element being measured, the test is referred to as a "black-box test."
Often, it is more efficient to build instrumentation into the E-business elements and services. For example, database servers 32, 34 and 36 often support Simple Network Management Protocol (SNMP) interfaces, which allow information to be obtained about the availability and usage of the database server. An external agent, such as agent 54, may execute a test that issues a series of SNMP queries to a particular database server to obtain information about the server's health. Since such a test relies on instrumentation built into the database server, tests of this type are referred to as "white-box tests."
External agents 54 may not have sufficient capability to completely gauge the health of an E-business system and to diagnose problems when they occur. For example, it may not be possible to measure the central processing unit utilization levels of a web server from an external location. To accommodate such situations, the monitoring system can use the internal agents 48, 50 and 52.
In the presently available manager-agent architectures for network monitoring, each measurement is associated with a state. The term "state" is defined herein as being synonymous with "health." The state of a measurement is computed by comparing the results of the measurement with pre-specified thresholds. When a measurement exceeds its threshold, the state of the measurement is changed to indicate that a problem has occurred and an alarm is generated to the user. The alarm may be displayed on a separate window of the user interface run at the workstation 56. Alternatively, an e-mail or pager message can be automatically generated to alert the user of the problem.
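The threshold comparison described above can be illustrated with a minimal Python sketch. The `Measurement` class, the two state labels, the alarm text, and the sample values are assumptions made for illustration only; they are not taken from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Measurement:
    """One test result together with its pre-specified threshold (hypothetical type)."""
    name: str
    value: float
    threshold: float


def compute_state(m: Measurement) -> str:
    """Return "bad" when the value exceeds its threshold, otherwise "good"."""
    return "bad" if m.value > m.threshold else "good"


def generate_alarms(measurements: List[Measurement]) -> List[str]:
    """Build one alarm message per measurement whose state is "bad"."""
    return [
        f"ALARM: {m.name} = {m.value} exceeds threshold {m.threshold}"
        for m in measurements
        if compute_state(m) == "bad"
    ]


# Illustrative values only.
samples = [
    Measurement("web_response_time_s", 4.2, 2.0),
    Measurement("cpu_utilization_pct", 35.0, 90.0),
]
for alarm in generate_alarms(samples):
    print(alarm)
```

Note that each alarm in this sketch is raised from a single measurement in isolation, with no knowledge of the other measurements.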
To facilitate problem diagnosis, some monitoring systems use the notion of a "service model." The service model is a tree-structured representation of the various components of a service that is provided by the system, as well as all of the interdependencies among the components. Within a representation of the service model, each host, process and server is indicated as a node. The different nodes are interconnected based on the node interdependencies. For example, a node representing a web server process may be connected to a web service node, since the state of the web server process node affects the web service node. According to this model, the state of the web server process node is determined on the basis of measurements associated with the node. In turn, the state of the web service node is determined on the basis of the state of the web server process node. A user must manually walk through the service model to determine the source of a problem.
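As a rough illustration of how state propagates in such a service model, the following Python sketch builds a two-node tree and performs the top-down walk that a user would otherwise carry out by hand. The `ServiceNode` class, its fields, and the node names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ServiceNode:
    """A host, process, server or service in a tree-structured service model."""
    name: str
    own_bad: bool = False  # state derived from measurements attached to this node
    children: List["ServiceNode"] = field(default_factory=list)

    def is_bad(self) -> bool:
        # A node is bad if its own measurements are bad or any node it depends on is bad.
        return self.own_bad or any(child.is_bad() for child in self.children)


def bottom_most_problem(node: ServiceNode) -> Optional[ServiceNode]:
    """Walk the tree top-down, following bad children, and return the deepest bad node."""
    if not node.is_bad():
        return None
    for child in node.children:
        if child.is_bad():
            return bottom_most_problem(child)
    return node


# The web service node depends on the web server process node.
web_server_process = ServiceNode("web server process", own_bad=True)
web_service = ServiceNode("web service", children=[web_server_process])

culprit = bottom_most_problem(web_service)
print(culprit.name if culprit else "no problem found")
```

The sketch follows only the first bad child at each level, so it reports a single path through the tree even when several branches are in a bad state.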
There are a number of concerns with the known approaches for monitoring and managing a data network, such as an E-business system. One concern is that problems are typically reported individually. That is, alarms are generated based on individual measurements. Since there are numerous dependencies among processes, network elements and applications, a single problem in an E-business environment can result in several related alarms being generated. For example, a slowdown in one of the database servers 32, 34 and 36 of a website can result in more connections accumulating at the web application servers 26 and 28. In turn, the web application servers can cause the web servers 20, 22 and 24 to slow down. Ultimately, the slowdown of the database server can result in a denial of access to the website. As is evident from this example, when a problem occurs, the user of the monitoring system can be presented with a large number of alarms. The user must wade through these alarms and correlate them manually in order to identify the cause of the problem or problems, which may be time intensive. Moreover, a detailed understanding of the topology of the system is required in order to determine the location of the root cause of the problem.
The service model approach attempts to assist the manual diagnosis. By walking the service model graph in a top-down fashion, the user can determine the bottom-most problem node. The main drawback of the service model approach is that it uses a hierarchical approach to diagnosis. Therefore, the peer relationships that exist in many E-business environments (e.g., two websites may be hosted on the same web server, so that the two websites are peers to one another) must be cast into a hierarchical relationship. Human operators may struggle to map the two-dimensional topology model to the uni-dimensional service model, and may therefore find it difficult to use service models to comprehend the causes of problems.
What is needed is a method that simplifies and/or automates the process of identifying a root cause of an operational problem in a network environment, such as an E-business system.
A network monitoring method includes storing topology information and mapping information that allow root causes of network problems to be efficiently ascertained. In one embodiment of the invention, network health is monitored using a web-based user interface that enables navigation of health conditions of network components and protocol layers. In another embodiment, the topology information and mapping information are employed to enable automated correlation between detected "bad" states and root causes, as well as automated prioritization of generated alerts for notifying personnel of detected "bad" states.
As one key to the correlation methodology, a physical topology representation is generated. The physical topology is a mapping of the interconnections among network components. The interconnections represent physical connections and logical dependencies among the components. Preferably, at least some of the interconnections are associated with a direction that signifies a cause-and-effect relationship between the connected components.
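One way to picture such a physical topology is as a directed graph whose edges point from cause to effect. The Python sketch below is illustrative only: the `PhysicalTopology` class, its method names, and the component labels (which borrow the reference numerals of FIG. 1) are assumptions, not the stored representation used by the invention.

```python
from collections import defaultdict


class PhysicalTopology:
    """Directed graph of components; an edge points from a cause to what it can affect."""

    def __init__(self):
        self._effects = defaultdict(set)  # cause -> set of directly affected components

    def add_dependency(self, cause, effect):
        """Record that a problem at `cause` can produce a problem at `effect`."""
        self._effects[cause].add(effect)

    def affected_by(self, component):
        """Return every component reachable from `component` along cause-to-effect edges."""
        seen = set()
        stack = [component]
        while stack:
            current = stack.pop()
            for nxt in self._effects.get(current, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen


topology = PhysicalTopology()
topology.add_dependency("database server 32", "application server 26")
topology.add_dependency("application server 26", "web server 20")

print(sorted(topology.affected_by("database server 32")))
```

Storing the direction on each edge is what lets a later correlation step distinguish a likely cause from its downstream effects.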
In the preferred embodiment, a logical topology representation is also generated, since the physical topology does not consider websites. The logical topology maps each website to the components which support the website. The logical topology maps a website to at least one web server, with the website inheriting the physical topology interconnections of the web server. Thus, the logical topology of a particular website is a subset of the physical topology. As is known in the art, a website offers one or more services to users who access the website. The various services that are available via a website are referred to herein as "transactions."
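The inheritance of physical interconnections by a website can be sketched as follows. This is one possible reading, under the assumption that the physical topology is held as (cause, effect) pairs and that the components supporting a website are those found by following the edges backward from its web servers. The function name, the website name, and the component labels are hypothetical.

```python
def supporting_components(website, website_to_servers, edges):
    """Return the web servers of `website` plus every component they depend on.

    `edges` is an iterable of (cause, effect) pairs from the physical topology.
    """
    # Index the edges by effect so they can be followed backward toward causes.
    causes_of = {}
    for cause, effect in edges:
        causes_of.setdefault(effect, set()).add(cause)

    support = set(website_to_servers[website])
    stack = list(support)
    while stack:
        current = stack.pop()
        for cause in causes_of.get(current, ()):
            if cause not in support:
                support.add(cause)
                stack.append(cause)
    return support


physical_edges = [
    ("database server 32", "application server 26"),
    ("application server 26", "web server 20"),
    ("database server 34", "application server 28"),
    ("application server 28", "web server 22"),
]
sites = {"shop.example": ["web server 20"]}

print(sorted(supporting_components("shop.example", sites, physical_edges)))
```

The result contains only components that already appear in the physical topology, consistent with the logical topology being a subset of it.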
A hierarchy of protocol layers is identified, with the hierarchy being based on interdependencies among the protocol layers with regard to implementing functions. That is, the protocol layers are related to component functionalities and are ranked according to functionality-to-functionality dependencies for implementation. As one example, a web transaction layer is dependent upon support from a website layer, which is dependent upon support from a web server layer.
Each network component is mapped to the protocol layers on the basis of the functionalities of the network component. Moreover, measurements from various available network tests are mapped to the protocol layers on the basis of relationships between the measurements and the health of the protocol layers. "Health" will be used herein as being synonymous with the operational state of the component, protocol layer, website, transaction, or measurement which it modifies.
The health of the data network can be monitored by utilizing the collection of topology information and mapping information. In one embodiment, the monitoring is performed using a web-based user interface that displays health conditions of the components and the protocol layers of the components, using the mapping of the measurements to the protocol layers as a basis for displaying the health conditions. The web-based user interface enables navigation through the information that is indicative of the present operating states of the components, the present operating states of the websites, the present operating states of the protocol layers as mapped to the components, and the present states of the measurements. A user of the methodology is able to "drill down" to the root cause of a problem by navigating through the "bad" states of the components, websites, protocol layers, transactions, and measurements.
In another embodiment, automatic correlation generates alerts regarding the "bad" states. The automated correlation process includes prioritizing the alerts on a basis of identifying a root cause of a problem that resulted in one or more measurements being determined to be undesirable. A user of the process is presented with a display that indicates the priority of the alerts.
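The mapping of measurements to protocol layers might be sketched as shown below. Only the three layer names come from the example given above; the measurement names, the mapping table, and the helper functions are assumptions made for illustration.

```python
# Layers ordered from the lowest dependency to the highest, per the example above.
LAYERS = ["web server", "website", "web transaction"]

# Hypothetical mapping of measurement names to the layer whose health they reflect.
MEASUREMENT_TO_LAYER = {
    "process_running": "web server",
    "site_availability": "website",
    "login_response_time": "web transaction",
}


def layer_states(measurement_states):
    """Derive a state per layer: a layer is "bad" if any measurement mapped to it is "bad"."""
    states = {layer: "good" for layer in LAYERS}
    for name, state in measurement_states.items():
        if state == "bad":
            states[MEASUREMENT_TO_LAYER[name]] = "bad"
    return states


def lowest_bad_layer(states):
    """Return the lowest layer in the hierarchy that is "bad", or None if all are good."""
    for layer in LAYERS:
        if states[layer] == "bad":
            return layer
    return None


observed = {
    "process_running": "good",
    "site_availability": "bad",
    "login_response_time": "bad",
}
print(lowest_bad_layer(layer_states(observed)))
```

Because higher layers depend on lower ones, the lowest bad layer is a reasonable first place to look when several layers of the same component report problems.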
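A simplified sketch of alert prioritization is given below. It assumes that the topology is available as (cause, effect) pairs and uses a deliberately simple rule: a component in a "bad" state receives a high-priority alert only when none of its direct causes is also "bad". The function name, the priority labels, and the rule itself are illustrative assumptions rather than the correlation procedure of the invention.

```python
def prioritize_alerts(bad_components, edges):
    """Return (component, priority) pairs with likely root causes listed first.

    `edges` is an iterable of (cause, effect) pairs. Only direct causes are
    examined, which is a simplification made for this sketch.
    """
    bad = set(bad_components)

    # Index the edges by effect so each component's direct causes can be looked up.
    causes_of = {}
    for cause, effect in edges:
        causes_of.setdefault(effect, set()).add(cause)

    alerts = []
    for component in sorted(bad):
        has_bad_cause = any(c in bad for c in causes_of.get(component, ()))
        alerts.append((component, "low" if has_bad_cause else "high"))

    # Stable sort: high-priority alerts come first, alphabetical order is kept otherwise.
    alerts.sort(key=lambda alert: alert[1] != "high")
    return alerts


edges = [
    ("database server", "application server"),
    ("application server", "web server"),
]
for component, priority in prioritize_alerts(
    ["web server", "application server", "database server"], edges
):
    print(priority, component)
```

In the database slowdown scenario discussed earlier, this rule would mark the database server alert as high priority and demote the alerts for the application server and web server, which are downstream effects.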