The present invention, generally, relates to network communication methods, systems and computer program products and, more particularly, to systems, methods and computer program products for performance testing of computer networks.
Companies are often dependent on mission-critical network applications to stay productive and competitive. To achieve this, information technology (IT) organizations preferably provide reliable application performance on a 24-hour, 7-day-a-week basis. In this demanding environment, frequent network expansion and day-to-day fire fighting often leave little time for IT managers to manage network performance proactively. Accordingly, IT managers typically rely on some form of system management and/or network management tools to help automate performance management tasks to increase the efficiency of the IT staffs. Even with these tools, IT staffs are typically required to commit resources to integrating and customizing the tools to work in the heterogeneous network environments which may include, for example, hardware from a variety of vendors, executing a variety of operating systems and communication protocols and supporting a variety of application programs for different end user requirements.
Various known system management products track specific information, such as the CPU utilization on a server, server paging and disk access activity, or client application response time. This information may be useful when solving problems on a network. These products can generally be categorized as passive systems or application monitors. This category typically is implemented as software agent technologies that reside on the client or server computers. They generally passively monitor live application transactions and monitor resource utilization. Products in this category include Patrol from BMC Software, Inc., FirstSense Agent from FirstSense Software, Inc., VitalAgent from INS, Luminate Software Corp., and Envive Corp. As they are passive application monitors, they typically support specific application programs. For example, Luminate Software and Envive support the SAP R/3 application. Their measurements are generally neither consistent nor repeatable, as a user's interaction with a given application varies over time. Moreover, they are typically not suited to detecting system slowdowns or failures from the perspective of an end user. Operations for one such passive monitor are described in "Characterizing End-to-End Performance: A VitalSigns Whitepaper," VitalSigns Software, Inc. 1998.
Another approach to passive monitoring is directed to the network infrastructure rather than the overall system. On the network side, element managers or passive network monitors are known which may address a specific segment or device on the network. Element managers are generally software designed to manage specific groups of devices, such as routers and switches. Passive network monitors are typically a combination of hardware and software that may, for example, monitor network traffic at the link layer or at the infrastructure devices. Products falling in this category include remote monitor (RMON) probes from NetScout Systems, Inc., Sniffer from Network Associates, NetMetrix from Hewlett-Packard, Application Expert from Optimal Networks Corp., EcoSCOPE from Compuware Corp., and Visual OnRamp from Visual Networks, Inc. These network management tools typically provide information such as packet loss, bit rates, and network utilization. This type of information may be helpful in fixing a network problem after the problem has been identified. However, as with the passive system monitors, these tools generally do not reflect network performance as experienced by a user. These tools are passive, in that they generally watch the network traffic which traverses a network segment or link, rather than actively creating traffic.
Passive network monitors sometimes include a basic scheduler to collect sample data from their data sources. A basic scheduler generally merely specifies the frequency (e.g., once every 15 minutes) at which the management console of the monitor should collect data from the data sources. Passive monitors are limited in that they are typically expensive to scale and only see traffic that is on the network at the time of collection. Also, if an anomaly event occurs, it is often desirable to collect performance data at the time of the anomaly event. Approaches limited to scheduled collection typically do not address this need.
Another category of system management tool is active application monitors. These are products that generally measure performance by actively emulating application transactions. These transactions are often referred to as "synthetic" transactions. Products in this category include Ganymede Software Inc.'s Chariot® and Pegasus™ products, as described in U.S. Pat. Nos. 5,838,919, 5,881,237 and 5,937,165, VeriServ from Response Networks, Inc. and SLM from Jyra Research Inc. VeriServ allows an operator to define the types of applications to be monitored, times and days, and the end user locations from which the transactions are to originate. The operator may also choose to define alarm thresholds. Agents installed at the end user location monitor actual sample application transactions to measure performance of the applications operating over the network environment. VeriServ automatically tests applications at a fixed interval. SLM provides the flexibility for the user to schedule synthetic transactions for any interval from 5 minutes to a year. However, as these approaches are also typically directed to a particular application and require that the applications be installed and operating to generate network traffic, they generally only address simple web and database transactions. Also, any new or custom applications may require extensive configuration by the users to allow the tester to interact with the applications. In addition, active network testers add traffic to the communication network being tested, thereby using network resources which would otherwise be available for users.
A further tool available to IT staffs in many network environments, such as client-server networks supporting the Internet Protocol (IP), is the traceroute utility. It is known to IT staffs in such environments that, on receipt of a performance complaint from a user, the IT staff may manually execute a traceroute between the client device of the complaining user and the server device associated with the network communication flows related to the complaint. A traceroute may provide an identification of each of the devices in the network connection path between the client and the server at the time of execution of the traceroute utility by the IT staff.
As the range of information available to IT staffs from network performance tools increases, IT staffs face increasing challenges in attempting to analyze the large volumes of resulting data to identify and respond to problems promptly. The increasing complexity of networks and the variety of applications and users utilizing those networks in a client-server environment make the challenge even greater for IT staffs. These problems are further exacerbated because networks are typically not static: new hardware and software application programs may be periodically added, thereby changing the traffic characteristics on the network and the end user's experience of network performance. It is increasingly important to analyze the actual performance of the network to be tested without the constraints and limitations of these existing tools. It would also be beneficial to provide network performance tools that reduce the level of expertise about network topology required of IT personnel.
The present invention provides methods, systems and computer program products for tracking network device performance which, in various embodiments, may track device performance by acquiring and storing routing information for communication connections over the network on a scheduled basis during normal operations and responsive to exception events. Network performance measurements may be obtained on a repeated basis, for example, pursuant to a test schedule. The performance measurements may be obtained from either active or passive testing. A traceroute may be initiated for a plurality of connections (for example, client to server) on a repeated basis, for example, a periodic basis, and the performance measurements for each connection for the same time period may be associated with the detected routing information to provide baseline information relating to the performance of one or more routes which support each connection. In addition, the network performance measurement system may detect exception events based on the performance measurements and initiate traceroutes responsive to detected exception events and associate the exception events with the detected routing information as well. While performance measurements may provide end-to-end views of a connection, the routing information may provide insights into the network infrastructure. Accordingly, IT staffs may be provided the opportunity to compare different routings to detect relatively poor performing routings or outages (if routing is not complete between the first and second devices) and identify the network devices included in those routings (optionally, along with latency between hops and/or error information from the traceroute).
In one embodiment of the present invention, network device performance may be tracked. Network performance measurements are repeatedly obtained for a communication connection between a first device and a second device. A routing associated with the obtained network performance measurements is repeatedly determined, the determined routings being defined by a set of network devices establishing the corresponding communication connection. In addition, in one embodiment, additional information, such as the latency between hops or error information, for example when an outage or timeout is detected, may be obtained during the traceroute. The determined routings and associated network performance measurements are stored to provide baseline information related to performance of the determined routings. Determining the routing, in one embodiment related to an IP network, includes running a traceroute between the first device and the second device within a determined time period of a time at which the associated network performance measurements are obtained.
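As an illustration only, the association of each performance measurement with a traceroute run within a determined time period might be sketched as follows. The function name `associate_routings`, the record layout, and the rule of picking the traceroute closest in time are assumptions made for this sketch and are not taken from the specification.

```python
def associate_routings(measurements, traceroutes, window_s):
    """Pair each measurement with the traceroute closest in time.

    measurements: list of (timestamp_s, value) tuples (hypothetical layout).
    traceroutes:  list of (timestamp_s, hops) tuples, where hops is the
                  ordered list of device addresses reported by a traceroute.
    window_s:     the "determined time period", in seconds (assumed symmetric).

    Measurements with no traceroute inside the window are left out of the
    baseline rather than being paired with a stale routing.
    """
    records = []
    for m_time, value in measurements:
        # Keep only traceroutes that ran close enough to this measurement.
        nearby = [
            (abs(t_time - m_time), hops)
            for t_time, hops in traceroutes
            if abs(t_time - m_time) <= window_s
        ]
        if not nearby:
            continue
        # Choose the traceroute with the smallest time difference.
        _, hops = min(nearby, key=lambda pair: pair[0])
        records.append({"time": m_time, "value": value, "routing": tuple(hops)})
    return records
```

The resulting records could then be stored as baseline information; how they are persisted is left open here.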
In a further embodiment of the present invention, the network performance measurements associated with routings having a common set of network devices are grouped to provide network performance measurements for each of a plurality of particular routings, each of the particular routings having a different set of network devices establishing the corresponding communication connection. The provided network performance measurements and the associated particular routings are stored to provide network performance measurements for each of the particular routings between the first device and the second device.
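The grouping described above can be sketched, under stated assumptions, as a dictionary keyed by routing. Here a routing is represented as the ordered tuple of hop addresses; treating two routings as "common" only when the hops also appear in the same order is an assumption of this sketch, and the name `group_by_routing` is hypothetical.

```python
from collections import defaultdict


def group_by_routing(samples):
    """Group measurement values by the routing they were observed on.

    samples: iterable of (hops, value) pairs, where hops is the sequence of
             device addresses for the connection (hypothetical layout).
    Returns a dict mapping each distinct routing (a tuple of hops) to the
    list of values observed on it, in input order.
    """
    groups = defaultdict(list)
    for hops, value in samples:
        # Tuples are hashable, so each distinct routing becomes one key.
        groups[tuple(hops)].append(value)
    return dict(groups)
```

If hop order should be ignored, the key could instead be `frozenset(hops)`; the specification does not settle this choice.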
In another embodiment of the present invention, an exception event is detected based on the obtained network performance measurements. A traceroute between the first device and the second device is run responsive to detection of the exception event. The exception event is, preferably, associated with one of the particular routings having a common set of network devices as provided by the traceroute run responsive to detection of the exception event. An exception event may be, for example, a transition from a normal to a critical condition for a performance measurement and/or a connection failure.
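A minimal sketch of exception-triggered traceroutes follows. It assumes a single response-time threshold separates the normal and critical conditions, and that a measurement of `None` represents a connection failure; the class name `ExceptionDetector`, the event layout, and the injected `run_traceroute` callable are all hypothetical and stand in for whatever mechanism an implementation actually uses.

```python
class ExceptionDetector:
    """Fire an exception event on a normal-to-critical transition
    or on a connection failure, and record the routing at that moment."""

    def __init__(self, critical_threshold, run_traceroute):
        self.critical_threshold = critical_threshold
        # Callable returning the current list of hops (assumed, injected
        # so the sketch does not depend on any particular traceroute tool).
        self.run_traceroute = run_traceroute
        self._critical = False

    def observe(self, response_time):
        """Process one measurement; return an event dict or None."""
        failed = response_time is None
        critical = failed or response_time > self.critical_threshold
        # A failure always fires; a critical value fires only on transition.
        fire = failed or (critical and not self._critical)
        self._critical = critical
        if not fire:
            return None
        kind = "connection_failure" if failed else "critical_transition"
        # Run the traceroute only in response to the detected event.
        return {"kind": kind, "routing": tuple(self.run_traceroute())}
```

Because the routing is captured when the event fires, the event can later be associated with one of the particular routings already being tracked.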
In another embodiment of the present invention, the network performance measurements for the particular routings between the first device and the second device and exception events for the particular routings between the first device and the second device are displayed. The network performance measurements may be displayed as an average time for each type of network performance measurement. The type of network performance characteristic may be selected from the group consisting of throughput, response time, availability and transaction rate. The exception events for the particular routings may be displayed as an exception rate.
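The per-routing summary described above might be computed as in the following sketch. The specification does not define how the exception rate is calculated; this example assumes it is the number of exception events divided by the number of tests run on that routing. The function name `summarize_routing` and its parameters are hypothetical.

```python
from statistics import mean


def summarize_routing(measurements, exception_count, test_count):
    """Summarize one routing for display.

    measurements:    dict mapping a measurement type (for example
                     "response_time" or "throughput") to a list of values.
    exception_count: number of exception events seen on this routing.
    test_count:      number of tests run on this routing (assumed divisor).
    """
    # Average each measurement type that has at least one value.
    averages = {
        kind: mean(values) for kind, values in measurements.items() if values
    }
    # Guard against division by zero when no tests were run.
    rate = exception_count / test_count if test_count else 0.0
    return {"averages": averages, "exception_rate": rate}
```

For instance, calling `summarize_routing({"response_time": [0.25, 0.75]}, 1, 4)` would report an average response time of 0.5 and an exception rate of 0.25 under these assumptions.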
As will further be appreciated by those of skill in the art, while described above primarily with reference to method aspects, the present invention may also be embodied as systems and/or computer program products.