Networks are used to interconnect multiple devices, such as computing devices, and allow the communication of information between the various interconnected devices. Many organizations rely on networks to communicate information between different individuals, departments, work groups, and geographic locations. In many organizations, a network is an important resource that must operate efficiently. For example, networks are used to communicate electronic mail (e-mail), share information between individuals, and provide access to shared resources, such as printers, servers, and databases. A network failure or inefficient operation may significantly affect the ability of certain individuals or groups to perform their required functions.
A typical network contains multiple interconnected devices, including computers, servers, printers, and various other network communication devices such as routers, bridges, switches, and hubs. The multiple devices in a network are interconnected with multiple communication links that allow the various network devices to communicate with one another.
If a particular network device, network system, or network application fails, the network has a performance problem. Diagnosing the cause or source of such a problem usually involves a laborious manual process that consumes a considerable amount of time and effort.
Correlating a performance problem with the event that caused it is difficult because similar symptoms can be caused by different problems: problems in the client or server machines, the routers, the networks, or even other clients flooding the database server with simultaneous requests. Thus, diagnosing performance problems in a network composed of multiple devices and systems presents several difficulties and challenges. These difficulties are discussed in more detail below.
One difficulty is that problems can originate in any network, system, or application element. Problems are not always observable where they originate.
Another difficulty is that a single problem can manifest itself as numerous symptoms in multiple elements in multiple domains; for example, network nodes experience packet loss, a database server experiences poor performance, and clients (users) experience longer response times.
Another difficulty is that different problems can manifest themselves as overlapping symptoms, i.e., different problems may exhibit the same or common symptoms.
Still another difficulty is that each managed element can experience different types of problems, and each problem can propagate from the faulty element to other network, system, or application elements, making them appear faulty as well.
Still another difficulty is that symptoms can propagate along relationships between elements, such as upwards from the physical layer to the application layer, and sideways to connected routers and hosts along the connectivity relationship.
Therefore, it becomes necessary to correlate symptoms along related elements, and analyze relationships between elements to identify the effects and cause of a problem.
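By way of illustration only, the correlation of symptoms along relationships between elements can be sketched as a traversal of a propagation graph. The following is a minimal sketch under stated assumptions, not a description of any particular management product: the topology, the element names (router-1, db-server, client-a, client-b), and the functions affected_by and candidate_causes are all hypothetical. It treats an element as a candidate cause when every observed symptom lies within the set of elements to which a fault in that element could propagate.

```python
from collections import deque

# Hypothetical topology (an assumption for illustration): each key names an
# element, and the list holds the elements to which its fault can propagate.
PROPAGATES_TO = {
    "router-1": ["db-server", "client-a", "client-b"],
    "db-server": ["client-a", "client-b"],
    "client-a": [],
    "client-b": [],
}

def affected_by(element, graph):
    """Return the element plus everything reachable along propagation edges."""
    seen = {element}
    queue = deque([element])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def candidate_causes(symptomatic, graph):
    """Return elements whose propagation set covers every observed symptom."""
    observed = set(symptomatic)
    return sorted(e for e in graph if observed <= affected_by(e, graph))

# Slow responses at both clients are consistent with more than one cause.
print(candidate_causes(["client-a", "client-b"], PROPAGATES_TO))
# A symptom at the router itself narrows the candidates.
print(candidate_causes(["router-1", "client-a"], PROPAGATES_TO))
```

In this sketch, symptoms observed only at the two clients leave both the database server and the router as candidates, which mirrors the overlapping-symptom difficulty described above; an additional symptom at the router narrows the result to the router alone.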
However, although a multitude of management products and mechanisms are available to generate detailed operational data about managed elements (devices, systems, and applications) in the form of events and alarms, these products and mechanisms lack the capability to systematically analyze this data and identify the cause of the problem. Thus, when there is a performance problem, diagnosing its cause or source usually involves a laborious manual process. The task of sorting through and analyzing the alarms, events, and data logs to identify the cause of the problem is left to the skill and expertise of network and system administrators. Moreover, a problem condition may produce a storm (or flood) of events and alarms that further hinders the manual diagnostic process. As a result, a considerable amount of time and effort is typically expended to manually diagnose and pinpoint the cause of performance problems.