1. Field of the Invention
The present application relates to monitoring distributed systems for analysis and management, and more particularly to reducing the number, and determining the appropriate location, of monitors used for system monitoring, analysis and management operations.
2. Description of Related Art
Management and analysis of networks and other distributed systems, e.g., computer, communication, etc., conventionally involves monitoring individual components of the system. In a network, monitored components can include hardware elements such as servers, hosts, routers, switches, links, interfaces, etc. and software elements such as operating systems, infrastructure middleware, protocol stacks, applications, services, etc. The network components are typically monitored to detect events in the network such as faults, component failures, latency, bottlenecks, etc. The nodes in a network or distributed system generally have the basic mechanisms needed to reply to monitoring requests generated by monitors. In addition, the monitors can typically communicate with one another. In a given system, the monitoring can detect properties or characteristics such as reachability, latency, throughput, utilization, status, or any other measurement that the monitors have been configured to monitor.
There are several possible sources for monitoring information in a conventional distributed system. These sources may include agents, agent-less monitors, and beacons. An agent is an entity located at or close to the monitored component, e.g., node, that can provide information regarding the component being monitored, i.e., local information. The agent can be implemented in software, hardware, or some combination thereof. Agents typically operate under a particular network management standard such as Simple Network Management Protocol (SNMP), which allows for remote access to the monitoring information.
An agent-less monitor typically relies on an Application Program Interface (API) provided by the system itself to retrieve monitoring information. The agent-less monitor removes the necessity of placing an agent local to the managed component to collect the information. Agent-less monitors are well-known and are used in software systems using standards such as Java Management Extensions (JMX), Windows Management Instrumentation (WMI), etc.
A beacon is a software entity executing at a hosting device and is capable of monitoring internal as well as external components of the hosting device or other devices in the network. The monitoring of other devices by a beacon produces measurements perceived from the location of the beacon. For example, a beacon residing at a node or device D1 monitoring the connectivity to another node or device D2 may indicate whether information transmitted from device D1 can reach device D2.
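As a minimal illustration of the D1-to-D2 example above, the following Python sketch shows one way a beacon might test reachability. It assumes that a successful TCP connection is an acceptable proxy for "information transmitted from device D1 can reach device D2"; the function name can_reach, the port and the timeout are hypothetical and are not taken from the present application.

```python
import socket


def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection from the local device to (host, port) succeeds.

    Assumption: reachability is approximated by a TCP connect attempt.
    The result reflects connectivity as perceived from the beacon's own location.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

A beacon running on device D1 could call can_reach with the address of device D2 and report the boolean result to a management station. The same call made from a different device may give a different answer, which is why the measurement is tied to the location of the beacon.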
As used herein, the term monitor or monitoring entity refers to that entity or combination of entities used for retrieving, obtaining, accessing or reading monitoring information, including but not limited to, agents, agent-less monitors and beacons as described above.
Monitoring activities in a distributed system may be more formally stated as:
Let N be the set of nodes;
Let K be any set of pairs of nodes; and
Let monitoring function g or analysis function α include:
(1) collecting measurements on all members of N with the support of the monitors; and/or
(2) collecting measurements on all members of K with the support of the monitors.
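The formulation above can be sketched in Python as follows. This is only an illustrative reading of the definitions: the function collect_measurements and the callables measure_node and measure_pair are hypothetical stand-ins for whatever the monitors actually report (status, latency, reachability, etc.).

```python
from typing import Any, Callable, Dict, Hashable, Iterable, Tuple

Node = Hashable
Pair = Tuple[Node, Node]


def collect_measurements(
    nodes: Iterable[Node],                      # the set N
    pairs: Iterable[Pair],                      # the set K of node pairs
    measure_node: Callable[[Node], Any],        # hypothetical per-node monitor
    measure_pair: Callable[[Node, Node], Any],  # hypothetical per-pair monitor
) -> Tuple[Dict[Node, Any], Dict[Pair, Any]]:
    """Collect one measurement for every member of N and every member of K."""
    node_results = {n: measure_node(n) for n in nodes}
    pair_results = {(a, b): measure_pair(a, b) for (a, b) in pairs}
    return node_results, pair_results


# Example with stub monitors: every node reports "up", and a pair is
# considered reachable unless it involves the (hypothetical) node "D3".
N = {"D1", "D2", "D3"}
K = {("D1", "D2"), ("D1", "D3")}
node_status, pair_status = collect_measurements(
    N, K,
    measure_node=lambda n: "up",
    measure_pair=lambda a, b: "D3" not in (a, b),
)
```

Here node_status corresponds to item (1) and pair_status to item (2). A monitoring or analysis function may use either or both.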
An example of a monitoring or analysis function may be to detect failures of a node in a distributed network system or of an application component in a distributed application. Such a failure can propagate in the network (application) and manifest as symptoms detected by the monitors, such as the inability of a client to communicate with a server. The management system may then use known techniques to identify the root cause of the problem by analyzing the observed or detected symptoms.
Prior art management of distributed systems conventionally involves using monitors for monitoring every significant component in the network or system. Such widespread monitoring of system components is necessary for proper analysis and management of the system. It is, however, costly and results in high computation overhead, as a high volume of traffic is needed to relay all the events that the monitoring entities detect or observe to the management stations. Furthermore, it can be difficult to scale the monitoring tasks as the number of devices grows. This is particularly true in the case where the network bandwidth is limited.
However, often not all the detected events, i.e., symptoms, are needed to complete a desired analysis function or operation, e.g., a root cause problem identification. It may, for example, be possible to conclude that the network or distributed application failed even if not all of the client/server connection failures are detected and considered. A sufficient number of monitors should be provided to always complete the desired operation, e.g., determining the root cause, or to significantly limit the set of possible root causes for all problems of interest.
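The observation that a subset of symptoms can be sufficient may be illustrated with a small Python sketch. It assumes a hypothetical table mapping each root cause to the set of symptoms it produces, and treats a set of monitored symptoms as sufficient when every root cause still yields a distinct, non-empty observed signature. The names and the table are invented for illustration and do not represent the method of the present application.

```python
from typing import Dict, Set


def distinguishes_all(causes: Dict[str, Set[str]], monitored: Set[str]) -> bool:
    """Return True if the monitored symptoms alone identify every root cause.

    Each cause is reduced to the symptoms that are actually monitored; the
    reduced signatures must be non-empty and pairwise different.
    """
    seen = set()
    for symptoms in causes.values():
        signature = frozenset(symptoms & monitored)
        if not signature or signature in seen:
            return False
        seen.add(signature)
    return True


# Hypothetical causality table: symptoms are client/server connection failures.
causes = {
    "router_R1_down": {"c1->s1", "c2->s1"},
    "server_S1_down": {"c1->s1", "c2->s1", "c3->s1"},
    "link_L2_down": {"c2->s1", "c3->s1"},
}

enough = distinguishes_all(causes, {"c1->s1", "c3->s1"})      # two of three symptoms
not_enough = distinguishes_all(causes, {"c1->s1", "c2->s1"})  # R1 and S1 look alike
```

In this example, monitoring only the c1->s1 and c3->s1 connections still separates the three root causes, whereas monitoring c1->s1 and c2->s1 does not, because a router failure and a server failure then produce the same observed symptoms.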
Hence, a need exists in the industry for a method and apparatus for reducing the number of monitoring entities and/or appropriately locating the monitoring entities while still substantially attaining the required system operations and management goals.