The present invention relates to the monitoring of network elements comprising a high speed managed data network, and more particularly to a dynamically adaptive network element telemetry system.
A network needs to be monitored for the existence, disappearance, reappearance and status of traditional network devices such as routers, hubs and bridges and, more recently, high speed switching devices such as ATM, Frame Relay, DSL, VoIP and cable modems.
In order to generate meaningful performance reports, a network management system is required to collect and process primarily two types of data: Network Topology Data and Performance Data. Network Topology Data defines what each object is and where it is located in the network hierarchy, while Performance Data are scalar values representing the management variable for each object at specified time intervals.
Network monitoring is traditionally achieved through the polling of the elements. A typical initial installation configuration of a management system known in the art results in the network element polling rate being set at a fixed default level, typically one poll every 15 minutes. Given the timing of samples and the delays associated with data processing, the management system is able to generate reports only after a couple of polling cycles.
Periodically, there is a need to increase this default polling rate by decreasing the polling interval on specific network elements to allow for closer scrutiny of the network under management. The problem is that most existing systems are unable to change polling parameters "on-the-fly". Instead, it has been necessary to perform laborious manipulations for each desired change, followed by a disruption of polling activity to allow for the transmission of the resulting changes, and finally to wait for a couple of polling cycles to occur in order to generate reports based on the changes.
This results in delays in the availability of information, and those delays typically exceed the time within which the information is required. The solution to this problem, therefore, must comprise a method for rapidly modifying not only the polling rate of specific elements, but any operational parameter required for network telemetry, such as a fall back parameter that controls the retry behaviour of a transaction that fails due to network congestion.
For the foregoing reasons, there is a need for a method of network element telemetry that provides for the localized, low-latency re-configuration and reporting of monitoring transactions without a disruption of polling activity.
The present invention is directed to a dynamically adaptive network element telemetry system that satisfies this need. The system leverages the functionality of a high speed communications network and comprises a network element telemetry infrastructure having at least one Performance Monitor (PM) server computer and at least one, and preferably a plurality of, Data Collection (DC) node computers comprising a Data Collection Process (DCP) that includes a command handler. The system further comprises a Single Distributed Arena (SDA) encompassing the network element telemetry infrastructure to form a single large parallel virtual application, wherein the DCP further comprises a run time telemetry parameter operational change application. The SDA further comprises a telemetry control and collected data filter application, which provides the primary interface between the DCP and the SDA, and a performance telemetry controller, which enables the suspension, resumption or change of the parameters of performance telemetry for any network element.
In an aspect of the invention, there is a defined maximum interval of time for which any element can be fast-polled more often than the default rate. At programmed intervals the server will traverse the list of known network elements and any elements that have been polled longer than permissible will be restored to the background rate. Permissible time is calculated from the first rate change request and is not reset for subsequent requests.
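The periodic sweep described above might be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `Element`, `request_fast_poll` and `sweep`, the 15-minute background interval and the one-hour maximum are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

BACKGROUND_INTERVAL = 900.0   # assumed default: one poll every 15 minutes (seconds)
MAX_FAST_POLL_TIME = 3600.0   # assumed maximum fast-poll duration (seconds)


@dataclass
class Element:
    name: str
    interval: float = BACKGROUND_INTERVAL
    fast_since: Optional[float] = None  # time of the FIRST rate change request


def request_fast_poll(elem: Element, interval: float, now: float) -> None:
    """Shorten the polling interval; the start time is recorded only once."""
    elem.interval = interval
    if elem.fast_since is None:
        elem.fast_since = now  # subsequent requests do not reset the clock


def sweep(elements: List[Element], now: float) -> List[str]:
    """Restore every element that has been fast-polled longer than permitted."""
    restored = []
    for elem in elements:
        if elem.fast_since is not None and now - elem.fast_since > MAX_FAST_POLL_TIME:
            elem.interval = BACKGROUND_INTERVAL
            elem.fast_since = None
            restored.append(elem.name)
    return restored
```

Because `fast_since` is set only by the first request, a client cannot extend fast-polling indefinitely by re-issuing rate changes.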
In an aspect of the invention, a limit on the number of concurrent fast-polls per managed device is enforced.
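One way such a per-device limit could be enforced is sketched below. The class `FastPollLimiter`, its methods and the limit of two are hypothetical; the sketch also assumes that what is counted is the number of distinct elements being fast-polled on a given device.

```python
from collections import defaultdict

MAX_FAST_POLLS_PER_DEVICE = 2  # assumed limit, for illustration only


class FastPollLimiter:
    """Tracks which elements are being fast-polled on each managed device."""

    def __init__(self, limit=MAX_FAST_POLLS_PER_DEVICE):
        self.limit = limit
        self.active = defaultdict(set)  # device -> set of fast-polled element ids

    def acquire(self, device, element):
        """Return True if the element may be fast-polled, False if the device is at its limit."""
        current = self.active[device]
        if element in current:
            return True  # already fast-polled; no additional load on the device
        if len(current) >= self.limit:
            return False
        current.add(element)
        return True

    def release(self, device, element):
        """Free a slot when the element returns to the background rate."""
        self.active[device].discard(element)
```

A request that is refused leaves the element at the background rate, which protects the device from excess measurement traffic.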
In an aspect of the invention, provision is made in the protocol for clients to specify formulae containing more than one telemetric parameter. This allows a client to monitor more than one separately generated statistic derived from data received in the same probe to minimize measurement traffic to the managed device.
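The idea of deriving several statistics from a single probe can be illustrated with a short sketch. The function `evaluate_formulae`, the parameter names `in_octets` and `out_octets`, and the two formulae are hypothetical; the protocol's actual formula syntax is not specified here.

```python
def evaluate_formulae(sample, formulae):
    """Apply each named formula to the values returned by ONE probe."""
    return {name: fn(sample) for name, fn in formulae.items()}


# One probe response carrying two telemetric parameters (illustrative values).
sample = {"in_octets": 1200, "out_octets": 800}

# Two separately reported statistics, each using more than one parameter.
formulae = {
    "total_octets": lambda s: s["in_octets"] + s["out_octets"],
    "in_share": lambda s: s["in_octets"] / (s["in_octets"] + s["out_octets"]),
}

results = evaluate_formulae(sample, formulae)
```

Both statistics are computed from the same sample, so the managed device is probed once rather than once per statistic.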
As well, measurement traffic to the managed device is not increased with additional subscribing clients on the same element since these additional clients get their information from the same data stream.
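The sharing of one data stream among several clients might look like the following sketch. The class `ElementStream` and its `probe` callable are hypothetical stand-ins for whatever mechanism actually queries the managed device.

```python
class ElementStream:
    """One data stream per element, fanned out to every subscribing client."""

    def __init__(self, probe):
        self.probe = probe        # callable that queries the managed device
        self.subscribers = []
        self.probe_count = 0      # number of probes actually sent to the device

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def poll(self):
        sample = self.probe()     # a single probe, regardless of subscriber count
        self.probe_count += 1
        for callback in self.subscribers:
            callback(sample)
```

Adding a subscriber only lengthens the fan-out list; the number of probes sent to the device per polling cycle stays at one.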
The invention provides for more granular monitoring of identified trouble spots in the network than is possible under the default background polling rate, and does so without a disruption of polling activity. By enabling the user to increase the frequency of polling for specific network elements, more information about the behaviour of those elements can be generated over a shorter period of time.
Near real-time access to the management telemetry stream provides the user with a tight watch on possible troubled areas of the network by making increased telemetry available for any managed element that the system has flagged as performing outside the normative range.
The invention enables rapid and dynamic control of the operational parameters of management transactions conducted by the DC node computer on behalf of the PM server computer.
As well, the invention safeguards against over-management of delicate or heavily loaded devices.