There are a number of network management systems available. These systems gather fault information from disparate devices across a network and then correlate, categorize, prioritize and present this information in a form that allows an operator to manage the network and repair it efficiently. In addition, basic predictive statistical analytic techniques have been applied to operational data gathered from network devices to predict potential future problems.
Network management involves gathering data from a range of devices in a network. Known implementations use a wide variety of monitoring devices, such as probes or agents, to perform this task, and these can provide a large amount of source data from many types of network devices and systems.
One of the problems with managing very large networks is that some network failure modes can result in a very large number of fault events, particularly when a network cascade failure occurs. The high number of fault events can flood the network management system, making it unresponsive and making it difficult for an operator to isolate the original cause of the failure or to prioritize the repair effort efficiently. In existing solutions, a monitoring probe (which may gather data from multiple devices) can initiate a shutdown once the fault event rate exceeds a given threshold, and then initiate a restart once the rate drops back below the threshold. However, by this point a cascade failure has often already started to occur, and many other devices may have started to flood the management system. There will typically also be a large number of fault events already resident in the system before this basic form of flood protection is activated. Disadvantageously, this solution also results in a large amount of data loss, including information that may be vital to fixing the network. Furthermore, if the probe is monitoring multiple devices, then all data from all devices is lost even if only one of them is producing the event flood. Finally, intelligent central administration of how probes manage a data flood is not possible.
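The data loss described above can be illustrated with a minimal sketch of threshold-based flood protection. The class name `ThresholdProbe`, the sliding one-second window, the threshold of five events and the device names are all assumptions made for illustration; they are not taken from any particular product or from the cited prior art.

```python
from collections import deque


class ThresholdProbe:
    """Hypothetical probe that stops forwarding fault events while the
    event rate in a sliding window is above a fixed threshold."""

    def __init__(self, max_events, window_ms=1000):
        self.max_events = max_events  # fixed threshold (assumed value)
        self.window_ms = window_ms    # sliding window length (assumed)
        self.recent = deque()         # timestamps of recent events
        self.forwarded = []           # events passed to the management system
        self.dropped = []             # events lost while the probe is shut down

    def on_event(self, time_ms, device, message):
        # Record the event and discard timestamps that left the window.
        self.recent.append(time_ms)
        while self.recent[0] <= time_ms - self.window_ms:
            self.recent.popleft()
        event = (time_ms, device, message)
        # Above the threshold the probe is "shut down" and the event is
        # lost, whichever device produced it.
        if len(self.recent) > self.max_events:
            self.dropped.append(event)
        else:
            self.forwarded.append(event)


probe = ThresholdProbe(max_events=5)

# One device floods the probe with ten events in under a second.
for i in range(10):
    probe.on_event(i * 100, "router-A", "link down")

# A second device reports a single, unrelated fault during the flood...
probe.on_event(950, "switch-B", "power supply failure")
# ...and another one after the rate has dropped below the threshold.
probe.on_event(3000, "switch-B", "fan failure")

lost_devices = sorted({device for _, device, _ in probe.dropped})
print(len(probe.forwarded), len(probe.dropped), lost_devices)
```

In this sketch the single event from `switch-B` that arrives during the flood is discarded together with the excess events from `router-A`, which corresponds to the loss of data from all monitored devices noted above.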
For example, U.S. Pat. No. 7,539,752 discloses detecting event numbers exceeding a fixed threshold and causing the number of events permitted to be throttled back. As a further example, United States Patent Application No. 20100052924 discloses detecting event numbers exceeding a fixed threshold and causing event information to be buffered. As a result, information becomes unavailable for managing the system during the event flooding incident.
Existing predictive analytic systems often concentrate on device metrics that display a simple progression before the device develops a fault condition. For example, such systems may fit a linear trend to disk space or central processing unit (CPU) usage to predict a future problem, or perform a historical analysis of these metrics to indicate abnormal usage. Again, in each case, the prediction relies on a fixed threshold to determine the abnormality, and these systems cannot take a flexible approach to device-specific fault event rates, as this metric is much harder to gather and analyze.
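As an illustration of the simple trend-based prediction described above, the following sketch fits an ordinary least-squares line to a short series of disk usage samples and estimates when the line will cross a fixed threshold. The function names, the sample values and the 90 percent threshold are hypothetical and are chosen only to make the arithmetic easy to follow.

```python
def fit_linear_trend(times, values):
    """Ordinary least-squares fit of values = slope * time + intercept."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return slope, intercept


def predict_crossing(times, values, threshold):
    """Return the time at which the fitted line reaches the fixed
    threshold, or None if the metric is not increasing."""
    slope, intercept = fit_linear_trend(times, values)
    if slope <= 0:
        return None
    return (threshold - intercept) / slope


# Hypothetical daily disk usage samples, in percent.
days = [0, 1, 2, 3, 4]
disk_usage = [50, 55, 60, 65, 70]

slope, intercept = fit_linear_trend(days, disk_usage)
crossing_day = predict_crossing(days, disk_usage, threshold=90)
print(slope, intercept, crossing_day)
```

With these sample values the fitted line rises by five percentage points per day from 50 percent, so the fixed 90 percent threshold is predicted to be reached on day 8. The approach works because disk usage progresses smoothly; a bursty, device-specific fault event rate does not lend itself to this kind of fit.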
Therefore, there is a need to address the aforementioned problems in network systems according to the present state of the art.