A monitoring solution can be used in software systems to check the data outputs and confirm whether these outputs are within acceptable parameters. In the event that the data is not within acceptable parameters, a monitoring alert can be issued to notify maintenance personnel of potential problems with the health of the system.
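The check described above can be sketched in a few lines. This is a minimal illustration only; the `Threshold` type, the `check_output` function, and the `latency_ms` indicator name are hypothetical and do not come from any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Threshold:
    """Hypothetical acceptable range for one monitored data output."""
    low: float
    high: float


def check_output(name: str, value: float, limits: Threshold) -> Optional[str]:
    """Return an alert message if value falls outside limits, otherwise None."""
    if limits.low <= value <= limits.high:
        return None  # within acceptable parameters, no alert
    return f"ALERT: {name}={value} outside [{limits.low}, {limits.high}]"


# Assumed example values: a latency indicator with a 0-200 ms acceptable range.
limits = Threshold(low=0.0, high=200.0)
print(check_output("latency_ms", 120.0, limits))
print(check_output("latency_ms", 350.0, limits))
```

Under these assumptions the first call issues no alert, while the second returns a message that could be routed to maintenance personnel.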
Software components are interdependent, and a fault condition in a single component can result in a cascade of fault conditions in a number of interrelated components in the system. Thus, a number of different alerts can be raised for the same fault condition. These multiple alerts can become noise since action is required for only a single component, not for each affected component. This noise can make it difficult for administrators to identify the root cause of the fault condition.
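The cascade effect can be illustrated with a small dependency graph. The component names and the `DEPENDENTS` map below are assumptions made up for this sketch; the point is only that one fault can surface as several alerts.

```python
from collections import deque

# Hypothetical dependency map: component -> components that depend on it.
DEPENDENTS = {
    "database": ["order_service", "billing_service"],
    "order_service": ["web_frontend"],
    "billing_service": ["web_frontend"],
    "web_frontend": [],
}


def alerts_for_fault(root):
    """Breadth-first walk listing every component that would raise an alert."""
    seen = [root]
    queue = deque([root])
    while queue:
        current = queue.popleft()
        for dependent in DEPENDENTS.get(current, []):
            if dependent not in seen:
                seen.append(dependent)
                queue.append(dependent)
    return seen


alerting = alerts_for_fault("database")
print(len(alerting), "alerts for one root cause:", alerting)
```

In this assumed topology, a single database fault produces four alerts, although only the database requires action.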
Noise can also arise from other sources. Multiple health indicators can be used to monitor the same issue from different perspectives in order to improve monitoring coverage or robustness. These health indicators can be useful individually, but can be redundant when the indicators all independently discover the issue at about the same time.
Additionally, while monitoring information can be useful for analyzing system performance, the information is not necessarily useful for alerting when it indicates only part of the problem. In such cases, the administrators need not immediately work on the problem unless other indicators also raise alerts. Data of this type is not actionable and becomes noise if presented in the form of an alert. Many monitoring solutions today collect such “forensic” data to ease troubleshooting, yet typically present the data in the form of an alert, which can produce noise.
Noise can also occur if multiple valid alerts having different scope or severity are raised at about the same time. The lesser issues can make it difficult to isolate and identify the greater issues, thereby requiring extra time and effort by system administrators to ascertain the source of the problem.
Solutions for noise control are known in which specific correlation rules are written to describe relationships between individual health indicators to accommodate specific problem scenarios (e.g., certain problem alerts are issued upon certain concurrent combinations of health indicators). However, such solutions have drawbacks.
Since each alert condition needs its own rule, a large number of rules is required, and the rules can still fail to accommodate every potential problem path. Additionally, different rules can correlate to the same health indicator, and if the rules are evaluated separately, the same problem can be reported multiple times. Further, a single health indicator can exist in multiple problem paths, and if a shared health indicator is updated or removed from the health model, all the associated rules need to be updated. Still further, such noise reduction solutions do not perform well if the components belong to a different team or product, since errors can be introduced by the foreign components.
Probability-based noise reduction solutions are also known for estimating statistical likelihoods for root cause candidates. However, it can be difficult to define good probability numbers for each cause-impact link, since the impact of changing one probability number is often not intuitive.
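Two of the rule-based drawbacks noted above, duplicate reporting and the maintenance cost of shared indicators, can be made concrete with a minimal sketch. The rule names, indicator names, and helper functions here are hypothetical; a rule is modeled simply as a set of health indicators that must all be unhealthy at the same time.

```python
# Hypothetical correlation rules: alert name -> indicators that must all be unhealthy.
RULES = {
    "database_unreachable": {"db_ping_failed", "query_timeout"},
    "storage_degraded": {"query_timeout", "disk_latency_high"},
}


def evaluate_rules(unhealthy):
    """Evaluate each rule separately; return the rules whose indicators all fired."""
    return sorted(name for name, required in RULES.items() if required <= unhealthy)


def rules_touching(indicator):
    """Return the rules that must be edited if the indicator changes or is removed."""
    return sorted(name for name, required in RULES.items() if indicator in required)


# One underlying problem drives all three indicators unhealthy.
fired = evaluate_rules({"db_ping_failed", "query_timeout", "disk_latency_high"})
print(fired)                            # both rules report the same problem
print(rules_touching("query_timeout"))  # both rules depend on the shared indicator
```

Because the two assumed rules share `query_timeout`, a single problem is reported twice, and removing that indicator from the health model would require editing both rules.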