With the increased use of networked computing devices and systems, it has become necessary to monitor these systems for problems, since such problems can have a far-reaching impact. Monitoring these systems manually in the traditional manner is impractical and in many instances impossible.
Seasoned experts can listen to the “hum” of an engine or hundreds of machines in a factory and determine whether the machines are operating properly. This technique cannot be used effectively with servers or data centers. Analysis of a server or data center requires a subject matter expert who is familiar with the normal ebb and flow of the business cycles, and with their specific effects on the server in question, and who may need to make hundreds of measurements before reaching a conclusion. Performing this analysis manually on hundreds of servers in a data center or across an enterprise would be overly burdensome and probably impossible.
Accordingly, techniques have been developed for monitoring computing systems for abnormalities. One technique identifies parameters for sampling and compares the sampled results against a fixed threshold. If the fixed threshold is exceeded, the technique reports an abnormality. This technique often produces erroneous results. Since no two systems are likely to have identical configurations, conditions, and usage patterns, thresholds are difficult to set. If the threshold is set too high, a large percentage of abnormalities go undetected; if it is set too low, the technique generates an excessive number of alerts in normal situations. Hence, despite its cost and complexity, this technique causes over-reporting in some areas and under-reporting in others, and even these areas can change over the course of a single day.
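The fixed-threshold difficulty can be illustrated with a minimal sketch. The function, the threshold values, and the sample streams below are hypothetical, chosen only to show how a single static cutoff over-reports during a busy period and under-reports during a quiet one.

```python
# Illustrative sketch (not from the source): a fixed-threshold monitor.
def detect_abnormalities(samples, threshold):
    """Flag the index of every sample that exceeds a fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]

# A daytime workload routinely runs hot; the same CPU level at night
# would be a genuine anomaly.
daytime = [72, 78, 75, 81, 77]    # normal business-hours CPU %
nighttime = [12, 15, 88, 11, 14]  # an 88% spike at 3 a.m. is abnormal

# One threshold cannot serve both periods:
print(detect_abnormalities(daytime, 70))    # [0, 1, 2, 3, 4] -- every normal sample alerts
print(detect_abnormalities(nighttime, 90))  # [] -- the real spike is missed
```

A threshold low enough to catch the nighttime spike would flood the operator with daytime alerts, which is exactly the over- and under-reporting tradeoff described above.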
Currently, a significant portion of the data collected from agents is numeric data. The volume can increase tenfold to a hundredfold or more as the number of variables tracked and the resolution needed increase. The data may ultimately be used for aggregation, trending, capacity planning, and reporting. In most cases the raw data collected is never used. The burden of this data collection seriously limits scalability.
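Since the raw readings are mostly unused, one common mitigation is to collapse them into fixed-size summaries suitable for trending and capacity planning. The sketch below is an assumption-laden illustration, not a technique described in the source; the bucket size and the choice of (min, mean, max) summaries are arbitrary.

```python
# Hypothetical sketch: summarize raw numeric samples into per-bucket
# (min, mean, max) tuples so the raw readings need not be retained.
def aggregate(samples, bucket_size):
    """Collapse raw samples into one (min, mean, max) summary per bucket."""
    buckets = []
    for start in range(0, len(samples), bucket_size):
        chunk = samples[start:start + bucket_size]
        buckets.append((min(chunk), sum(chunk) / len(chunk), max(chunk)))
    return buckets

raw = [40, 42, 44, 90, 43, 39, 41, 45]  # eight raw readings
print(aggregate(raw, 4))  # two summaries replace eight stored values
```

Each bucket preserves the extremes and the average that downstream reporting typically needs, while the storage cost grows with the number of buckets rather than with the sampling resolution.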
These difficulties have resulted in customer demands for consulting services to fine-tune thresholds. Customers additionally demand knowledge authoring environments so that administrators can make custom changes, as well as provisions for overriding rules, including conflict resolution policies at the group and server level. Additionally, customers demand deep discovery of attributes for personalization at a per-instance level.
Unfortunately, these demands have been difficult to meet. Most administrators do not understand the variables involved in their systems, and additionally do not understand their own installations well enough to set thresholds judiciously. Furthermore, personalization across a large number of servers is too large a task even for experts. Consulting based solely on system parameters fails to account for cyclical business rhythms and is apt to overlook a majority of abnormalities. Finally, even if consultants are initially able to address the needs of a particular system, the thresholds rapidly become obsolete as business cycles and configurations change.
Existing techniques using agents have involved a large amount of data collection and storage. The amount of data can increase rapidly as variables are tracked over time. Much of the data collected is never used, and the retention of excessive data limits system scalability.
A technique is needed for automatically providing abnormality detection while avoiding the aforementioned difficulties. The technique should avoid retention of excessive data and should be adaptable to functioning within a variety of environments and processes.
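One way such a technique might work, sketched here purely as an illustration rather than as the method this document proposes, is an online baseline that learns each metric's normal range from its own history. The example below uses Welford's streaming algorithm: it retains only three numbers per metric (a count, a running mean, and a variance accumulator), so no raw samples are stored, and the alerting boundary adapts to the system rather than being fixed in advance. The class name and the deviation multiplier `k` are assumptions.

```python
import math

# Illustrative sketch: an adaptive baseline using Welford's online
# algorithm. State per metric is three numbers; no raw data is retained.
class AdaptiveBaseline:
    def __init__(self, k=3.0):
        self.k = k        # flag samples more than k std devs from the mean
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0     # running sum of squared deviations (Welford)

    def update(self, x):
        """Fold one sample into the baseline; return True if it was abnormal."""
        abnormal = False
        if self.n >= 2:
            std = math.sqrt(self.m2 / (self.n - 1))
            abnormal = std > 0 and abs(x - self.mean) > self.k * std
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return abnormal

baseline = AdaptiveBaseline()
readings = [50, 52, 49, 51, 50, 48, 51, 95]  # last value is a spike
flags = [baseline.update(r) for r in readings]
print(flags)  # only the final spike is flagged
```

Because the baseline is computed per metric from observed behavior, the same code serves a busy daytime server and a quiet nighttime one without manual threshold tuning, and its constant-size state avoids the data-retention burden described above.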