Performing surveillance and monitoring status and performance parameters of IT (information technology) systems may be accomplished via any one or more of a number of methods, which are well known in the art.
In one method, software agents may be used. A software agent executes code that determines the value of a parameter (for example, the percentage of CPU power used), either on a timer or in response to a request received by the agent, and reports the value to a remote server or console. A remote intelligence, which is either the console itself or a server between the console and the agent, receives and displays the value of the parameter. The intelligence may also perform threshold processing to determine whether the value indicates a status change of the monitored resource. An example of this scheme using the standardized Simple Network Management Protocol (SNMP) is shown in FIG. 1.
A software agent as described above may have the capability of performing threshold processing locally to determine whether or not a new value indicates a change in status of the monitored resource, and can send a message to a remote console or server to trigger further processing, such as displaying or storing the message. If such a message is related to the status of a business process that requires the monitored resource, rules may be evaluated to determine the appropriate actions to be taken. An example of this variation using SNMP is shown in FIG. 2.
In another method of monitoring known in the prior art, the operating system or an application can be remotely queried using a standard protocol or a proprietary protocol to obtain parameter values, which may be obtained based on a timer or measured on request. A remote intelligence, which is either the console itself or a server between the console and the monitored resource, receives the value, displays the value, or determines whether the value indicates a status change of the monitored resource. An example of this method using Microsoft's WMI is shown in FIG. 3.
The methods mentioned above can use standard protocols, such as SNMP, or can be implemented using proprietary communication protocols. Such proprietary protocols may be found in the prior art in products such as HP's OpenView, BMC Patrol, CA Unicenter, etc.
Current IT monitoring systems raise “events” or “alerts” based on observations at the monitored resource or measurements of parameter values at the monitored resource. An example of such observations may be the presence of all required files and processes of a running software application. Likewise, examples of a measurement of a parameter value may be the temperature of the CPU chip or the cache hit rates of a database engine.
Based on observations and measurements, a “status” may be determined for each parameter, each instance of the monitored resource, the whole class of monitored resources or the overall IT system containing the monitored resources. The health and performance of a business process can then be derived from the known alerts on the IT resources that are required for that particular business process.
Current IT monitoring systems use thresholds defined for monitored parameter values (also called parameters, or variables in SNMP) together with Boolean logic to determine the status of a monitored resource or whether an alert should be raised. See, for example, U.S. Pat. No. 5,655,081. Using thresholds and Boolean logic may lead to results that differ from the conclusions that would be drawn by normal human reasoning.
For example, if the percentage of used bytes on a storage device is monitored and a threshold is defined at 70%, then a conventional monitoring console will show an OK-status when the percentage of used bytes is at 69.99%. When on the next sampling interval the percentage of used bytes goes to 70.01%, the monitoring console will show a not-OK-status for the monitored resource. Additionally, traditional methods allowing decision making close to the source, such as using some form of agent as described above, may send and store event records for notifying remote consoles or servers for logging and secondary notifications.
Therefore, under prior art methods and systems, a value of 69.99% will go unnoticed, while a value of 70.01%, which just exceeds the threshold, may cause one or more reactions coupled to the raised event, such as an incident record being created, reported and stored in the central management database (CMDB). As a result, problem analysis will be started and, depending on the degree of automation, a number of personnel will have to look at the situation, make some judgment about it and initiate remedial action, because the situation is perceived as an alert situation.
The value in the given example has actually changed by less than 1 per mille. Normal human reasoning would likely dictate that no action be taken in response to such a minuscule change, but that the parameter be watched to see whether a trend is developing that may eventually fill the storage device, in which case remedial action may need to be taken to prevent the problem.
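The disk-usage example can be sketched in a few lines of Python. This is only an illustration of the single-threshold behavior described above; the function name `boolean_status` and its default threshold argument are hypothetical and are not taken from any particular monitoring product.

```python
def boolean_status(used_percent: float, threshold: float = 70.0) -> str:
    """Classify a disk-usage reading with one hard threshold (Boolean logic)."""
    return "not-OK" if used_percent >= threshold else "OK"


# Two consecutive samples from the example in the text.
before = boolean_status(69.99)
after = boolean_status(70.01)

# Size of the change, in percentage points of the device capacity.
change = 70.01 - 69.99

print(before, after, round(change, 2))
```

The status flips from OK to not-OK although the reading moved by only 0.02 percentage points, which is the behavior the preceding paragraphs criticize.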
An example of Boolean logic potentially leading to overreaction can be found in application monitoring, where one of the above-mentioned methods known in the prior art might trigger an alert based on the existence or non-existence of a running process. It is likely that only at very high levels of reasoning about the business process is it possible to determine whether this operating-system-level process is critical to the business process. If this process is one of many work processes inside a multi-process application, then in many cases the application will recover from the situation by restarting the process, or the application may even have terminated the process deliberately and may not want it to be restarted.
If intelligence more like human reasoning could be applied at the source of the alert instead of simple Boolean logic, then the alert might not be triggered at all.
Some traditional IT-management systems try to overcome the inadequacy of applying Boolean logic to thresholds by defining multiple thresholds for various levels of alert, such as “warning”, “alarm”, “critical” or other schemes of thresholds associated with different “severities”. This only multiplies the underlying problem of a minuscule change triggering or not triggering an alert and a change of status of the monitored resource. It also increases the number of alert messages that need to be processed, because the smaller the intervals between thresholds become, the more alerts will be triggered by fluctuations of the parameter value. The example in FIG. 4 has 18 threshold crossings on 9 thresholds, while there are only 2 threshold crossings in a 2-threshold scheme.
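The effect of adding thresholds can be illustrated with a short Python sketch that counts threshold crossings in a fluctuating series. The sample values and both threshold schemes below are invented for illustration and do not reproduce the data of FIG. 4; `count_crossings` is a hypothetical helper.

```python
def count_crossings(samples, thresholds):
    """Count how often two consecutive samples lie on opposite sides of a threshold."""
    crossings = 0
    for prev, curr in zip(samples, samples[1:]):
        for t in thresholds:
            # A crossing occurs when exactly one of the two samples is below t.
            if (prev < t) != (curr < t):
                crossings += 1
    return crossings


# Hypothetical parameter values that fluctuate while trending upwards.
samples = [42, 58, 47, 63, 52, 68, 57]

coarse = [50, 90]                  # assumed two-threshold scheme
fine = list(range(10, 100, 10))    # assumed nine-threshold scheme: 10, 20, ..., 90

print(count_crossings(samples, coarse), count_crossings(samples, fine))
```

With the same fluctuating input, the finer scheme registers more crossings, and each crossing would correspond to an alert message to be processed.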
Monitoring the rate of change of a parameter value, and having a threshold defined for it, is also a way to soften the impact of the problem. When examining the leading products in the field, it has been found that only a few utilize the monitoring of the rate of change of a parameter.
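A rate-of-change check can be sketched as follows. The sampling interval, the rate limit and the sample values are assumptions chosen for illustration, and the function names are hypothetical.

```python
def rate_of_change(samples, interval_seconds):
    """Return the change per second between consecutive samples."""
    return [(curr - prev) / interval_seconds
            for prev, curr in zip(samples, samples[1:])]


def rate_alerts(samples, interval_seconds, max_rate):
    """Flag each sampling step whose absolute rate of change exceeds max_rate."""
    return [abs(rate) > max_rate
            for rate in rate_of_change(samples, interval_seconds)]


# Hypothetical disk-usage readings (percent), sampled every 60 seconds.
usage = [69.99, 70.01, 70.02, 75.0]

# Assumed limit: 0.01 percentage points per second.
print(rate_alerts(usage, interval_seconds=60, max_rate=0.01))
```

Here the small step from 69.99% to 70.01% raises no alert, while the sudden jump to 75% does. Note that the rate limit is itself still a threshold evaluated with Boolean logic.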
Current IT management systems typically utilize one or two commonly known ways to inform IT management personnel about the current situation. In one method, one or more lists of messages are displayed on a console. The displayed messages may be color-coded according to a scheme that relates a specific color to each severity level of the message. In a second method, the monitored resources are represented by graphical objects (typically icons) on the console. The graphical objects may form a hierarchy, which visualizes the relation among the resources. The graphical objects may be color-coded based on a status derived from the worst-severity message that has not been acknowledged by IT management personnel, or the status derived from threshold analysis or its upwards propagation in the hierarchy of resources.
Additionally, a few products allow a “drill down” (typically using mouse clicks) to a graphical representation of raw parameter values via gauges or various forms of graphs.
Fuzzy Logic and its Traditional Applications.
Boolean logic is a two-state logic (FALSE, TRUE) with operations such as NOT, AND, OR and XOR. It has been known since the ancient Greek philosophers used it, and it entered the digital age when ‘flip chip’ modules (pieces of hardware that evaluated Boolean expressions) were invented as predecessors to digital computers. By assigning TRUE to “1” and FALSE to “0”, a set of logical values became equivalent to a binary number.
Boolean logic is still the basis of most digital computing, but when a human uses a computer to compute a mathematical function, no thought is given to what happens in each transistor of the computer. However, the tendency remains, when making decisions in programs, to fall back to a very low level and use Boolean logic to implement something very complex, such as reasoning.
The switch for a light bulb is still used as the classic example for Boolean logic. The switch is either OFF or ON and, consequently, the light bulb is OFF or ON. With multiple light bulbs and switches the Boolean operations of A AND B, A OR B, NOT A, and so on, can be nicely demonstrated.
The invention of the “dimmer” allows the light to be dimmed over a continuum of brightness between OFF and ON. The threshold concept could now be used to define that at a brightness of >80% the light is called ON and otherwise it is called OFF. The value of 80% appears to be an arbitrary selection; a threshold at 50% would work as well. This is exactly what is currently done in IT management. The definition of the thresholds is typically not the result of a precise calculation or a determination from necessary conditions; it is an arbitrary value within a range of values that seem reasonable based on experience.
In the same way that a person would have difficulty explaining why a brightness below 80% should be considered OFF, an IT manager has difficulty explaining why a disk that is 69.99% full is OK, while a disk that is 70.01% full is not-OK.
The concept of “fuzzy sets” has been known since about 1965. Since then, the theory grew into the concept of “fuzzy logic”. Today fuzzy logic finds increasing acceptance in control circuits of industrial processes (e.g. concrete mixing), commuter trains (brakes in Tokyo subway) as well as in household appliances (vacuum cleaners, washing machines, heating systems, etc.).
Its main feature is that fuzzy logic allows practical decisions to be made in situations that are either not analytically understood or far too complex for a complete analytical representation. It may also be that a complete, calculation-intensive analytical model offers no benefit over a simple fuzzy logic approach.
In traditional set theory, set membership is described by Boolean logic (A either is or is not a member of set X). Set membership in fuzzy set theory is derived by applying a “membership function”, which, in normalized form, returns a value between zero and one. Thus, situations where human reasoning would conclude “rarely”, “sometimes” or “often” rather than “always” or “never” can be easily modeled. An easy-to-understand example of a membership function is the special case where the membership function returns a probability (normalized between zero and one). For example, Independence Day has a membership function value of about 1/7 for each day in the set of days-of-the-week.
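A normalized membership function can be sketched in Python as follows. The linear ramp shape and the breakpoints at 50% and 90% are assumptions chosen only to illustrate the idea for the fuzzy set “disk is full”; they are not taken from this document, and `membership_full` is a hypothetical name.

```python
def membership_full(used_percent: float,
                    low: float = 50.0,
                    high: float = 90.0) -> float:
    """Degree (0.0 to 1.0) to which a disk belongs to the fuzzy set 'full'.

    Assumed shape: 0 at or below `low`, 1 at or above `high`,
    and a linear ramp in between.
    """
    if used_percent <= low:
        return 0.0
    if used_percent >= high:
        return 1.0
    return (used_percent - low) / (high - low)


# The two readings from the disk-usage example discussed earlier.
m_before = membership_full(69.99)
m_after = membership_full(70.01)

print(round(m_before, 5), round(m_after, 5))
```

Under these assumptions, the two readings that produce opposite statuses under a hard 70% threshold receive nearly identical degrees of membership, so a minuscule change in the parameter produces only a minuscule change in the result.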
It would therefore be advantageous to provide a method and apparatus to overcome the disadvantages of the threshold concept and Boolean logic based status processing and propagation.