The invention generally relates to the field of supervision or monitoring of errors, or "disturbances", in processes. As a specific field in this regard, performance management of telecommunication operations according to TMN (Telecommunications Management Network) standards can be mentioned, cf. ITU-T Recommendation M.3400.
More particularly, the invention relates to a method for performing, in a computer-controlled process, an algorithm-controlled monitoring of disturbances apt to occur at random or in bursts in the process, said monitoring using counting values obtained from a counter for counting said disturbances.
There are many examples of disturbances in a software-controlled telecommunication system, among which can be mentioned parity errors, sporadic hardware faults, bit-correction errors, cyclic-redundancy-check (CRC) errors, congested call attempts, synchronization slip, protocol errors, signalling errors in line or register signalling, program exception during run-time, violation of the software contract at an interface.
There are also many cases of disturbance outside the field of telecommunications, such as errors appearing when making a copy on a photocopier, false results in a blood test, misfiring of an internal-combustion engine, production faults in the manufacture of an electronic component or of a printed-circuit board.
All such disturbances are unavoidable, and there is no reason to intervene for a single disturbance in order to find its cause. However, it is necessary to monitor automatically the disturbance rate or frequency. If the disturbance frequency remains at a low predictable and acceptable level, this can be accepted. But if the rate of disturbances rises to an unacceptable level, then the monitoring mechanism must raise an alarm, or send a notification, requesting manual intervention to find the cause of the excess disturbances.
In the field of telecommunications, a specific form of disturbance monitoring has been known as "disturbance supervision", as described in U.S. Pat. No. 5,377,195, and implemented in the Ericsson AXE 10 system. Currently, the expression "QOS measurement" (Quality-Of-Service measurement) is used, as part of the performance management specified by TMN standards. QOS measurements do not consider the physical processes that cause disturbances.
QOS measurements are well specified by the standards, cf. for example, ITU-T G.821 on #7 signalling, concerning error rates. However, there are no guidelines on how to set thresholds so as to obtain meaningful results. In practice, thresholds are set empirically. There is no method for setting thresholds in a systematic way. Often, the results from QOS measurements are so unreliable that they are worse than useless. They give false results, and can be such an irritant to maintenance personnel that the measurements are turned off.
There are several possible algorithms that can be used in QOS measurements. One of these is the so-called Leaky Bucket algorithm. This algorithm is potentially well suited to QOS measurements, but it is associated with some problems which need to be solved. The mathematical analysis of the leaky bucket is not easy. There is too little knowledge available about the behaviour of the disturbances that need to be measured by QOS. In practice, disturbances do not occur at random, a case which is relatively easy to analyse, but in bursts, a case which is harder to analyse. A satisfactory solution to the problem requires that bursty behaviour be treated correctly. As the behaviour of QOS measurements is stochastic, no results are 100% reliable. There is always a risk of false positive or false negative results. These risks must be taken into consideration when setting good values for the thresholds.
In fact there have seemed to be no satisfactory solutions available to these problems.
The method according to the invention, as defined by way of introduction, deals with the above discussed problems by comprising the steps of
i) defining an abnormal event regarded to be a disturbance,
ii) defining a base against which disturbances are to be counted,
iii) defining a unit to be used as a measure of a disturbance frequency,
iv) determining values of the disturbance frequency in a variety of circumstances that can be expected in operation of a process generating the disturbance to be monitored, said values including a critical value fC of the disturbance frequency where the monitoring nominally issues an alarm,
v) determining for the process, at said critical value, a peakedness factor F, being a measure of how bursty the disturbances are, as the ratio of the variance to the mean of occurrences of disturbances in the process,
vi) choosing for the algorithm an inertia value J being a measure of how fast or slowly the algorithm is desired to react to changes in the disturbance frequency, so as to achieve an acceptable compromise between speed and reliability of the monitoring,
vii) calculating parameters for the monitoring based upon the disturbance frequency value fC, the peakedness factor F and the inertia value J, and using said parameters to calculate, according to T=(1/fC)*J*F, a threshold value T of the counter considered to be unacceptable,
viii) designing the algorithm for the monitoring with said parameters,
ix) initiating the monitoring and waiting for results thereof,
x) evaluating the results and, if necessary, adjusting the parameters.
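Steps iv) to vii) can be sketched in code as follows. This is a minimal illustration rather than the claimed implementation: the names `BucketParams` and `bucket_params` are hypothetical, and the split of the threshold into a disturbance step d = 1/fC and a bucket size h = J*F is an assumption chosen to match the product 1/fC*J*F given in step vii).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketParams:
    d: float  # disturbance step: counter increment per disturbance (assumed 1/fC)
    h: float  # bucket size in units of the disturbance step (assumed J*F)
    T: float  # threshold value of the counter, T = (1/fC)*J*F


def bucket_params(f_c: float, J: float, F: float) -> BucketParams:
    """Derive monitoring parameters from critical frequency, inertia and peakedness."""
    if f_c <= 0 or J <= 0 or F <= 0:
        raise ValueError("f_c, J and F must all be positive")
    d = 1.0 / f_c
    h = J * F
    return BucketParams(d=d, h=h, T=d * h)


# Illustrative values only: 1 disturbance per 100 base events, J = 20, F = 2.
params = bucket_params(f_c=0.01, J=20.0, F=2.0)
```

A larger J or F raises T, which makes the monitoring slower to react but less prone to false alarms.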
In the above defined method the step of defining a base comprises determining whether the base should be a unit of time, a base event, or an artificial base, the outcome being a random variable able to take a value indicating normal event or disturbance.
In an important embodiment of the invention the condition is used that the disturbance frequency measured against all base events is indistinguishable from the frequency measured just against normal events.
In a further embodiment of the invention, there is determined, besides the value of the critical frequency, the values of one or more of the following further levels of the disturbance frequency:
fN=normal frequency in operation,
fR=raised frequency in operation, but one that is still acceptable,
fE=excessive frequency, at which the working of the equipment is degraded,
fU=unacceptable frequency, where there are too many disturbances for normal operation.
In a further very important embodiment of the invention the bursty behaviour is considered solely on the basis of the peakedness factor, together with the disturbance frequency.
In one embodiment of the invention, using the Leaky Bucket algorithm, the value for the inertia is used as a multiplier on the size of the leaky bucket.
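A leaky bucket counter of this kind might look like the sketch below. The class name `LeakyBucket` is hypothetical, and the leak of exactly 1 per normal event, with the counter never dropping below 0, is an assumption about the counting scheme rather than something stated in this passage.

```python
class LeakyBucket:
    """Counter that rises by d per disturbance and leaks 1 per normal event."""

    def __init__(self, d: float, threshold: float) -> None:
        self.d = d                  # disturbance step
        self.threshold = threshold  # counter value regarded as unacceptable
        self.level = 0.0

    def observe(self, disturbance: bool) -> bool:
        """Process one base event; return True if the threshold is reached."""
        if disturbance:
            self.level += self.d
        else:
            # Assumed leak rate: one unit per normal event, floored at zero.
            self.level = max(0.0, self.level - 1.0)
        return self.level >= self.threshold


bucket = LeakyBucket(d=10.0, threshold=30.0)
# Illustrative event stream: True marks a disturbance, False a normal event.
alarms = [bucket.observe(e) for e in [True, False, True, True, True]]
```

Because the inertia multiplies the bucket size, doubling J doubles the threshold passed to the counter and so roughly doubles the number of excess disturbances needed before an alarm.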
A further embodiment of the method according to the invention includes the step of producing a risk table including a number of columns, of which four columns contain, in turn, level of disturbance frequency, bias, being expected change of a counter value after a base event, value of the disturbance frequency, and risk of false result, respectively, by selecting a suitable set of values of the bias, calculating values of the disturbance frequency by adjusting the critical frequency with the respective values of the bias, and setting values for risks based upon measurements, economic analysis, experience, judgement or intuition.
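The construction of such a risk table can be sketched as below. The function name `risk_table` and the bias values are hypothetical, and the adjustment f = fC*(1+b) is an assumption about how the critical frequency is adjusted by the bias. The risk column is left to the caller, since the text says it is set from measurements, economic analysis, experience, judgement or intuition.

```python
def risk_table(f_c, levels):
    """Build risk-table rows from (level name, bias, risk) triples.

    The frequency column uses the assumed adjustment f = f_c * (1 + bias).
    """
    rows = []
    for name, b, risk in levels:
        rows.append({
            "level": name,
            "bias": b,
            "frequency": f_c * (1.0 + b),
            "risk": risk,  # supplied by the user, not computed here
        })
    return rows


# Illustrative biases only; risks are left unset (None) on purpose.
table = risk_table(0.01, [
    ("fN", -0.5, None),
    ("fR", -0.2, None),
    ("fE", 0.2, None),
    ("fU", 0.5, None),
])
```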
In a further embodiment of the method according to the invention, the step of evaluating the results comprises
a first substep of investigating whether measurements can be regarded as reliable, and, if yes, ending by taking no further action,
a second substep that, if the first substep reveals that measurements are not reliable, comprises investigating three possible sources of error, viz. whether 1) there are too many false alarms, 2) faulty equipment stays in service, or 3) the time to get results is too long, and
a third substep of performing one of the following three steps,
(i) if there are too many false alarms, increasing the value of fC, or increasing the value of J or F, recalculating d and T and returning to the first substep,
(ii) if faulty equipment stays in service without raising an alarm, reducing fC, or reducing J or F, recalculating d and T and returning to the first substep,
(iii) if the time to get results is too long, reducing the value of J or F, recalculating d and T and returning to the first substep.
According to an important embodiment of the invention, the evaluating step includes a step of determining the probability of obtaining a false result in the monitoring, based upon using a Leaky Bucket algorithm in which said probability is defined as u(d,b,h,F), wherein
d=disturbance step is the amount by which a leaky bucket counter is incremented for each disturbance,
b=bias is the expected change of a counter value after a base event, b<0 implying a false positive result obtained when alarm is given, even though there is nothing wrong with a supervised object, and b>0 implying a false negative result obtained when no alarm is given, even though there is something wrong with the supervised object,
h=size of the bucket, measured in units of the disturbance step,
F=peakedness factor for the disturbance process.
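The sign convention for the bias can be illustrated with a short sketch. It assumes that the counter rises by d on a disturbance and falls by 1 on a normal event, so that the expected change per base event is d*P{disturbance} - P{normal event}. The function names are hypothetical.

```python
def bias(p_disturbance: float, d: float) -> float:
    """Expected counter change per base event under the assumed +d / -1 scheme."""
    return p_disturbance * d - (1.0 - p_disturbance)


def possible_false_result(b: float) -> str:
    """Which kind of false result the sign of the bias makes possible."""
    if b < 0:
        return "false positive"  # an alarm would be wrong: object is healthy
    if b > 0:
        return "false negative"  # a missing alarm would be wrong: object is faulty
    return "critical"            # counter has no drift either way


# Illustrative: 1% disturbances with a step of 50 gives a downward drift.
b_example = bias(0.01, 50.0)
```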
In the above connection, the step of determining the probability of obtaining a false result can include the substeps of
entering as parameters:
disturbance step d, bias b and size h of bucket,
initializing as variables:
r=P{normal event}/P{disturbance}, wherein P{normal event} means probability of a normal event appearing and P{disturbance} means probability of a disturbance appearing,
a=h*d being size of the bucket in units of 1,
determining whether bias b=0, b<0 or b>0,
calculating, if bias b=0, boundaries of probability u(a/2), while using the inequality
$$\frac{a-z}{a} \le u(z) \le \frac{a+d-z-1}{a+d-1}$$
wherein u(z) means probability of hitting the floor of the bucket, given starting point z,
producing upper and lower bounds, and average for the probability u(a/2),
solving with binary search, if bias b is not 0, the equation f(s)=r+s**(d+1)-(r+1)*s=0, in either the range 1<s<2 for b<0, or in the range 0<s<1 for b>0, wherein s is a dummy variable,
calculating boundaries of probability u(a/2) using the inequality
$$\frac{s^{a}-s^{z}}{s^{a}-1} \le u(z) \le \frac{s^{a+d-1}-s^{z}}{s^{a+d-1}-1}$$
producing upper and lower bounds, and average, for probability u(a/2).
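The substeps above can be sketched as follows for randomly occurring (non-bursty) disturbances. The function names are hypothetical. The initialization r = (d - b)/(1 + b) is an assumption: it follows if each base event independently moves the counter by +d with probability P{disturbance} and by -1 with probability P{normal event}, so that b = d*P{disturbance} - P{normal event}. The bounds are evaluated through `math.expm1` only to avoid overflow for large buckets; they are algebraically the same expressions as in the inequalities.

```python
import math


def _floor_probability(s: float, z: float, c: float) -> float:
    """(s**c - s**z) / (s**c - 1), rewritten as 1 - (s**z - 1)/(s**c - 1)."""
    log_s = math.log(s)
    if log_s > 0:
        ratio = (math.exp((z - c) * log_s)
                 * math.expm1(-z * log_s) / math.expm1(-c * log_s))
    else:
        ratio = math.expm1(z * log_s) / math.expm1(c * log_s)
    return 1.0 - ratio


def false_result_bounds(d: float, b: float, h: float, tol: float = 1e-12):
    """Return (lower, upper, average) bounds on u(a/2), the floor-hit probability."""
    if not -1.0 < b < d:
        raise ValueError("bias must satisfy -1 < b < d")
    a = h * d                         # bucket size in units of 1
    z = a / 2.0                       # starting point: mid-bucket
    c_low, c_high = a, a + d - 1.0    # smallest and largest overshoot of the ceiling
    if b == 0:
        lower = (c_low - z) / c_low
        upper = (c_high - z) / c_high
    else:
        r = (d - b) / (1.0 + b)       # assumed: P{normal event}/P{disturbance}

        def f(s: float) -> float:
            return r + s ** (d + 1) - (r + 1.0) * s

        lo, hi = (1.0, 2.0) if b < 0 else (0.0, 1.0)
        if b < 0 and f(hi) <= 0:
            raise ValueError("root does not lie in the range 1 < s < 2")
        while hi - lo > tol:
            mid = (lo + hi) / 2.0
            # f is negative strictly between its two positive roots (one is s = 1).
            if (f(mid) < 0) == (b < 0):
                lo = mid
            else:
                hi = mid
        s = (lo + hi) / 2.0
        lower = _floor_probability(s, z, c_low)
        upper = _floor_probability(s, z, c_high)
    return lower, upper, (lower + upper) / 2.0


# Illustrative call: step 2, no bias, bucket of 4 disturbance steps.
lower, upper, average = false_result_bounds(d=2, b=0, h=4)
```

On the reading that an alarm corresponds to the counter reaching the ceiling, u(a/2) would be the risk of a false negative when b>0, and 1 - u(a/2) the risk of a false positive when b<0.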
The step of determining the probability of obtaining a false result can include the substeps of
entering as parameters:
disturbance step d, bias b, peakedness F and size h of bucket,
initializing as variables:
a state transition probability matrix for consecutive base events X(n) and X(n+1), each taking the value 0 (normal event) or 1 (disturbance):
$$\begin{array}{c|cc} & X(n+1)=0 & X(n+1)=1 \\ \hline X(n)=0 & p & q \\ X(n)=1 & Q & P \end{array}$$
where:
P>q and Q<p;
p=P{X(n)=normal event, 0 and X(n+1)=normal event, 0},
q=P{X(n)=normal event, 0 and X(n+1)=disturbance, 1},
Q=P{X(n)=disturbance, 1 and X(n+1)=normal event, 0},
P=P{X(n)=disturbance, 1 and X(n+1)=disturbance, 1};
the steady-state probabilities for the two-state model are:
x=P{X(n)=0}=Q/(Q+q)
y=P{X(n)=1}=q/(Q+q)
a probability distribution for time=0,
performing, in a loop through time t while weight > 0.000001, weight being the probability of the counter remaining between the boundaries of the bucket, the substeps of
calculating probability P{state=0 and counter=i} at time=t+1,
calculating probability P{state=1 and counter=i} at time=t+1,
calculating probability P{counter hitting floor or ceiling} at time=t+1,
calculating component of mean and mean square for duration of measurement at time=t+1,
calculating weight, preparing for the next iteration of the loop by shifting values, and ending loop,
calculating variance and standard deviation of duration for the measurement,
producing probability of hitting floor and hitting ceiling,
producing mean and standard deviation of duration.
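The loop described above can be sketched as follows for the two-state model. Several points are assumptions rather than statements from the text: the function name `bursty_bucket` is hypothetical; the transition probabilities q and Q are passed in directly, because the mapping from bias b and peakedness F to the matrix entries is not spelled out in this passage; the counter is assumed to start at mid-bucket with the state drawn from the steady-state probabilities; and the counter is assumed to move by +d on a disturbance and by -1 on a normal event, with d an integer.

```python
import math


def bursty_bucket(d: int, h: int, q: float, Q: float, eps: float = 1e-6) -> dict:
    """Propagate P{state, counter} through time until almost all mass has left."""
    if not (0.0 < q < 1.0 and 0.0 < Q < 1.0):
        raise ValueError("q and Q must lie strictly between 0 and 1")
    a = h * d                         # ceiling in counter units; the floor is 0
    if a < 2:
        raise ValueError("bucket must contain at least one interior value")
    p, P = 1.0 - q, 1.0 - Q           # each matrix row sums to 1
    x, y = Q / (Q + q), q / (Q + q)   # steady-state P{normal}, P{disturbance}

    # prob[s][i] = P{state = s and counter = i}; only 1..a-1 lies inside the bucket.
    prob = [[0.0] * (a + 1) for _ in range(2)]
    prob[0][a // 2], prob[1][a // 2] = x, y   # assumed distribution at time 0

    floor = ceiling = mean = mean_sq = 0.0
    weight, t = 1.0, 0
    while weight > eps:
        t += 1
        nxt = [[0.0] * (a + 1) for _ in range(2)]
        hit_floor = hit_ceiling = 0.0
        for i in range(1, a):
            to_normal = prob[0][i] * p + prob[1][i] * Q
            to_disturbance = prob[0][i] * q + prob[1][i] * P
            if i - 1 <= 0:
                hit_floor += to_normal
            else:
                nxt[0][i - 1] += to_normal
            if i + d >= a:
                hit_ceiling += to_disturbance
            else:
                nxt[1][i + d] += to_disturbance
        floor += hit_floor
        ceiling += hit_ceiling
        stopped = hit_floor + hit_ceiling
        mean += t * stopped           # component of mean duration
        mean_sq += t * t * stopped    # component of mean square duration
        weight = sum(nxt[0]) + sum(nxt[1])
        prob = nxt                    # shift values for the next iteration

    variance = max(0.0, mean_sq - mean * mean)
    return {
        "p_floor": floor,
        "p_ceiling": ceiling,
        "mean_duration": mean,
        "std_duration": math.sqrt(variance),
    }


# Illustrative bursty case: a disturbance is likely to be followed by another.
result = bursty_bucket(d=5, h=4, q=0.1, Q=0.4)
```

The results are truncated by at most the residual weight eps, so the two probabilities sum to slightly less than 1 and the duration moments are marginally underestimated.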