This invention relates to the field of measurement of activity of various kinds, such as (but not restricted to) traffic in a data communication network, and more particularly to a method of measuring the activity to an improved accuracy.
In measuring activity of various kinds, such as the flow of electricity, fluids, information or objects, as well as conditions, performance and other activities, measurements are taken at particular times. These measurements often include outlier points, particularly in conditions where the distribution of points is unpredictable. This problem is very severe in data communication networks.
Measurements are usually made with error; a set of measurements of the same value is often normally distributed about an exact point. However, some of these measurements will have enormous errors associated with them, due to some gross experimental mistake such as recording millimeters rather than meters. The detection and rejection of these outliers can be readily made where the expected distribution of the measurements is known and has a finite variance. However, in self similar distributions the variance is infinite (or is only limited by the maximum capacity for activity of the object). Outlier points in self similar distributions cannot therefore be recognized and rejected by using the expected distribution.
Leland et al, as described in the publication “On the Self Similar Nature of Ethernet Traffic” by W. E. Leland, W. Willinger, M. S. Taqqu, D. V. Wilson: ACM SIGCOMM, Computer Communication Review: pp 204-213, January 1995, discovered that the distribution of data network traffic is self similar, so that traffic measurements made on data networks suffer from the problem of outlier rejection.
Moreover, the frequency of outliers in data networks is extremely high. When measured over a wide variety of data networks the average outlier rate was 1%, practically all of these outliers being high. These outliers make the detection of alarm levels of activity in data networks very prone to error and also make the forecasting of activity in data networks very unreliable.
The only previously known general outlier detection method is one which rejects outlier measurements if the measurement is greater than or less than a possible range of values. All other specific methods rely on knowledge of the distribution of possible values.
A complex model allows the use of more than one variable in forecasting the distribution of a single variable, where feasible. In the absence of any model, the previous distribution of a variable is the best forecaster of the future distribution of that variable. Data communications network managers very much want to know when their networks or parts of their networks will run out of capacity. It was believed for many years that models of network behaviour could predict the future; since this used to be true for voice networks, it was assumed it would be true for data. Therefore models were developed which tried to predict future peaks based on previous means and peaks.
However, the work of Leland et al referred to above, and that of many others subsequently, has now shown that data network traffic is self similar (fractal). This implies that the variance of data network traffic is not only infinite but also is not even related to the mean. Therefore attempting to predict future peaks using any model that includes the mean is clearly wrong.
Moreover, if the variance is truly infinite, future peaks cannot even be predicted from previous peaks. In other words, self similar distributions may have a lower limit but do not have an upper limit.
However, it had not previously been observed that, since communications lines do have upper limits in capacity, the distributions in them cannot be truly self similar and their variances are definitely finite. Under these conditions the previous peak values can be used to predict future peak values, but the relationship between the mean and the peak remains indeterminate.
The idea of using linear fits to prior peaks to forecast future peaks in data communications networks had been previously invented by N. W. Dawes. Once tested, however, the problem of the peaks being heavily contaminated by invalid data points was noticed. The problem rate was found by experiment to be very high, with most forecasts being significantly faulty. Moreover, attempting to report the peak activity of any port in a network (the top talker) over even the last 24 hours was found to be routinely wrong, as the following will illustrate.
Activity values recorded by data communications devices about their own activity have been found in practice to be astonishingly error prone, with an average outlier rate of 1%. For example, attempting to determine the daily peak traffic rate on a single interface by measuring the rate every minute requires measuring 1,440 points per day, but on average 14 of these would be outliers, almost all being high. The daily peak point under these conditions would be 14 times more likely to be an outlier than a genuine value. The outliers were observed to be randomly distributed, so a simple filter that rejected activity levels outside the physical capacity of the interface was added. This rejected 10,000 outliers for every 1 accepted.
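The arithmetic and the simple capacity filter described above can be sketched as follows. This is a minimal illustration only; the function names, the units (bits per second) and the treatment of negative readings as invalid are assumptions made for the example and are not taken from the prototype.

```python
def expected_outliers_per_day(samples_per_day, outlier_rate):
    """Average number of outlier readings expected in one day of sampling."""
    return samples_per_day * outlier_rate


def capacity_filter(rates_bps, capacity_bps):
    """Keep only readings that lie within the physical capacity of the interface.

    Assumption for this sketch: a negative rate is also physically impossible
    and is rejected.
    """
    return [rate for rate in rates_bps if 0 <= rate <= capacity_bps]


# One reading per minute at a 1% outlier rate gives about 14 outliers per day.
per_day = expected_outliers_per_day(24 * 60, 0.01)

# A 10 Mbit/s interface cannot genuinely report 20 Mbit/s.
kept = capacity_filter([5e6, 2e7, 9.9e6], capacity_bps=1e7)
```

Such a filter can only reject readings that are physically impossible; an erroneous reading that happens to fall within the capacity of the interface passes through it unchanged.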
However, in monitoring even moderate sized networks of 1,000 devices and 10,000 communications interfaces, about 10 outliers still passed through this filter every day (scattered over these 10,000 interfaces). This left about 4% of all forecasts seriously in error. Moreover, analyses such as finding the busiest interface even just over the last day were routinely wrong. Analysis of the immediately previous year on such a network (a not uncommon requirement) would require 3.65 × 10^9 points to be cleared of outliers. A far better outlier rejection method is clearly required to enable both accurate historical analysis and accurate forecasting in data communications networks.
The present invention provides a method that rejected in a successful prototype approximately 10^15 outliers for every 1 accepted (in Ethernet networks), while rejecting effectively no genuine points. The invention provides similar performance on ATM, Frame Relay and other protocol based data communications networks. The present invention therefore renders practical and effective the linear forecasting method mentioned above, surprisingly only requiring use of peak data. The method can be used as a filter for the measured points.
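A linear fit to prior peaks, of the kind referred to above, might be sketched as follows once the peaks have been cleared of outliers. The specification does not state how the fit is performed; ordinary least squares is assumed here, and the function names and the example values are hypothetical.

```python
def fit_line(days, peaks):
    """Ordinary least-squares fit of peaks against days: returns (slope, intercept)."""
    n = len(days)
    if n < 2 or n != len(peaks):
        raise ValueError("need at least two (day, peak) pairs of equal length")
    mean_day = sum(days) / n
    mean_peak = sum(peaks) / n
    sxx = sum((d - mean_day) ** 2 for d in days)
    if sxx == 0:
        raise ValueError("all days are identical; the slope is undefined")
    sxy = sum((d - mean_day) * (p - mean_peak) for d, p in zip(days, peaks))
    slope = sxy / sxx
    return slope, mean_peak - slope * mean_day


def day_capacity_reached(days, peaks, capacity):
    """Day on which the fitted peak line reaches capacity, or None if not rising."""
    slope, intercept = fit_line(days, peaks)
    if slope <= 0:
        return None
    return (capacity - intercept) / slope


# Daily peaks rising by 10 units per day reach a capacity of 100 on day 9.
forecast = day_capacity_reached([0, 1, 2, 3], [10, 20, 30, 40], capacity=100)
```

Because only a handful of peak values enter such a fit, a single invalid peak can move the forecast substantially, which is why the peaks must be filtered first.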
It is an important aspect of the present invention that it does not rely on knowledge of the distribution of possible values. It provides very reliable detection and rejection of outliers and so enables very significant improvements in the accuracy of both alarm detection and activity forecasting.
The present invention has application to all fields that involve the measurement of self similar activity and all fields in which measurable activity flows from one object to another. The set of fields with self similar distributions to which the present invention has application is enormous. Therefore the small fraction of those given as examples in this specification are only some of those in which the present invention has applicability. Further, the set of fields which include measurable flows is similarly vast. The embodiments described herein should only be taken as representative of those applications, and the present invention is applicable to all such fields.
In accordance with an embodiment of the present invention, a method of detecting outliers measured during progression of an activity of an entity from one point to another, comprises measuring activity at a point in a first dimension, measuring the same activity at the same point in a second dimension at the same time as measuring the activity in the first dimension, and rejecting outliers which have values outside a maximum expected difference between the activity measured in the first and second dimensions.
The invention requires that a particular activity should be measured at the same time using different devices in different dimensions. If the two measurements disagree by more than a maximum experimental difference expected between these devices, an outlier is declared to have been detected. The maximum acceptable experimental difference is now not related to the variance with time of the measured activity. Two examples will now be given: the first uses the dimension of distance, the other uses other dimensions.
(a) Consider the traffic to be flowing from point A to point B, wherein the traffic leaving point A is the same as that arriving at point B. Therefore measuring the flow rate both at point A and at point B is the same as measuring the flow rate twice at point A, simultaneously, after adjusting for the time of flight.
This general method requires prior knowledge that A is connected to B, which can be determined by a general method, such as that described in U.S. Pat. No. 5,926,462 issued Jul. 20, 1999 entitled “Method of Determining Topology of a Network of Objects”, invented by N. W. Dawes, D. Schenkel and M. Slavitch.
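Example (a) can be sketched as follows, assuming that the flow rates at A and at B are held as mappings from an integer timestamp to a rate in the same units, and that the time of flight is a known whole number of timestamp units. The function name, the data layout and the choice to skip readings that have no counterpart are assumptions made for this illustration.

```python
def accept_matching(series_a, series_b, time_of_flight, max_expected_difference):
    """Accept readings at A that agree with the reading at B one time of flight later.

    series_a, series_b: dicts mapping integer timestamp -> flow rate.
    Returns a list of (timestamp, rate_at_a) pairs that are not outliers.
    """
    accepted = []
    for timestamp, rate_a in sorted(series_a.items()):
        rate_b = series_b.get(timestamp + time_of_flight)
        if rate_b is None:
            # No counterpart at B, so the reading cannot be checked (assumption:
            # unchecked readings are left out rather than accepted).
            continue
        if abs(rate_a - rate_b) <= max_expected_difference:
            accepted.append((timestamp, rate_a))
    return accepted


# The reading of 9000.0 at A disagrees with B by far more than 5.0 and is rejected.
at_a = {0: 100.0, 60: 120.0, 120: 9000.0}
at_b = {1: 101.0, 61: 119.0, 121: 125.0}
clean = accept_matching(at_a, at_b, time_of_flight=1, max_expected_difference=5.0)
```

The threshold here depends only on the expected disagreement between the two measuring devices, not on how widely the traffic itself varies with time.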
(b) Consider the traffic flow from point A. The flow rate should be measured in two dimensions at once at point A. Adjusting for the ratio or difference in dimensions, this too is equivalent to measuring the flow rate twice. For example, in data networks a pair of such dimensions is bytes/second and frames/second.
In a given medium type, the maximum and minimum ratios of bytes per frame are defined by standards. In Ethernet media there can only be between 64 and 1500 bytes per frame. Therefore if the ratio of the measurements of flow in bytes per second to frames per second falls outside the range 64 to 1500, an outlier has been detected.
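The ratio test for Ethernet media can be sketched as follows, using the limits of 64 and 1500 bytes per frame stated above. The function name and the handling of zero or negative rates are assumptions made for this illustration.

```python
ETHERNET_MIN_BYTES_PER_FRAME = 64
ETHERNET_MAX_BYTES_PER_FRAME = 1500


def is_ratio_outlier(bytes_per_second, frames_per_second,
                     min_bytes_per_frame=ETHERNET_MIN_BYTES_PER_FRAME,
                     max_bytes_per_frame=ETHERNET_MAX_BYTES_PER_FRAME):
    """Return True if simultaneous byte and frame rates are mutually inconsistent."""
    if bytes_per_second < 0 or frames_per_second < 0:
        return True  # assumption: negative rates are never genuine
    if frames_per_second == 0:
        # Assumption: an idle interface (no frames, no bytes) is genuine, but
        # bytes reported without any frames are not.
        return bytes_per_second != 0
    ratio = bytes_per_second / frames_per_second
    return not (min_bytes_per_frame <= ratio <= max_bytes_per_frame)


# 150,000 bytes/s over 100 frames/s is 1500 bytes per frame: accepted.
ok = is_ratio_outlier(150_000, 100)
# 1,000,000 bytes/s over 100 frames/s is 10,000 bytes per frame: an outlier.
bad = is_ratio_outlier(1_000_000, 100)
```

For another medium type, the two limits would be replaced by the minimum and maximum bytes per frame defined by the standard for that medium.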
A novel aspect of the embodiment in which synchronized measurements are made in different dimensions is that of the requirement of different dimensionality.