This invention relates to methods for deriving statistical information from a stream of data, and for updating that information.
There are many practical problems of data collection in which it is useful to summarize large volumes of data in a fast and reliable manner, while preserving as much information as possible. One class of problems of that kind relates to the collection of data that reflect the performance of a communication network. One example from that class of problems is the problem of collecting and summarizing the length of time to complete each of a sequence of transactions that take place on a network. In the case, e.g., of e-mail transaction times, the problem is made more difficult by the fact that the data arrive in small increments at random times, and because it is often desirable to reserve for processing and storing the data an amount of memory that is small relative to the volume of data to be processed.
Those of skill in the art have long been acquainted with the histogram as a means for summarizing statistical data. To create a histogram, the practitioner marks off endpoints along an axis that corresponds to the incoming data; i.e., to the measured values of the statistic that is to be characterized. Below, we will refer to the incoming measured values as scores, and to the corresponding axis as the data axis. Accordingly, the endpoints referred to above are marked off along the score axis. Each pair of successive endpoints defines an interval. The height of the histogram within each interval, measured along a probability axis perpendicular to the data axis, is proportional to the number of scores that fall within that interval. Below, we will find it convenient to refer to each such interval as a bucket.
When a histogram is based on an exhaustive set of data, it can dependably represent the statistical distribution of those data. However, if the histogram is based on only a partial set of data, it might not dependably represent the full population from which that partial set was taken. In particular, a histogram based on an initial portion of a stream of data might differ substantially from a histogram based on a longer initial portion or on a subsequent portion of the data stream.
When a stream of data arrives over time, it is often most convenient to characterize the arriving data by taking an initial sequence of data values, creating a histogram, and then updating the histogram using further data values taken from subsequently arriving data. Such a procedure is especially useful when the amount of computer memory available for processing and storing the data is limited.
The quality of a histogram depends on its ability to model the population of data that it is based on, and on its ability to preserve statistical information about that population. In both of these aspects, the quality of a histogram is affected by the setting of the endpoints that define the respective buckets, and also by the procedure used to update the histogram using later-arriving data.
In the statistical study of network performance, among other fields, there has been a recognized need for methods of processing data streams to characterize the data more reliably without sacrificing useful statistical information.
We have invented an improved method for acquiring statistical information from data, which may, e.g., be already accumulated data or data that are arriving as a data stream. According to the inventive method, an initial cumulative distribution function (CDF) that characterizes an initial set of data is acquired. The acquisition of this CDF comprises acquiring a set of quantile endpoints that define the CDF. The quantile endpoints are endpoints that could be marked off along the data axis in such a way that a defined fraction of the sampled scores would lie within each corresponding bucket.
At least one additional CDF, which characterizes a further set of data, is also acquired. Information that describes the initial CDF is combined with information that describes one or more additional CDFs, and the result is used to obtain a composite CDF that describes a combined set of data that includes the initial data set and the one or more further data sets. Then, a new set of quantile endpoints is determined, that defines the composite CDF. The sequence of steps described above is repeated at least once more. The previously obtained composite CDF is used as the initial CDF for each repetition of this sequence.