Random sampling is widely used in database systems to select representative elements from a data set. The advantage of random sampling is that the analysis of the selected samples requires much less computational power and storage capacity than the processing of the entire data set. Nevertheless, provided that the samples are selected carefully, the results obtained can be statistically valid for the entire data set.
Stream sampling of data elements in a data stream is a special form of sampling where the number of elements in the source data set is not known a priori. This kind of sampling can be used, for example, in the real-time monitoring of the performance of packet streams in communication networks. In this example, a packet stream can be (i) all packets within a flow in an Internet Protocol (IP) network (identified by, for example, the source and destination addresses, port numbers and protocol fields in the IP packets); or (ii) all packets transferred on a communication link during a specified time interval.
If a data stream is present at multiple physical or logical observation nodes (for example, a packet stream passes through a number of different nodes in a communication network), then it may be required to correlate the selected samples among these nodes. It can be necessary to keep track of the same data stream elements across multiple observation nodes in order to calculate certain statistics for the sampled records. In a communication network, these statistics can include the delay and loss ratio of the data packets.
One conventional sampling technique, described in “Random Sampling with a Reservoir”, ACM Transactions on Mathematical Software, 11(1), March 1985, 37-57, uses a fixed-size randomized sample set (a reservoir). This provides an upper bound on the number of samples selected from the data stream, which limits the memory and processing power required for the analysis of the samples, while making sure that the samples selected are statistically representative. However, the problem with this approach is that the sampling process is completely random: each observation node makes its selections independently, so different nodes generally retain different elements, making the correlation of samples at different observation nodes impossible.
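For illustration, the basic reservoir idea can be sketched as follows. This is a minimal sketch of the simple scheme commonly known as Algorithm R, not necessarily the exact algorithm of the cited paper; the function name `reservoir_sample` and its parameters are assumptions introduced here.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k elements from a stream
    whose length is not known in advance (hypothetical helper)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k elements.
            reservoir.append(item)
        else:
            # Element i survives with probability k / (i + 1);
            # randint is inclusive on both ends.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# Two observation nodes seeing the same stream, each with its own
# independent random source, will generally keep different elements.
node_a = reservoir_sample(range(1000), 10, random.Random(1))
node_b = reservoir_sample(range(1000), 10, random.Random(2))
```

The memory used never exceeds `k` elements, which is the upper bound described above, but nothing ties the choices made at `node_a` to those made at `node_b`.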
An alternative sampling technique (described in U.S. Pat. No. 6,873,600) focuses on packet stream sampling in a communication network and uses hashes for consistent packet selection. Hash functions are well known in the art as functions which take a variable-sized input and return a fixed-size output, for example a single integer. However, this technique cannot place an upper bound on the number of samples taken while also making sure that the set of samples is statistically representative (i.e. that the samples are not all selected from an initial portion of the data stream), which makes the amount of sampled data hard to control.
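A generic form of hash-based consistent selection might look like the sketch below. It is not taken from the cited patent: the use of SHA-256, the threshold comparison, and the names `is_selected` and `sample_at_node` are all assumptions made for illustration.

```python
import hashlib

def is_selected(invariant_bytes: bytes, fraction: float) -> bool:
    """Select a packet if the hash of its invariant content falls below
    a threshold (hypothetical helper; SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(invariant_bytes).digest()
    value = int.from_bytes(digest[:8], "big")  # 64-bit integer from the hash
    return value < int(fraction * 2**64)

def sample_at_node(packets, fraction):
    # Every node applies the same deterministic rule to the same bytes.
    return [p for p in packets if is_selected(p, fraction)]

# Made-up packet contents standing in for fields that do not change in transit.
packets = [f"10.0.0.1->10.0.0.2:{seq}".encode() for seq in range(1000)]
node_a = sample_at_node(packets, 0.05)
node_b = sample_at_node(packets, 0.05)
```

Because the decision depends only on the packet content, `node_a` and `node_b` contain exactly the same packets. The number of selected packets, however, grows in proportion to the length of the stream, so it has no fixed upper bound.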
Therefore, there is a need for a method and apparatus for data stream sampling that can be used in a communication network and that overcome the disadvantages of the known sampling techniques.