1. Field of the Invention
The embodiments of the invention generally relate to sorting data streams and, more particularly, to a method of obtaining uniform data samples from selected intervals in a data stream and of estimating the distance from monotonicity (i.e., sortedness) of the data stream based on the samples.
2. Description of the Related Art
In applications that access data in a streaming fashion (e.g., large data sets and internet packet routing), it is often desirable to estimate the sortedness of the data stream; however, it is difficult to estimate sortedness without actually sorting the data stream.
More particularly, a sequence σ of length n over an alphabet Σ={1, . . . , m} is said to be monotone (or in increasing sorted order) if:
σ(1) ≤ σ(2) ≤ . . . ≤ σ(n)
The distance from monotonicity of a sequence σ, denoted by Ed(σ), is the minimum number of edit operations needed to make it monotone. A single edit operation consists of deleting a character and inserting it in a new position. If m=n and σ consists of n distinct characters, then Ed(σ) corresponds to the so-called Ulam distance between σ and the identity permutation. If we think of σ as a deck of cards, then this is the minimum number of moves needed to sort the deck. Thus, it is a natural measure of the degree of sortedness of a sequence.
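For illustration only, the definition of Ed(σ) can be made concrete with a short offline computation. The sketch below relies on the standard identity that, when an edit moves one character to a new position, Ed(σ) equals n minus the length of a longest non-decreasing subsequence of σ: the characters of such a subsequence stay in place and every other character is moved once. The function names are hypothetical and do not appear in this description. The sketch stores the whole sequence in memory, so it is not a one-pass streaming estimator and is not the sampling method to which the embodiments relate.

```python
from bisect import bisect_right


def longest_nondecreasing_subsequence_length(seq):
    """Length of a longest non-decreasing subsequence, in O(n log n) time."""
    # tails[k] holds the smallest possible last value of a non-decreasing
    # subsequence of length k + 1 found so far.
    tails = []
    for x in seq:
        # bisect_right lets equal values extend a subsequence, which gives
        # "non-decreasing" rather than "strictly increasing" behaviour.
        i = bisect_right(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)


def distance_from_monotonicity(seq):
    """Exact Ed(seq): the minimum number of delete-and-reinsert moves."""
    seq = list(seq)
    return len(seq) - longest_nondecreasing_subsequence_length(seq)


if __name__ == "__main__":
    # Moving the single out-of-place character 4 to the end sorts the sequence.
    print(distance_from_monotonicity([4, 1, 2, 3]))
```

In the deck-of-cards picture, the cards of a longest non-decreasing subsequence are left untouched and each remaining card is moved exactly once.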
An estimation of sortedness may be useful in data streams corresponding to network routing. For example, a router is a computer networking device that forwards data packets across a network toward their destinations. In several protocols, including the Internet Protocol (IP), a packet flow (i.e., a sequence of packets that is sent from a single source to a single destination) is not guaranteed to maintain its order. That is, the packets are not guaranteed to arrive in the order in which they were sent. Typically, packets that arrive out of order indicate that the path used to route the flow is suboptimal. For example, the flow may be routed using multiple paths in the network, and if one of these paths is significantly more congested than the others, packets using this path will be routed much more slowly than the other packets. Typically, the sender annotates the packets in a flow with increasing numeric identifiers and, therefore, the destination node (and also routers along the way) can estimate the quality of the current routing policy by measuring the sortedness of the received packets.
An estimation of sortedness may also be useful when comparing very long rankings (i.e., ordered lists of distinct items). For example, a ranking may describe all the pages on the web ordered by some score function, and there may be a need to compare today's ranking with that of yesterday. In this case, one of the rankings plays the role of the increasing sequence, and the crucial issue is to be able to determine the order of two items according to that first ranking. Clearly, if the first ranking is assumed to be increasing, then an estimation of sortedness can be used immediately to compare the second ranking with the first ranking. Otherwise, determining the order of two items under the first ranking requires some computation, which may be provided by a suitable service (or a server). Even though the ranking is very large, accessing this service may be very fast if it is implemented using a large-scale distributed system with many servers (which is a common infrastructure for web search engines).
However, as mentioned above, estimating sortedness in a feasible manner (i.e., without having to scan the data stream more than once and/or without having to actually sort the entire sequence) is difficult. Therefore, there is a need in the art for a method that requires only one pass over a sequence of data elements, adequately samples data elements from the sequence during that one pass and estimates how close the sequence is to being sorted based on the samples.