The present invention generally relates to data streams, and more particularly relates to measuring distance between data in a data stream.
Recent years have witnessed an explosive growth in the amount of available data. Data stream algorithms have become a quintessential tool for analyzing such data. These algorithms have found diverse applications, such as large scale data processing and data warehousing, machine learning, network monitoring, and sensor networks and compressed sensing. A key ingredient in all these applications is a distance measure between data. In nearest neighbor applications, a database of points is compared to a query point to find the nearest match. In clustering, classification, and kernels, e.g., those used for support vector machines (SVM), given a matrix of points, all pairwise distances between the points are computed. In network traffic analysis and denial of service detection, global flow statistics computed using Net-Flow software are compared at different times via a distance metric. Seemingly unrelated applications, such as the ability to sample an item in a tabular database proportional to its weight, i.e., to sample from the forward distribution, or to sample from the output of a SQL Join, require a distance estimation primitive for proper functionality.
One of the most robust measures of distance is the l1-distance (rectilinear distance), also known as the Manhattan or taxicab distance. This distance is robust chiefly because it is less sensitive to outliers. Given vectors x, y ∈ ℝ^n, the l1-distance is defined as
$$\|x - y\|_1 \stackrel{\text{def}}{=} \sum_{i=1}^{n} |x_i - y_i|.$$
This measure, which also equals twice the total variation distance, is often used in statistical applications for comparing empirical distributions, for which it is more meaningful and natural than Euclidean distance. The l1-distance also has a natural interpretation for comparing multisets, whereas Euclidean distance does not. Other applications of l1 include clustering, regression (with applications to time sequences), Internet-traffic monitoring, and similarity search. In the context of certain nearest-neighbor search problems, “the Manhattan distance metric is consistently more preferable than the Euclidean distance metric for high dimensional data mining applications”. The l1-distance may also support faster indexing for similarity search.
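As an illustration of the definition above, the following is a minimal sketch of an exact (non-streaming) l1-distance computation, together with the total variation distance obtained by halving it. The function names `l1_distance`, `l2_distance`, and `total_variation` are hypothetical and chosen only for this example; the sketch assumes both vectors are available in full in memory, which is not the data stream setting discussed here.

```python
from math import sqrt


def l1_distance(x, y):
    """Exact l1 (Manhattan) distance: sum of |x_i - y_i| over all coordinates."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y))


def l2_distance(x, y):
    """Exact Euclidean distance, included only for comparison with l1."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def total_variation(p, q):
    """Total variation distance between two distributions on the same support.

    Uses the relation stated in the text: the l1-distance equals twice the
    total variation distance.
    """
    return 0.5 * l1_distance(p, q)


if __name__ == "__main__":
    x = [1, 2, 3]
    y = [2, 0, 3]
    print(l1_distance(x, y))  # |1-2| + |2-0| + |3-3| = 3

    # Two empirical distributions over the same two outcomes.
    p = [0.5, 0.5]
    q = [0.25, 0.75]
    print(total_variation(p, q))  # half of the l1-distance 0.5, i.e. 0.25
```

Because each coordinate contributes its absolute difference rather than its squared difference, a single large deviation grows the l1-distance only linearly, which is one way to read the robustness claim above.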
Another application is estimating cascaded norms of a tabular database, i.e., the lp norm on a list of attributes of a record is first computed, and then these values are summed over all records. This problem is known as l1(lp) estimation. An example application is in the processing of financial data. In a stock market, changes in stock prices are recorded continuously using a quantity r_log known as the logarithmic return on investment. To compute the average historical volatility of the stock market from the data, the data is segmented by stock, the variance of the r_log values is computed for each stock, and then these variances are averaged over all stocks. This corresponds to an l1(l2) computation (normalized by a constant). As a subroutine for computing l1(l2), the best known algorithms use a routine for l1-estimation.
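The cascaded computation described above can be sketched as follows. This is an exact, in-memory illustration only, not the streaming estimator that the text refers to; the names `lp_norm`, `cascaded_l1_lp`, and `average_volatility` are hypothetical, and the use of the population variance (rather than the sample variance) for the per-stock step is an assumption made for the example.

```python
from statistics import pvariance


def lp_norm(values, p):
    """lp norm of one record: (sum of |v|^p) raised to the power 1/p."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


def cascaded_l1_lp(table, p):
    """l1(lp) of a table: compute the lp norm of each record, then sum them."""
    return sum(lp_norm(row, p) for row in table)


def average_volatility(rlog_by_stock):
    """Average of per-stock variances of logarithmic returns.

    rlog_by_stock maps a stock identifier to its list of r_log values.
    Assumption: population variance is used for each stock.
    """
    variances = [pvariance(values) for values in rlog_by_stock.values()]
    return sum(variances) / len(variances)


if __name__ == "__main__":
    # Each row is one record; its attributes are combined with the l2 norm.
    table = [[3, 4], [6, 8]]
    print(cascaded_l1_lp(table, 2))  # per-record norms 5 and 10, summed

    # Made-up r_log values for two hypothetical stocks.
    rlog = {"A": [1, 3], "B": [2, 2]}
    print(average_volatility(rlog))  # variances 1 and 0, averaged
```

The two-level structure (an inner aggregate per record or per stock, followed by an outer sum or average) is what makes the problem a cascaded norm.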