Detecting similar objects is a core component of numerous computer science applications. Concrete examples include, but are not limited to, detecting similar documents in large corpora for plagiarism detection, clustering similar emails by keyword for spam detection, detecting defective genes that appear to contribute in combination to certain diseases, and collaborative filtering in recommender systems, where users are grouped according to similar interests.
Various similarity measures have been applied. See <<http://reference.wolfram.com/language/guide/DistanceAndSimilarityMeasures.html>> (accessed Jan. 29, 2015) for an overview. Some similarity measures, such as Hamming distance, Jaccard similarity and Dice similarity, assume binary data as input. For many problems, however, this assumption is not justified and one needs to handle weighted features. Arguably, the three most widely used similarity measures for weighted data are Euclidean distance, cosine similarity and Pearson correlation.
For certain applications like recommender systems and genetic data mining, the established similarity measures are cosine similarity and Pearson correlation. See Michael D. Ekstrand, John T. Riedl and Joseph A Konstan, Collaborative Filtering Recommender Systems, Foundations and Trends in Human-Computer Interaction, Vol. 4, No. 2 (2010) as an example.
Formally, the cosine similarity between two objects x and y is defined as
$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\|x\| \, \|y\|}$$
where $x_i$ denotes the i-th feature of object x and $\|x\| = \sqrt{\sum_{i=1}^{n} x_i^2}$ is the 2-norm of the vector x.
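By way of illustration only, the cosine definition above can be written out directly for vectors that fit in main memory. The function name `cosine_similarity` and the use of plain Python lists are assumptions made for this example; they do not appear in the source.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length dense vectors (hypothetical helper)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    # Numerator: sum over i of x_i * y_i.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    # Denominator: product of the 2-norms of x and y.
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)
```

Orthogonal vectors such as `[1, 0]` and `[0, 1]` give a value of 0, while vectors pointing in the same direction give a value of 1 up to floating-point rounding.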
Pearson correlation is defined as
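$$\rho(x, y) = \frac{\sum_{i=1}^{n} (x_i - \tilde{x})(y_i - \tilde{y})}{\|\underline{x}\| \, \|\underline{y}\|}$$
where $\tilde{x} = (\sum_{i=1}^{n} x_i)/n$ is the mean of the features of x and $\|\underline{x}\| = \sqrt{\sum_{i=1}^{n} (x_i - \tilde{x})^2}$ is the 2-norm of the mean-centered vector; $\tilde{y}$ and $\|\underline{y}\|$ are defined analogously.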
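As a minimal sketch of the Pearson definition above, again assuming that both vectors fit in main memory, each vector can be mean-centered and the cosine formula applied to the centered vectors. The function name `pearson_correlation` is an assumption made for this example.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation of two equal-length dense vectors (hypothetical helper)."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("vectors must be non-empty and of equal length")
    # Feature means of x and y.
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Mean-centered copies of x and y.
    cx = [xi - mean_x for xi in x]
    cy = [yi - mean_y for yi in y]
    numerator = sum(a * b for a, b in zip(cx, cy))
    # Product of the 2-norms of the centered vectors.
    denominator = math.sqrt(sum(a * a for a in cx)) * math.sqrt(sum(b * b for b in cy))
    if denominator == 0.0:
        raise ValueError("Pearson correlation is undefined for a constant vector")
    return numerator / denominator
```

Because of the centering step, adding a constant to every feature of one vector leaves the result unchanged, which is the main practical difference from cosine similarity.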
Computing the similarity between two objects under the above definitions is trivial when the objects can be stored in main memory. However, for massive datasets with high-dimensional objects, it is often impossible to store all of the objects in main memory. One therefore aims to efficiently compute compact sketches, or summaries, of the objects that yield considerable space savings.
In the following, it is assumed that objects are described by vectors, and the terms object and vector are used interchangeably. It is also assumed that an input vector is provided as a stream of (index, value) pairs in no particular order.
Previous approaches for similarity estimation include min-wise independent permutations (see Andrei Z. Broder, Moses Charikar, Alan M. Frieze, Michael Mitzenmacher, Min-Wise Independent Permutations (1998)) for Jaccard similarity and a random hyperplane algorithm (see Moses Charikar, Similarity Estimation Techniques from Rounding Algorithms (2002)) for estimating the angle between vectors revealed in a streaming fashion. The former applies only to binary data, and the latter suffers from a higher processing time per element, which makes it impractical for high-speed data streams. Count-Sketch has also been applied to inner product estimation (see Graham Cormode, Minos Garofalakis, Sketching Streams Through the Net: Distributed Approximate Query Tracking, VLDB (2005), pp. 13-24), which is closely related to cosine similarity estimation. Count-Sketch is also described in Moses Charikar, Kevin Chen, Martin Farach-Colton, Finding Frequent Items in Data Streams, Theor. Comput. Sci. 312(1) 3-15 (2004). To the best of the inventors' knowledge, no sketching technique has been proposed for Pearson correlation estimation.
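For orientation, the following is a simplified sketch of how a Count-Sketch style summary might be maintained over a stream of (index, value) pairs and used to estimate an inner product. It illustrates the prior-art idea only and is not the technique of the cited references in detail. The class name `CountSketch`, the parameters `width`, `depth` and `seed`, and the simple linear hash family are all assumptions made for this example; the hash family is chosen for brevity and may be weaker than what formal accuracy guarantees require.

```python
import random
import statistics

_PRIME = (1 << 61) - 1  # large prime modulus for the illustrative hash family


class CountSketch:
    """Illustrative Count-Sketch style summary (names and hashing are assumptions)."""

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One (a, b, c, d) tuple per row: (a, b) picks a bucket, (c, d) picks a sign.
        self.params = [tuple(rng.randrange(1, _PRIME) for _ in range(4))
                       for _ in range(depth)]
        self.table = [[0.0] * width for _ in range(depth)]

    def update(self, index, value):
        """Process one (index, value) pair; the arrival order does not matter."""
        for row, (a, b, c, d) in enumerate(self.params):
            bucket = ((a * index + b) % _PRIME) % self.width
            sign = 1 if ((c * index + d) % _PRIME) % 2 == 0 else -1
            self.table[row][bucket] += sign * value


def estimate_inner_product(sx, sy):
    """Median over rows of the bucket-wise products of two compatible sketches."""
    if sx.width != sy.width or sx.params != sy.params:
        raise ValueError("sketches must share width, depth and hash functions")
    row_estimates = [sum(u * v for u, v in zip(rx, ry))
                     for rx, ry in zip(sx.table, sy.table)]
    return statistics.median(row_estimates)


# Usage: both sketches must be built with the same width, depth and seed.
sketch_x = CountSketch(width=64, depth=5, seed=42)
sketch_y = CountSketch(width=64, depth=5, seed=42)
for index, value in [(7, 1.5), (3, 2.0), (100, -1.0)]:
    sketch_x.update(index, value)
for index, value in [(3, 4.0), (100, 0.5), (12, 3.0)]:
    sketch_y.update(index, value)
approx_dot = estimate_inner_product(sketch_x, sketch_y)
```

Each sketch occupies `width * depth` counters regardless of the dimensionality of the input vector, which is the source of the space savings. An inner product estimate of this kind relates to cosine similarity through the definition above: dividing it by the product of the two norms gives a cosine estimate.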