1. Field
Embodiments relate to hashing techniques for determining similarity between data sets.
2. Background Discussion
Researchers working in domains as diverse as engineering, astronomy, biology, remote sensing, economics, and consumer transactions regularly face ever larger numbers of observations and high-dimensional data sets. High-dimensional data sets result mostly from an increase in the number of variables associated with each observation or data element.
High-dimensional data sets present many mathematical challenges. One such challenge is that, in many cases, not all variables stored with a high-dimensional data set are important for understanding an underlying phenomenon. Thus, in many applications it is of interest to reduce the dimensionality of the original data prior to any modeling of the data.
Furthermore, as data sets become larger and more highly multi-dimensional, it becomes increasingly important to represent and retrieve data from them in an efficient manner. To identify similar elements across data sets, ‘nearest neighbor’ algorithms can be used. Nearest neighbor determination schemes, such as locality sensitive hashing (LSH), have been proposed with the goal of approximating a similarity distance metric.
However, conventional nearest neighbor determination schemes are time-consuming and require a considerable amount of storage space. As a result, data retrieval and similarity detection techniques may not be efficient and can suffer from degradation in performance.
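By way of illustration only, one well-known form of locality sensitive hashing, random hyperplane hashing, can be sketched as follows. The sketch is not drawn from any particular conventional scheme referred to above; the function names `make_hyperplanes`, `lsh_signature`, and `hamming_distance`, the use of Gaussian random hyperplanes, and the choice of 16 bits are all assumptions made for the example. Each data element is reduced to a short bit signature, and the number of differing bits between two signatures serves as a rough proxy for the angular distance between the original elements.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplane normals in a dim-dimensional space.

    Hypothetical helper: Gaussian components and a fixed seed are
    assumptions of this sketch.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def lsh_signature(vector, hyperplanes):
    """Return one bit per hyperplane: 1 if the vector lies on the
    non-negative side of the hyperplane, 0 otherwise."""
    bits = []
    for plane in hyperplanes:
        dot = sum(p * x for p, x in zip(plane, vector))
        bits.append(1 if dot >= 0 else 0)
    return tuple(bits)


def hamming_distance(sig_a, sig_b):
    """Count the positions at which two signatures differ."""
    return sum(a != b for a, b in zip(sig_a, sig_b))


# Example usage with illustrative three-dimensional data elements.
planes = make_hyperplanes(num_bits=16, dim=3)
element = [1.0, 2.0, 3.0]
scaled = [2.0, 4.0, 6.0]        # same direction as element
opposite = [-1.0, -2.0, -3.0]   # opposite direction to element

sig_element = lsh_signature(element, planes)
sig_scaled = lsh_signature(scaled, planes)
sig_opposite = lsh_signature(opposite, planes)
```

In this sketch, elements pointing in the same direction receive identical signatures, while elements pointing in opposite directions differ in essentially every bit. Storing one signature per element and comparing signatures against a query also hints at the time and storage costs noted above, since accuracy generally improves only as more bits or more hash tables are used.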