This invention generally relates to information processing and management of data. Particularly, the invention relates to efficient searches of data sets using metric spaces, multigrid indexes, and B-grid trees, and, more particularly, to find xe2x80x9chitsxe2x80x9d for a query of large data sets in a metric space.
Advances in computing technologies significantly impact our lives. From simple scheduling to advanced whole genomic analysis, computing and computing applications are at the center of these activities. Among the many facets of computing, data search and management are the most employed and, what some may consider, to be the most critical functions performed. Data search provides quick access to desired data that may be buried in large data sets. As such, search features are extensively employed in data intensive processing applications such as DNA or protein sequence comparisons. A key factor of data searches is efficiency, both time efficiency and computational efficiency.
The search to find an entry in a data set of size N entries, having no specific order, may be extremely time intensive depending on the search system or method that is employed. For example, a simple search on a data set of a million entries without any specific order will generally require one million comparisons. That is, each data element of the data set is processed during the search to provide hits that contain the desired entry. A hit to a query is generally defined as those points that meet the specified neighborhood criteria in the metric space. In the context of efficiency, the search complexity required for a search may be quantified by the order function O( ). Accordingly, the search complexity for this example can be characterized as O(N). Advances in computer/information science, however, have reduced the search complexity to O(log N). Using these advances, the one million comparisons of the above example can be reduced to 20 comparisons in a binary search.
The reduction in search complexity is attributed to the performance of pre-processing. Data pre-processing involves the creation of data indexes prior to search such that the search is performed on the created indexes instead of each data element of the data set. The data can be pre-processed according to a xe2x80x9csortxe2x80x9d process, or by developing balanced multiway trees (B-trees). Although the pre-processing of data contributes to an one-time increase in computational complexity of O(NlogN) (i.e. the number of steps required in preparing the data before performing the search), the search complexity is significantly decreased. It is reduced from O(N) to O(log N).
Algorithms to perform the aforementioned pre-processing are known in the art and are described by M. A. Weiss, Data Strucutre and Algorithm Analysis In C++ (Addison-Wesely Publishing Co., 2 edition, 1998); R Sedgewick, Algorithms in C++, (Addison-Wesley Publishing Co., 1992); R. F. Gilberg, B. A. Forouzan, Data Structures, A Pseudocode Approach with C (PWS Publishing Co. 1998); and R. Bayer and E. McCreight, Organization and Maintenance of Large Ordered Indexes, Acta Informatica, Vol. 1, No. 3, pp. 173-189, 1972 which are herein incorporated by reference. Current pre-processing algorithms (e.g. sorting, B-Tree) may be performed in one dimensional data space or in multidimensional space.
As an alternative to sorting in a metric space is to perform k-clustering of the data. A k-clustering of a set N is a partition of the set into k subsets which optimizes some clustering criterion. K-clustering aims to solve the problem that given a weighted graph G=(X, d) on N vertices, where d( . , . ) is the weight function, partition X into k sets S1, . . . , Sk, such that the value of:
xe2x80x83xcexa3ixcexa3{u, v in Si}d(u, v)
is minimized. Some of the algorithms developed are described by S. Sahni, T. Gonzalez, P-Complete Approximation Problems, JACM (23), 1976, pp555-566; N. Garg, V. V. Vazirani, M. Yannakakis, Approximate Max-Flow Min-(Multi)-Cut Theorems and Their Applications. SICOMP (25), 1995, pp235-251; W. F. de la Vega, C. Kenyon, A Randomized Approximation Scheme for Metric MAX_CUT. 1998. FOCS""98, pp468-471; and Piotr Indyk, A Sublinear Time Approximation Scheme For Clustering In Metric Spaces. 1999. FOCS""99, pp154-159 which are herein incorporated by reference. However, k-clustering has not been used to build searchable trees in metric space. As such, a particularly useful data preprocessing method is not being exploited to its fullest potential.
Current sorting algorithms, such as, binary sort, merge sort, or quick sort, arranges the data according to a particular structure (i.e. tree, linked list, or array). By themselves, these algorithms generally provide a fast way of locating an exact item in a data set, however, they do not provide a method of locating xe2x80x9capproximatexe2x80x9d or xe2x80x9csimilarxe2x80x9d elements to a search query. Several solutions have been developed to accommodate for this drawback. These solutions include hardware solutions to increase the computational efficiency and thereby reduce the search time, and software solutions that employ additional algorithms to the sorting algorithms to quickly xe2x80x9cscreenxe2x80x9d out dissimilar elements. These solutions are described in S. Rosenthan. The PF474 Chip, A Coprocessor for String Comparison. Byte, 1984; U.S. Pat. No. 4,490,811, entitled xe2x80x9cString Comparator Device System Circuit and Methodxe2x80x9d; and Patrick A. V. Hall and Geoff R. Dowling, Approximate String Matching, ACM Computing Surveys, Vol. 12, No. 4, pp381-402, December, 1980; U.S. Pat. No. 5,978,797, entitled xe2x80x9cMultistage Intelligent String Comparison Methodxe2x80x9d which are herein incorporated by reference. However, these solutions can be error prone as certain relevant data elements may be overlooked.
From the foregoing, it can be appreciated that there exists a need for systems and methods to search data sets using metric space that provide exact and more relevant xe2x80x9capproximatexe2x80x9d search results such that search complexity is significantly reduced. By having these systems and methods the drawbacks of the prior art may be overcome.
The present invention provides systems and methods to search data sets using a metric-space theoretical approach to solve xe2x80x9capproximatexe2x80x9d or xe2x80x9cinexactxe2x80x9d searching problems. In an illustrative implementation, traditional xe2x80x9csortingxe2x80x9d processes performed in 1-dimensional space are extended to general metric spaces. In this implementation, a multigrid search tree is generated by calculating distance functions of the data in metric space for the data set desired to be searched. The multigrid tree provides an index for each data element of the data set, such that a submitted search query is compared against the multigrid tree to find exact or xe2x80x9chitxe2x80x9d matches, where a xe2x80x9chitxe2x80x9d may be defined as an inexact, but close enough match based on some criteria. In an application of the method to genetic sequence search problem, outputs from BLAST and FASTA may be used to calculate local distances between data elements, which can be used to create local multigrid trees to provide very efficient search algorithms to process submitted search queries. In this implementation, the multigrid tree may comprise a balanced multigrid tree (B-grid) such that the data elements of the data set being searched are partitioned in equal size grids ensuring more homogeneous search time thereby reducing computational complexity.
Other aspects of the present invention are described below.