This invention was made with U.S. Government support under contract no. NCC5-305 awarded by the National Aeronautics and Space Administration (NASA). The U.S. Government may have certain rights in this invention.
1. Field of the Invention
The present invention relates to database systems and, more particularly, to constructing, maintaining, and utilizing a multidimensional indexing structure to answer linear optimization queries to a database containing records with numerical attributes.
2. Background of the Invention
A linear optimization query is a special type of database query which returns the database records whose weighted linear combination of numerical attributes ranks among the top N in the entire database, either maximally or minimally. Equivalently, a linear optimization query may be posed as the problem of finding data records whose weighted values are above or below a threshold. Out of the returned results, the top N records are then selected. While such a query may request the maximal or minimal N records based on a specific linear optimization criterion, the query processing algorithm does not require separate procedures for the two optimization conditions. This is because simply reversing the signs of the weights in the linear equation translates a maximization problem into a minimization one, and vice versa. The present invention processes optimization queries in a similar way by translating them into maximization queries first.
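The sign-reversal argument can be illustrated with a brief sketch. It assumes records are stored as plain tuples of numbers and uses a full scan rather than any index; the function names `score`, `top_n_max`, and `top_n_min` are hypothetical and chosen only for illustration.

```python
import heapq

def score(record, weights):
    """Weighted linear combination of a record's numerical attributes."""
    return sum(w * x for w, x in zip(weights, record))

def top_n_max(records, weights, n):
    """Return the n records with the largest weighted score (full scan)."""
    return heapq.nlargest(n, records, key=lambda r: score(r, weights))

def top_n_min(records, weights, n):
    """Answer a minimization query by negating the weights and maximizing."""
    negated = [-w for w in weights]
    return top_n_max(records, negated, n)

# Illustrative data (hypothetical): four records with two attributes each.
records = [(1.0, 2.0), (3.0, 0.5), (0.0, 4.0), (2.0, 2.0)]
weights = (1.0, 2.0)

largest = top_n_max(records, weights, 2)   # scores 8.0 and 6.0
smallest = top_n_min(records, weights, 2)  # scores 4.0 and 5.0
```

Because only the weights change, a single maximization routine may serve both kinds of query.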
Depending on the application scenario, the weights (coefficients) of the linear criterion may or may not be known at data ingestion time. If they were known at data ingestion time and remained constant, the weighted linear combination could be pre-computed and stored to answer future queries. In many cases, however, the coefficients are dynamic and the same set of data records is shared by different applications. Pre-computing for all applications thus may not be feasible. An emphasis of the present invention is on the dynamic cases where the coefficients are unknown in advance and are determined at query time. A goal of this invention is to index the records efficiently so that, when a new query is issued, only a fraction of the records in the database need to be evaluated to satisfy the query. Although its query response time may not be as fast as that of a static query, our invention narrows the performance gap between the two.
The linear optimization query is a general case of linearly weighted ranking, which is widely applied in information retrieval and summarization. Instead of presenting a long table with all surveyed parameters of every record, useful information is often summarized by taking a linearly weighted combination of those parameters. The top N records are then listed and discussed. Examples of such information summarization can be found in many places. For example, every year, the news magazine US News and World Report conducts studies of college education and ranks school performance by a linear weighting of numerical factors such as academic reputation (25%), faculty resources (20%), retention rate (20%), student selectivity (15%), financial resources (10%), alumni giving (5%), and graduation rate performance (5%). Top-ranking national and regional colleges are listed. One can find many similar examples such as cities with the highest cost of living, towns with the highest crime rate, the five hundred largest global companies, and so on. While all these examples are based on linearly weighted ranking, the coefficients assigned to the linear criterion are mostly static. The allocation of linear weighting may reflect the opinion of information collectors such as news agencies or consumer opinion groups. However, information subscribers like magazine readers do not actively participate in the information summarization process. We argue that information subscribers should be active participants in the information retrieval and summarization process. In the above examples, linear weighting and record ranking can be performed at the request of readers and subscribers, perhaps through a personalized web page. College applicants should be able to choose a set of coefficients that reflects their own valuation of a school. City residents should be able to decide which cost-of-living indices appear in the ranking criterion according to their own lifestyles. One formula does not apply to all people.
Dynamic information summarization in the form of adjusting weights of the linear criterion has been practiced in many business and scientific applications. For example, mortgage companies and banks develop linear models to estimate consumers' credit scores, probabilities of mortgage repayment, default risk, etc. These models are often built on a common set of parameters such as loan-to-value ratio, length of credit history, revolving credit, credit utilization, debt burden and credit behavior. From this set of parameters, models for financial products may be developed. In the area of public health and environmental science, scientists extract parameters from satellite images, digital elevation maps, and weather data to model disease outbreak probabilities, rodent population, air pollution, etc. As an example, a group of researchers from Johns Hopkins University, A. Das, S. R. Lele, G. E. Glass, T. Shields, and J. A. Patz, developed a model of the distribution of the population of Lyme disease vectors in Maryland from Geographical Information System (GIS) digital images (see "Spatial modeling of vector abundance using generalized linear mixed models: application to Lyme disease," submitted to Biometrics for publication). Their models are frequently revised by applying different statistical analysis techniques and training data sets. In addition, scientists like to adjust their models to ask "what if" questions. A speedy and accurate response from the database would greatly assist model development and verification.
The study of multidimensional indexing structures has been a major subject in database research. Indexing structures have been developed to answer different types of queries, including:
1. find record(s) with specified values of the indexed columns (exact query);
2. find record(s) that are within [a1 . . . a2], [b1 . . . b2], . . . ,[z1 . . . z2] where a, b, and z represent different dimensions (range query);
3. find the K most similar records to a user-specified template or example (K-nearest neighbor query); and
4. find the top N records to a user-specified linear optimization criterion (linear optimization query).
Substantial work can be found to address the first three types of queries, while much less is available in the prior art about the fourth one. In the prior art, a linear optimization query often refers to the problem of finding a single data entry which maximizes or minimizes the given linear criterion, with the assumption that the constraints are given in the form of linear inequalities. In such cases, the feasible solution space is the intersection of half-spaces defined by those linear inequalities. When both the query and constraints are given at query time, the query processing problem is a linear programming problem. Solutions such as the simplex method and the ellipsoid method have been well studied, and references can be found in most linear programming textbooks. In addition, recent discoveries in randomized algorithms suggest possible ways to reduce expected query response time. Seidel reported that the expected time is proportional to the number of constraints (R. Seidel, "Linear programming and convex hulls made easy," Proceedings of the 6th ACM Symposium on Computational Geometry, pp. 211-215, 1990). When the constraints are given ahead of time to enable the preprocessing of records, query response can be made faster by trading off storage space. Matousek reported a data structure that is based on a simplicial partition tree, while parametric search is applied to prune the partition tree (J. Matousek and O. Schwarzkopf, "Linear optimization queries," Proceedings of the 8th ACM Symposium on Computational Geometry, pp. 16-25, 1992). Matousek provided complexity estimates on preprocessing time, storage space, and query response time. His work, however, does not suggest any direct extension to answer top-N linear optimization queries. Chan applied the same data structure, with randomized algorithms applied for tree pruning (T. M. Chan, "Fixed-dimensional linear programming queries made easy," Proceedings of the 12th ACM Symposium on Computational Geometry, pp. 284-290, 1996).
It is possible to apply data structures for linear constraint queries and post-process the outputs. The query processor does not search for the top-N records directly. Instead, it retrieves all records that are greater than a threshold. These records are then sorted to find the top-N answers. Studies in linear constraint queries tend to rely on spatial data structures such as R-tree and k-d-B tree. Algorithms are developed to prune the spatial partition tree to improve response speed. Examples of such studies can be found in the paper by J. Goldstein, R. Ramakrishnan, U. Shaft, and J. Yu, "Processing queries by linear constraints," Proceedings of ACM PODS, pp. 257-267, 1997 and the paper by P. K. Agarwal, L. Arge, J. Erickson, P. G. Franciosa, and J. S. Vitter, "Efficient searching with linear constraints," Proceedings of ACM PODS, pp. 169-177, 1998.
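The two-step approach described above (retrieve by threshold, then sort) may be sketched as follows. This is only an illustration under stated assumptions: a linear scan stands in for the spatial index (R-tree or k-d-B tree), the threshold is supplied by the caller, and the name `top_n_by_threshold` is hypothetical.

```python
def top_n_by_threshold(records, weights, threshold, n):
    """Linear constraint query followed by a post-processing sort.

    Step 1 keeps every record whose weighted score exceeds the threshold
    (a real system would prune with a spatial index instead of scanning).
    Step 2 sorts the survivors and keeps the best n.
    """
    scored = [(sum(w * x for w, x in zip(weights, r)), r) for r in records]
    candidates = [(s, r) for s, r in scored if s > threshold]
    candidates.sort(key=lambda pair: pair[0], reverse=True)  # post-processing
    return [r for _, r in candidates[:n]]

# Illustrative data (hypothetical).
records = [(1.0, 2.0), (3.0, 0.5), (0.0, 4.0), (2.0, 2.0)]
answer = top_n_by_threshold(records, (1.0, 2.0), threshold=4.5, n=2)
```

The sketch makes the drawback visible: the caller must guess a threshold, and a threshold that is too high returns fewer than N records, while one that is too low retrieves and sorts more records than needed.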
As will be evident, there are several major differences between the present invention and the prior art. First, for example, the invention applies a different indexing structure which solely depends on the geometric distribution of data records. Scaling, rotating, or shifting their attribute values has no effect on the indexed results while these operations significantly change those traditional indexing structures. Second, for example, the invention does not require a post-processing step to sort the output values while linear constraint queries do. Outputs are guaranteed to be returned in the order desired, which enables a form of "progressive" retrieval. Third, for example, this invention enables a simple hierarchical organization of index to accommodate both global and localized queries. A database record typically contains both categorical and numerical attributes. A localized query is issued to search records from a single or multiple categories. On the other hand, a global query is issued to search records in the whole database. A solution to index the whole database must address both needs efficiently and avoid redundant storage. Our invention provides such a solution.
The present invention is directed to methods and apparatus for constructing, maintaining and utilizing a multidimensional indexing structure for processing linear optimization queries. The present invention enables fast query processing and has minimal storage overhead. Experimental results show a significant improvement in query response time. For example, a speed-up of two orders of magnitude over a linear database scan has been demonstrated in retrieving the top 100 records out of a million.
As noted above, the coefficients of a linear equation are given at query time, which prevents a database from pre-computing and storing the answer. An indexing structure should therefore be flexible enough to localize the fraction of the database which contains the relevant data records. The present invention provides such an indexing structure, which reduces query response time by selectively evaluating some of the data records rather than all of the records in the database.
In one aspect, the invention discloses layered convex hulls as the fundamental building block of this multidimensional indexing structure. We present algorithms that are used to construct, maintain, and utilize a layered convex hull to process queries. In addition, we disclose a hierarchical structure of layered convex hulls, which is built upon multiple convex hulls by selectively grouping them into a hierarchy. This hierarchical structure provides an efficient and scalable solution to both global and localized queries.
In this invention, a layered convex hull is constructed by dividing database records into multiple layers wherein at least a portion of an inner layer (preferably, the entire inner layer) is geometrically contained by (i.e., inside of) a preceding outer layer. That is, each of the layers represents a convex hull of all the records from the current layer inward. It is to be appreciated that while a preferred method of construction is to create layers from the outer layer inward, it is contemplated that one of ordinary skill in the art can create layers from the inner layer outward. The fundamental theorem of linear programming guarantees, based on a basic property of a convex hull, that the linear maximum and minimum of a set of points are always attained at points on their convex hull. In a layered convex hull, every record belongs to a layer. The query processing of linear optimization evaluates records layer-by-layer until the requested number of records is returned. Records returned by the algorithm disclosed in this invention are ordered by the given linear criterion; therefore, the query processing may be stopped at any point, and no further operations are needed to sort the returned results.
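A minimal two-dimensional sketch of the outer-to-inner construction follows. It is an illustration only and not the disclosed algorithm: it assumes distinct 2-D points, swaps in Andrew's monotone chain as the hull routine, and answers a top-N query conservatively by evaluating the first N layers, relying on the property that the k-th ranked record lies within the first k layers. All function names are hypothetical.

```python
import heapq

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def build_layers(points):
    """Peel hulls from the outside inward; every point lands in exactly one layer."""
    remaining = set(points)
    layers = []
    while remaining:
        hull = convex_hull(remaining)
        layers.append(hull)
        remaining.difference_update(hull)
    return layers

def top_n(layers, weights, n):
    """Evaluate only the records in the first n layers (a conservative bound)."""
    candidates = [p for layer in layers[:n] for p in layer]
    return heapq.nlargest(
        n, candidates, key=lambda p: weights[0] * p[0] + weights[1] * p[1])

# Illustrative data (hypothetical): two nested squares and a center point.
points = [(0, 0), (4, 0), (4, 4), (0, 4),
          (1, 1), (3, 1), (3, 3), (1, 3),
          (2, 2)]
layers = build_layers(points)        # three layers
best = top_n(layers, (2, 1), 1)      # touches only the outermost layer
best3 = top_n(layers, (2, 1), 3)     # scores 12, 9, 8
```

With weights (2, 1), the second-ranked record (3, 3) sits on the second layer, which shows why more than the outermost layer must be consulted when N exceeds one.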
Advantageously, this invention enables a hierarchical indexing structure to accommodate both global and localized queries. Global queries apply to all of the data records in a database. Localized queries apply to some segments or categories of data records. The hierarchical structure is built upon multiple "local" layered convex hulls by extracting their outer-most layers; constructing a layered convex hull from records of these outer-most layers; and storing the new hull as the "parent" of the "local" hulls. When a new query is issued, the query processor first locates the parent hull of the record segments of interest. Layers in the parent hull are then evaluated to discover if any of its local hulls need to be evaluated. For data records exhibiting dissimilar distributions, the hierarchical indexing structure is most effective in pruning the search space and confining queries to local hulls that are most relevant. Effective pruning further shortens query response time and improves performance.
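The parent-hull construction may be sketched in two dimensions as follows. This is a partial illustration under stated assumptions: points are distinct across categories, only the outer-most layer of the parent is built, only a top-1 global query is answered, and Andrew's monotone chain is swapped in as the hull routine. The names `build_parent` and `global_top1` and the category labels are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain over distinct 2-D points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def build_parent(local_groups):
    """Build the parent's outer layer from the outer-most layers of local hulls.

    local_groups maps a category label to its list of 2-D points.
    Returns the parent's outer layer and a point -> category lookup.
    """
    owner = {}
    for category, pts in local_groups.items():
        for p in convex_hull(pts):      # outer-most layer of the local hull
            owner[p] = category
    parent_outer = convex_hull(owner.keys())
    return parent_outer, owner

def global_top1(parent_outer, owner, weights):
    """The global maximum lies on the parent's outer layer; report its category."""
    best = max(parent_outer,
               key=lambda p: weights[0] * p[0] + weights[1] * p[1])
    return best, owner[best]

# Illustrative data (hypothetical): two well-separated categories.
groups = {
    "west": [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)],
    "east": [(10, 0), (12, 0), (12, 2), (10, 2), (11, 1)],
}
parent_outer, owner = build_parent(groups)
hit_east = global_top1(parent_outer, owner, (1, 1))
hit_west = global_top1(parent_outer, owner, (-1, 1))
```

In this example the parent's outer layer holds four of the ten records, and each query identifies a single category as relevant, which suggests how dissimilar distributions allow the search to be confined to one local hull.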
In yet another aspect of the invention, methods and apparatus for storing records of layered convex hulls in a spherical shell representation are also provided.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.