The present invention concerns data mining of large database and more particularly to an improved method and apparatus for answering queries based on the data contained in large databases.
A database management system (DBMS) stores data reliably and retrieves the data based on a data access language, typically SQL (Structured Query Language). One major use of database technology is to help individuals and organizations obtain fast/flexible reports on performance of a business. Another major use of DBMS is to help make decisions and generate reports based on the data contained in the database.
An important aid to the users making decisions based on data in the database is a multidimensional view of data. For example, in a payroll application, the data could be represented with three dimensions such as salary, years of employment and vested dollar amounts in a company pension plan. With this view of the data, it is easy to ask queries such as query that finds the average number of years of employment of employees who have a salary greater than 100,000 dollars. Alternatively, it might be desurable to specify a range of years of employment and ask for the total contribution in dollars for people falling in that range. The multidimensional approach facilitates the view of the data and also does not distinguish between the dimensions in which ranges can be specified and the dimensions on which sums and aggregations are performed. Further, users can specify ranges in different dimensions if they need to view data in more detail. Thus, in the above example, users could ask for the average salary of employees who have more than 100,000 dollars in vested company benefits and who have worked for the company for less than 5 years. Other applications that benefit from the multidimensional view of data include inventory control, sales and marketing, user, or customer databases.
One of the emerging applications of databases is online analytical processing (OLAP) which enables users to ask decision support queries in a more natural and efficient manner than SQL-based database systems. A commonly used approach for queries that use OLAP is the so-called data cube approach which views the data records stored in the database as points in a multidimensional space. Users ask queries using the data cube approach by specifying in the query the dimensions or record attributes of interest and ranges for those dimensions.
One commonly asked OLAP query is a count query that determines the number of data records falling within a selected range. A second common query is the aggregate query which totals the values for the records falling within a selected range. By adding or deleting a dimension to a query a user can drill down or roll up over the multi-dimensional space of the database.
Transact-SQL (T-SQL) is a superset of standard SQL and forms part of SQL server version 6.5, a product of Microsoft Corporation, assignee of the present invention. Transact SQL provides two special operators, CUBE and ROLLUP, that allow multidimensional analysis to be projected by the database server where the analysis can be optimized as part of the execution plan of the query. A discussion of the CUBE and ROLLUP operators is found in the book xe2x80x9cInside SQL Server 6.5 xe2x80x9d by Soukup, Copyright 1997, Microsoft Press, pp 326-340 which is incorporated herein by reference.
As data warehousing becomes more popular, OLAP is gaining in importance as a primary interface to evaluating data contained in the data warehouse. Most successful data mining applications include reporting systems having fast query response mechanisms. Most corporations require decision support and would benefit from improved technology to help in making decisions based upon rapidly gathered and organized data.
Specific applications include marketing aided by querying past sales data, inventory management based on querying the inventory database, hierarchical visualization of employee data and many other decision support applications. The applications of OLAP are not limited to large organizations. The concept of the data cube could be viewed as a hierarchical organization of data in general and this could be an easily understood interface that provides summarized and organized access to large volumes of data to every type of user.
One goal of a database management system is efficient support for queries on data cubes. The prior art has (a) dealt with providing exact answers to queries and (b) dealt mainly with discrete valued dimensions and (c) treated the dimension on which the aggregation or average is performed as a dimension that is distinct from the dimensions of the data cube. Because prior art techniques concentrated on providing exact answers, the amount of data that was stored was large and expensive disk accesses were required, thus making query processing expensive. Not being able to efficiently pose queries on continuous valued dimensions limits the types of applications and the scope of data cubes. Treating all dimensions symmetrically and allowing aggregation on every dimension is not possible with such prior art systems.
Being able to quickly query the data presented in the multidimensional format is crucial to the effective usage of a data cube. Typically, answers to the queries on the data cube are not required with perfect accuracy and users are willing to trade off accuracy for rapid answers to these queries.
The present invention enables the effective summarization of large amounts of data and provides rapid answers to decision support queries on data cubes. A principle feature of the invention is the recognition that data can be viewed as points in a multidimensional space and hence can be effectively summarized using clustering techniques.
An exemplary embodiment of the present invention concerns method and apparatus for querying a database containing data records stored in the database. A cluster model is formed from data records that are stored in the database. An initial cluster model having an initial probability distribution describing data records in the database is thus provided. By comparing the initial probability distribution with a representative sample of records in the database a sufficiency of this initial probability distribution is determined. The cluster model is then modified to provide an adjusted cluster model that better characterizes the data in the database. The cluster modification is performed by adjusting the cluster model to reduce discrepancies between the initial probability distribution and data sampled from the database.
Once the adjusted cluster model is provided, a sum or a count of data records from the database falling within specified ranges of the multiple dimensions is determined by integrating a functional representation based on the probability distribution of the adjusted cluster model over the ranges. In one embodiment of the invention the cluster model is adjusted by increasing the cluster number of the cluster model and reclustering at least a portion of the data in the database.
Choice of a starting point for a new cluster is performed by finding a region within an attribute space of the data records of high discrepancy between the initial cluster model and a sample of data gathered from the database. Since this region is not well modeled by initial cluster model, it serves as a candidate point for a new cluster and will result in a shifting of existing clusters and formation of a new cluster near the region of high discrepancy.
Records in the database have one or more attributes containing continuous, nondiscrete data values. Each of these cluster attributes is characterized by a Gaussian functional depiction of the cluster that can be integrated (using numerical approximation techniques) over the data space of the records.
Experience with data clustering has indicated that clustering models do not well model so called outlying data records from removed from any cluster mean. In one embodiment of the invention the functional representation excludes certain data records by identifying outlying data records as not falling within a data cluster. The process of determining a sum or a count total used in responding to queries of the database adds contributions from the outlying data records to the results of the integrating step.
The present invention can be used to find the number of clusters and good initialization points for the clusters in both discrete and continuous space. The method works by performing an intelligent search for the clusters such that the time required to both find an optimum cluster number and perform the clustering is on the order of the time required to cluster the data if the number of clusters were known beforehand. This invention is thus a means for xe2x80x9coptimallyxe2x80x9d clustering a large-scale database in the situation when the user has little or no prior knowledge about the actual number of clusters.
The invention identifies areas of multidimensional space that have a higher or lower density of points than is predicted by the current model (clusters). The current model is then further refined by growing new clusters in those areas so that the new model better fits the data. The method proceeds stops when the all areas in the multidimensional space fit xe2x80x9csufficientlyxe2x80x9d well to the model. The process chooses a value for the cluster number K.
An exemplary embodiment of the invention starts with one cluster having an arbitrarily assigned mean or centroid for each dimension of data within the database. This one cluster is selected without accessing the data in the database. Using this starting point data is retrieved or read from the database and a cluster model is built from the data using the single starting cluster.
The cluster model is next broken up into areas (tiles) based on user defined parameters. Each cluster is split to a number of tiles based on the data. Data from the database is then classified based on the tiling information. A sorted version of the classified data, ordered by cluster number and then by the tile number within the cluster is generated. This data is then evaluated to test the sufficiency of the model created during the clustering. If the model is insufficient (inaccurate) in modeling the data then new cluster initialization points are generated. The data is again clustered and a clustering model created. The sufficiency of this model is then again evaluated and the process of clustering stopped when a xe2x80x98goodxe2x80x99 model has been created.
For large databases a refinement is to cluster using only a sample of the data but to perform classification on the entire data set. For large databases this model that has been judged to be sufficient provides an initialization point for a full clustering of the data in the database by one of a number of possible clustering procedures.
It is possible to approximately compare the cost of clustering the data when the number of clusters is know versus the cost in determining the cluster number and then clustering. The invention enables effective and accurate clustering in computational times that are approximately a factor of two greater than optimal clustering algorithms that have complete knowledge of the number of clusters before clustering (a unrealistic assumption in real world applications). Further, the proposed technique is applicable for both continuous and discrete clustering. A special feature of this method is that it determines good initial starting points for the clustering process in addition to determining K. Further, this method can be used with most variants of clustering algorithms, including scaleable clustering algorithms that cluster in just one pass over the data using a limited amount of computer memory.