This invention relates in general to a relational database management system, and in particular, to an analytic algorithm implemented in SQL for performing cluster analysis in very large databases.
Relational databases are the predominate form of database management systems used in computer systems. Relational database management systems are often used in so-called xe2x80x9cdata warehousexe2x80x9d applications where enormous amounts of data are stored and processed. In recent years, several trends have converged to create a new class of data warehousing applications known as data mining applications. Data mining is the process of identifying and interpreting patterns in databases, and can be generalized into three stages.
Stage one is the reporting stage, which analyzes the data to determine what happened. Generally, most data warehouse implementations start with a focused application in a specific functional area of the business. These applications usually focus on reporting historical snap shots of business information that was previously difficult or impossible to access. Examples include Sales Revenue Reporting, Production Reporting and Inventory Reporting to name a few.
Stage two is the analyzing stage, which analyzes the data to determine why it happened. As stage one end-users gain previously unseen views of their business, they quickly seek to understand why certain events occurred; for example a decline in sales revenue. After discovering a reported decline in sales, data warehouse users will then obviously ask, xe2x80x9cWhy did sales go down?xe2x80x9d Learning the answer to this question typically involves probing the database through an iterative series of ad hoc or multidimensional queries until the root cause of the condition is discovered. Examples include Sales Analysis, Inventory Analysis or Production Analysis.
Stage three is the predicting stage, which tries to determine what will happen. As stage two users become more sophisticated, they begin to extend their analysis to include prediction of unknown events. For example, xe2x80x9cWhich end-users are likely to buy a particular productxe2x80x9d, or xe2x80x9cWho is at risk of leaving for the competition?xe2x80x9d It is difficult for humans to see or interpret subtle relationships in data, hence as data warehouse users evolve to sophisticated predictive analysis they soon reach the limits of traditional query and reporting tools. Data mining helps end-users break through these limitations by leveraging intelligent software tools to shift some of the analysis burden from the human to the machine, enabling the discovery of relationships that were previously unknown.
Many data mining technologies are available, from single algorithm solutions to complete tool suites. Most of these technologies, however, are used in a desktop environment where little data is captured and maintained. Therefore, most data mining tools are used to analyze small data samples, which were gathered from various sources into proprietary data structures or flat files. On the other hand, organizations are beginning to amass very large databases and end-users are asking more complex questions requiring access to these large databases.
Unfortunately, most data mining technologies cannot be used with large volumes of data. Further, most analytical techniques used in data mining are algorithmic-based rather than data-driven, and as such, there are currently little synergy between data mining and data warehouses. Moreover, from a usability perspective, traditional data mining techniques are too complex for use by database administrators and application programmers, and are too difficult to change for a different industry or a different customer.
One analytic algorithm that performs the task of modeling multidimensional data is xe2x80x9ccluster analysis.xe2x80x9d Cluster analysis finds groupings in the data, and identifies homogenous ones of the groupings as clusters. If the database is large, then the cluster analysis must be scalable, so that it can be completed within a practical time limit.
In the prior art, cluster analysis typically does not work well with large databases due to memory limitations and the execution times required. Often, the solution to finding clusters from massive amounts of detailed data has been addressed by data reduction or sampling, because of the inability to handle large volumes of data. However, data reduction or sampling results in the potential loss of information.
Thus, there is a need in the art for data mining applications that directly operate against data warehouses, and that allow non-statisticians to benefit from advanced mathematical techniques available in a relational environment.
To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method for performing cluster analysis in a relational database management system utilizing an analytic algorithm implemented in SQL. The method defines a plurality of tables for the storage of data points and Gaussian mixture parameters and executes a series of SQL statements implementing a clustering algorithm to update the Gaussian mixture parameters stored within the tables.
In the described embodiment, the method is applied to perform clustering within a relational database management system to group a set of n data points into a set of k clusters, each data point having a dimensionality p. A first table, C, having p columns and k rows, is established for the storage of means values, each column of C representing a cluster; a second table, R, having p columns and p rows, is established for the storage of covariance values; and a third table, W, having k columns and 1 row, is established for the storage of weight values; where n represents the number of data points, k is the number of clusters, and p represents the dimensionality of each data point. SQL commands implementing an Expectation-Maximization clustering algorithm are executed to iteratively update the means values, covariance values and weight values stored in the first, second and third tables.
The above and other objects, features, and advantages of the present invention will become apparent from the following description and the attached drawings.