The invention pertains to partition size analysis for very large databases having multiple partitions and, more particularly, to accurate, fast, and scalable characterization and estimation of large populations using a random sampling function that is integrated directly into a database engine.
Databases provide a convenient means to store and retrieve a wealth of information such as, in the business setting, individual and corporate accounts. They also provide a means to analyze business trends and to make other business, educational, and scientific decisions. Accordingly, over the years, typical database populations have grown to upward of a billion rows or records.
Analysis of these large databases for administration and replication purposes typically involves processes which are very input/output intensive, as numerous queries must be performed by an analysis program across a vast number of records. Random sampling by an associated application program outside of the database management system (DBMS) can reduce the number of records analyzed. However, the number of requests passed from an analysis program to the DBMS remains high because requests must be made not only for selected records but also to skip non-selected records.
It would be beneficial to provide a method and system for administration and replication of large databases, including a means for partition size analysis that reduces the time required to perform the analysis, so that such analyses can be executed on a more timely basis without placing an undue burden on the computer system hosting the database. The ability to perform size analyses on a timely basis allows database managers to monitor growth patterns and to accurately estimate needs for database reorganization, both in predicting the time of a required reorganization and in projecting space allocation requirements.
Partition size analyses require only a sufficiently accurate approximate solution, as compared to the very precise solution obtainable by analyzing each and every item of data in a database. It is of little worth to provide a precisely accurate solution for a volatile database that is constantly changing, including at the very moment that it is being analyzed. It is typically not possible to provide an exact analysis without first taking the database offline for an extended period of time. For size analyses, only a small portion of the full set of data must be processed to provide an accurate estimate of partition size, especially for very large homogeneous databases.
The present invention provides a method and system for performing database characterization and approximation analyses that generate accurate as well as timely results. The method is based on first deriving a random sample of known size from a database of known or unknown size, and then extrapolating the results to provide an accurate approximation of a full-scale analysis.
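The extrapolation step can be illustrated with a minimal sketch. It is not the claimed implementation; it assumes a uniform random sample of partition keys is already in hand and that the total record count is known or has been estimated separately. The function name `estimate_partition_counts` and the three-partition population are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def estimate_partition_counts(sample_keys, total_records):
    """Scale per-partition counts observed in a sample up to the full population.

    sample_keys   -- partition key of each sampled record (assumed uniform sample)
    total_records -- known or separately estimated size of the whole database
    """
    n = len(sample_keys)
    if n == 0:
        raise ValueError("sample must not be empty")
    counts = Counter(sample_keys)
    # Each partition's share of the sample approximates its share of the database.
    return {part: total_records * c / n for part, c in counts.items()}


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical population of 100,000 records split 50/30/20 across partitions.
    population = ["A"] * 50_000 + ["B"] * 30_000 + ["C"] * 20_000
    sample = rng.sample(population, 2_000)
    for part, est in sorted(estimate_partition_counts(sample, len(population)).items()):
        print(part, round(est))
```

In this sketch only 2 percent of the hypothetical population is inspected, and the accuracy of each estimate depends on the sample size rather than on the size of the database.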
The method and system provided are unique in that a random sample of predetermined, known size, uniformly distributed across the entire database, is selected from a database of known or unknown size while reading only a fraction of the records in the database. There is no requirement to index the entire database, which, as indicated above, is time consuming and provides results having an unnecessary degree of precision. The sampling facility is provided as a built-in feature of the database management system rather than simply attached to the DBMS as an associated external application. This enables earlier pruning and better performance because the sampling function is closer to the source database.
Previous random sampling techniques typically either require that the database be indexed in order to avoid reading the entire database, or read the entire database and randomly select samples from the entire result. As an example, U.S. Pat. No. 5,675,786 provides a simple sampling function in a database engine. The sampling function taught there generates a sequential stream of initial results by first addressing a query to the database and then samples the stream of initial results to produce a sampled result substantially smaller than the initial result.
The present invention, on the other hand, retrieves only a user-selectable fraction of the records stored in the database. This advantageously improves the overall performance of the system and accuracy of the results.
Producing samples of predetermined size that are uniformly distributed across a database typically requires knowledge of the exact number of records in the database beforehand. In the absence of such prior knowledge, a complete scan of the database prior to sampling is needed. For example, the '786 patent identified above requires that a particular sampling probability be selected in order to produce a particular sample size from a given result. The present invention, however, overcomes this requirement.
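The dependence of a fixed sampling probability on the record count can be illustrated with a generic sketch of fixed-probability (Bernoulli) sampling. It is not taken from the '786 patent and is not the present method; the helper names `probability_for_target` and `bernoulli_sample` are hypothetical.

```python
import random


def probability_for_target(target_size, total_records):
    """Sampling probability needed to obtain target_size records on average.

    The calculation requires total_records, which is exactly the prior
    knowledge (or prior full scan) discussed above.
    """
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return min(1.0, target_size / total_records)


def bernoulli_sample(records, p, rng):
    """Keep each record independently with probability p.

    Every record is still visited, and the resulting sample size is only
    approximately the target; it varies from run to run.
    """
    return [r for r in records if rng.random() < p]


if __name__ == "__main__":
    rng = random.Random(1)
    records = range(100_000)
    p = probability_for_target(1_000, len(records))
    sample = bernoulli_sample(records, p, rng)
    print(p, len(sample))
```

If the record count is wrong or unknown, the chosen probability yields a sample that is larger or smaller than intended, which is the limitation noted above.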
The present invention therefore provides a solution to the aforementioned problems, and offers other advantages over the prior art.