The invention pertains to partition size analysis for very large databases having multiple partitions.
Information systems have become vitally important to modern businesses, and the growing reliance on information systems has made database design and management a critical task. Many databases have grown to such a large size that multiple partitions are required to accommodate them. As a result, and because of the dynamic nature of the shared data pool contained in large databases, partition size analysis is an important part of planning for future growth.
A problem arises, however, in the amount of time required for an analysis program to traverse a database and compile statistics relating to partition size. It would be beneficial to provide a method of partition size analysis that reduces the amount of time required to perform the analysis, so that such analyses can be executed on a more timely basis without placing an undue burden on the computer system hosting the database. The ability to perform size analyses on a timely basis allows database managers to monitor growth patterns and to accurately estimate needs for database reorganization, both in predicting the time of a required reorganization and in projecting space allocation requirements.
Partition size analyses require only a sufficiently accurate approximate solution, as compared to the very precise solution obtainable by analyzing each and every item of data in a database. It is of little worth to provide a precisely accurate solution for a volatile database that is constantly changing, including at the very moment that it is being analyzed. It is typically not possible to provide an exact analysis without first taking a database offline for an extended period of time. For size analyses, only a small portion of the full set of data must be processed to provide an accurate estimate of partition size, especially for very large homogeneous databases.
The present invention provides a method and system for performing database characterization and approximation analyses to generate very precise as well as timely results. The method is based on first deriving a random sample of known size from a database of known or unknown size, and then extrapolating the results to provide an accurate approximation of a full-scale analysis.
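The extrapolation step described above can be illustrated with a minimal sketch. It is not the claimed method. It assumes that a uniform random sample has already been drawn, that each sampled record is represented as a hypothetical (partition identifier, record size in bytes) pair, and that a known or estimated total record count is available. The function name `estimate_partition_sizes` and its parameters are illustrative only.

```python
from collections import defaultdict


def estimate_partition_sizes(sample, total_records):
    """Scale per-partition byte totals observed in a sample up to the full database.

    sample: list of (partition_id, record_bytes) pairs, assumed to be drawn
            uniformly at random from the database (assumption of this sketch).
    total_records: known or estimated number of records in the whole database
            (assumption of this sketch).
    """
    if not sample:
        raise ValueError("sample must not be empty")

    # Each sampled record stands in for this many records of the database.
    scale = total_records / len(sample)

    # Sum the bytes observed in the sample for each partition.
    sampled_bytes = defaultdict(int)
    for partition_id, record_bytes in sample:
        sampled_bytes[partition_id] += record_bytes

    # Extrapolate the sampled totals to estimated full-partition sizes.
    return {pid: total * scale for pid, total in sampled_bytes.items()}
```

For example, a sample of four records taken from a database estimated to hold 1,000 records gives each sampled record a weight of 250, so a partition whose sampled records total 400 bytes would be estimated at 100,000 bytes.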
The method and system provided are unique in that a random sample of predetermined, known size, uniformly distributed across the entire database, is selected from a database of known or unknown size while reading only a fraction of the records in the database. This is done without the requirement of indexing the entire database, which, as indicated above, is time consuming and provides results having an unnecessary degree of precision. The sampling facility is provided as a built-in feature of the database management system (DBMS) and is not simply attached to the DBMS as an associated external application. This enables earlier pruning and better performance because the sampling function is closer to the source database.
Previous random sampling techniques typically either require that the database be indexed so that the entire database need not be read, or read the entire database and randomly select samples from the entire result. As an example, U.S. Pat. No. 5,675,786 teaches a system that generates a sequential stream of initial results by first addressing a query to the database and then sampling the stream of initial results to produce a sampled result substantially smaller than the initial result.
Producing samples of predetermined size that are normally distributed across a database typically requires knowledge of the exact number of records in the database beforehand. As an alternative to prior knowledge of the number of records, a complete scan of the database must be performed prior to sampling. For example, the '786 patent identified above requires that a particular sampling probability be selected in order to produce a particular sample size from a given result.
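The dependence of a fixed sampling probability on the record count can be illustrated with a generic sketch of probability-based (Bernoulli) sampling over a stream. This is not the implementation taught by the '786 patent. The function name `bernoulli_sample` and its parameters are hypothetical, and the probability is assumed to be chosen as the target sample size divided by the total record count.

```python
import random


def bernoulli_sample(records, target_size, total_records, rng=None):
    """Keep each record independently with probability target_size / total_records.

    total_records must be known before sampling begins, which is the
    limitation discussed above. The returned sample has target_size records
    only in expectation, and every record in the stream is still read.
    """
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    rng = rng or random.Random()

    # The sampling probability can only be fixed once the total is known.
    p = min(1.0, target_size / total_records)

    # random() returns a value in [0.0, 1.0), so p == 1.0 keeps every record
    # and p == 0.0 keeps none.
    return [record for record in records if rng.random() < p]
```

Without the total record count the probability cannot be chosen, and even with it the resulting sample size varies from run to run around the target rather than being fixed in advance.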
The present invention therefore provides a solution to the aforementioned problems, and offers other advantages over the prior art.