1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular, to random sampling of rows in a parallel processing database system.
2. Description of Related Art
Relational Data Base Management Systems (RDBMS) are well known in the art. In an RDBMS, all data is externally structured into tables. A table is a two-dimensional entity consisting of rows and columns. Each column has a name, typically describing the type of data held in that column. As new data is added, more rows are inserted into the table.
Structured Query Language (SQL) statements allow users to formulate relational operations on the tables. One of the most common SQL statements executed by an RDBMS generates a result set from one or more combinations of one or more tables (e.g., through joins) and other functions.
One problem in using an RDBMS is that of obtaining one or more mutually exclusive random samples of rows from a table partitioned across multiple processing units of a parallel processing database system. On many occasions in the data processing environment, one may not want to look at or process the whole table of rows, because analyzing random samples of the data can provide insight into the properties of the entire data set without requiring analysis of all of it. In such cases, it is extremely useful to be able to obtain one or more (mutually exclusive) random samples of the rows to look at or to process.
For example, instead of computing an average of the whole set, one may be satisfied with an average of a random sample of the rows in the table, which may be obtained more quickly. One may also want different samples to train, test and validate a neural network analysis. Without sampling support in the database system, however, an application would have to fetch the entire data from the database system or, at best, fetch a non-random sample by limiting the data in some non-random way, e.g., by looking at just the first N rows and ignoring the rest. These are not satisfactory alternatives.
Thus, there is a need in the art for improved random sampling of rows stored on a database system.
The present invention discloses a method, apparatus, and article of manufacture for random sampling of rows stored in a table, wherein the table has a plurality of partitions. A row count is determined for each of the partitions of the table and a total number of rows in the table is determined from the row count for each of the partitions of the table. A proportional allocation of a sample size is computed for each of the partitions based on the row count and the total number of rows. A sample set of rows of the sample size is retrieved from the table, wherein each of the partitions of the table contributes its proportional allocation of rows to the sample set of rows. Preferably, the computer system is a parallel processing database system, wherein each of its processing units manages a partition of the table, and some of the above steps can be performed in parallel by the processing units.
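The steps summarized above can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions, not as the disclosed implementation: the function names `proportional_allocation` and `sample_table` are hypothetical, in-memory lists stand in for the partitions managed by the processing units, and the largest-remainder rule used to round the fractional allocations is an assumption, since the summary does not state how rounding is handled.

```python
import random


def proportional_allocation(row_counts, sample_size):
    """Split sample_size across partitions in proportion to their row counts.

    Assumption: fractional shares are rounded with the largest-remainder
    rule so that the allocations add up to exactly sample_size.
    """
    total_rows = sum(row_counts)
    if sample_size < 0 or sample_size > total_rows:
        raise ValueError("sample_size must be between 0 and the total row count")
    if total_rows == 0:
        return [0] * len(row_counts)

    # Integer arithmetic avoids floating-point rounding surprises.
    scaled = [sample_size * count for count in row_counts]
    allocation = [s // total_rows for s in scaled]
    remainders = [s % total_rows for s in scaled]

    # Hand out the rows lost to flooring, largest remainder first.
    shortfall = sample_size - sum(allocation)
    by_remainder = sorted(
        range(len(row_counts)), key=lambda i: remainders[i], reverse=True
    )
    for i in by_remainder[:shortfall]:
        allocation[i] += 1
    return allocation


def sample_table(partitions, sample_size, rng=None):
    """Draw a random sample in which each partition contributes its share.

    Each per-partition draw is independent of the others, so in a parallel
    system every processing unit could perform its own draw concurrently.
    """
    rng = rng or random.Random()
    row_counts = [len(rows) for rows in partitions]
    allocation = proportional_allocation(row_counts, sample_size)

    sample = []
    for rows, share in zip(partitions, allocation):
        # random.sample draws without replacement within the partition.
        sample.extend(rng.sample(rows, share))
    return sample


if __name__ == "__main__":
    # Three partitions holding 500, 300 and 200 rows respectively.
    partitions = [
        list(range(0, 500)),
        list(range(500, 800)),
        list(range(800, 1000)),
    ]
    print(proportional_allocation([len(p) for p in partitions], 100))
    print(len(sample_table(partitions, 100, random.Random(0))))
```

In this sketch only the row counts and the resulting allocations need to be exchanged between partitions; the rows themselves are sampled locally, which is what allows the retrieval step to run in parallel.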
An object of the present invention is to incorporate sampling in a parallel processing database system.