A database is a collection of stored data that is logically related and that is accessible by one or more users. A popular type of database system is the relational database management system (RDBMS), which stores data in relational tables made up of rows and columns (also referred to as tuples and attributes, respectively). Each row represents an occurrence of an entity defined by a table, with an entity being a person, place, thing, or other object about which the table contains information.
To extract data from, or to update, a relational table in an RDBMS, queries according to a standard database query language (e.g., Structured Query Language or SQL) are used. Examples of SQL statements include INSERT, SELECT, UPDATE, and DELETE.
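As a minimal sketch of the four statement types named above, the following uses Python's built-in sqlite3 module with an in-memory database. The table name shoppers and its columns are hypothetical, chosen only for illustration, and SQLite stands in for whatever RDBMS is actually in use.

```python
import sqlite3

# In-memory database; the "shoppers" table and its columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shoppers (id INTEGER PRIMARY KEY, name TEXT, gender TEXT)"
)

# INSERT adds rows (tuples) to the table.
conn.executemany(
    "INSERT INTO shoppers (id, name, gender) VALUES (?, ?, ?)",
    [(1, "Alice", "F"), (2, "Bob", "M"), (3, "Carl", "M")],
)

# UPDATE changes column (attribute) values in existing rows.
conn.execute("UPDATE shoppers SET name = ? WHERE id = ?", ("Robert", 2))

# DELETE removes the rows that match a condition.
conn.execute("DELETE FROM shoppers WHERE id = ?", (3,))

# SELECT extracts rows from the table.
rows = conn.execute(
    "SELECT id, name, gender FROM shoppers ORDER BY id"
).fetchall()
print(rows)
```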
A common approach to data analysis on a large database is to work with samples of the data. A sample is a subset of the data chosen randomly so as to be representative of the entire data set. By working with samples instead of the entire data set, processing time and system resource usage are substantially reduced.
The entire population of data contained in a data set may not be homogeneous. For example, in maintaining records of shoppers at a retail outlet, it may be determined that 80% of the shoppers are male while 20% of the shoppers are female. If this is the case, it is sometimes desirable to obtain stratified random samples, as compared to simple random samples. Stratified random sampling involves dividing a given population into homogeneous subgroups and then taking a simple random sample in each subgroup. Thus, in the above example, the population is divided into two subgroups, one female and one male.
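The procedure can be sketched outside of a database as follows. This is an illustrative sketch only: the function name stratified_sample, the 10% sampling fraction, and the synthetic 80/20 population are assumptions made for the example and do not come from any particular system.

```python
import random
from collections import defaultdict

def stratified_sample(records, key, fraction, seed=None):
    """Group records by key(record), then draw a simple random
    sample (without replacement) from each group."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    samples = {}
    for label, members in strata.items():
        # Round to a whole number of records, keeping at least one per subgroup.
        k = max(1, round(fraction * len(members)))
        samples[label] = rng.sample(members, k)
    return samples

# Hypothetical population matching the example: 80% male, 20% female.
population = [(i, "M") for i in range(80)] + [(i, "F") for i in range(80, 100)]
samples = stratified_sample(population, key=lambda r: r[1], fraction=0.10, seed=0)
print({label: len(rows) for label, rows in samples.items()})
```

Sampling each subgroup separately means the smaller subgroup is always represented in the result, whereas a simple random sample of the whole population could under-represent it or miss it entirely.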
In conventional database systems, taking stratified random samples requires multiple passes through the data set, one for each subgroup. Thus, to obtain the stratified random samples, multiple SQL queries, one per subgroup, are needed. This is due to the traditional SQL requirement that every query return only one relation as its result. Conventional techniques of obtaining stratified random samples are thus inefficient.
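The one-query-per-subgroup pattern might look like the following sketch, again using Python's sqlite3 module. The shoppers table, the per-subgroup sample sizes, and the use of SQLite's ORDER BY RANDOM() with LIMIT to draw a random sample are all assumptions for illustration; other database systems provide their own sampling syntax.

```python
import sqlite3

# Hypothetical table: 80 male and 20 female shoppers.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE shoppers (id INTEGER PRIMARY KEY, gender TEXT)")
conn.executemany(
    "INSERT INTO shoppers (id, gender) VALUES (?, ?)",
    [(i, "M" if i < 80 else "F") for i in range(100)],
)

# Assumed sample size for each subgroup.
sample_sizes = {"M": 8, "F": 2}

# One query per subgroup: each query returns a single relation, and
# each one makes its own pass over the table to find matching rows.
samples = {}
for gender, size in sample_sizes.items():
    samples[gender] = conn.execute(
        "SELECT id, gender FROM shoppers "
        "WHERE gender = ? ORDER BY RANDOM() LIMIT ?",
        (gender, size),
    ).fetchall()

print({gender: len(rows) for gender, rows in samples.items()})
```

With two subgroups the table is read twice; with many subgroups the number of queries, and hence the number of passes, grows accordingly.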