Computer database systems manage the storage and retrieval of data in a database. A database comprises a set of tables of data along with information about relations between the tables. Tables represent relations over the data. Each table comprises a set of records or tuples of data stored in one or more data fields. The records of a table are also referred to as rows, and the data fields of records in a table are also referred to as columns. A database server processes data manipulation statements or queries, for example, to retrieve, insert, delete, and update data in a database. Queries are defined by a query language supported by the database system.
For large databases such as data warehouses, typical tools such as On Line Analytical Processing (OLAP) and data mining serve as middleware or application servers that communicate data retrieval requests to a backend database system through a query. Although executing ad-hoc queries against the backend can be expensive, many data mining applications and statistical analysis techniques can use a sample of the data requested through the query. Similarly, OLAP servers that answer queries involving aggregation (e.g., “find total sales for all products in the NorthWest region between Jan. 1, 1998 and Jan. 15, 1998”) benefit from the ability to present to the user an approximate answer computed from a sample of the result of the query posed to the database.
Sampling is preferably supported not only on existing stored or base relations but also on relations produced as a result of an arbitrary query. Sampling may be supported in relational databases as a primitive operation SAMPLE(R,f), for example, to produce a sample S of r tuples that is an f-fraction of a relation R. Fully evaluating a query Q to compute relation R only to discard most of relation R when applying SAMPLE(R,f), however, is inefficient. Preferably, query Q may be partially evaluated so as to produce only sample S of relation R.
For a given query tree T for computing a relation R that is the result of a query Q where SAMPLE(R,f) is the root or last operation of query tree T, pushing the sample operation down tree T toward its leaves would help minimize the cost of evaluating query Q as only a small fraction of stored and/or intermediate relations would be considered in evaluating query Q. The ability to commute the sample operation in this manner, however, depends on the relational operations used in query tree T. The standard relational operation of selection can be freely interchanged with sampling. With join operations, however, sampling may not be so easily commuted.
FIG. 1 illustrates a query tree 100 for obtaining a sample of a join of operand relations R1 and R2. Query tree 100 is executed in accordance with a flow diagram 200 of FIG. 2. For step 202 of FIG. 2, a relation J is computed by joining R1 and R2, or J=R1⋈R2. For step 204, r tuples are randomly sampled from relation J to produce a sample relation S. Commuting the sample operation in query tree 100 to operand relations R1 and R2, as illustrated by a query tree 300 in FIG. 3, would minimize the cost of obtaining a join sample because only samples of operand relations R1 and R2 would need to be joined. A join of samples of operand relations R1 and R2, however, will not likely give a random sample of the join of operand relations R1 and R2.
As one example:
R1(A,B)={(a1,b0), (a2,b1), (a2,b2), (a2,b3), . . . , (a2,bn)}
and
R2(A,C)={(a2,c0), (a1,c1), (a1,c2), (a1,c3), . . . , (a1,cn)}.
That is, relation R1 is defined over attributes A and B. Among the n+1 tuples of relation R1, one tuple has an A-value a1 and n tuples have an A-value a2, but all n+1 tuples of relation R1 have distinct B-values. Similarly, relation R2 is defined over attributes A and C. Among the n+1 tuples of relation R2, n tuples have an A-value a1 and one tuple has an A-value a2, but all n+1 tuples of relation R2 have distinct C-values.
Computing the equi-join of relations R1 and R2 over attribute A produces the following relation:
J=R1⋈R2={(a1,b0,c1), (a1,b0,c2), (a1,b0,c3), . . . , (a1,b0,cn), (a2,b1,c0), (a2,b2,c0), (a2,b3,c0), . . . , (a2,bn,c0)}.
That is, relation J has n tuples with A-value a1 and n tuples with A-value a2.
About one half of the tuples in a random sample S of relation J, or S⊂J, would likely have an A-value of a1 while the remaining tuples would have an A-value of a2. A random sample S1 of relation R1, or S1⊂R1, however, would not likely comprise tuple (a1,b0), and a random sample S2 of relation R2, or S2⊂R2, would not likely comprise tuple (a2,c0). The join of samples S1 and S2 would then likely comprise no tuples and therefore would not likely give random sample S of relation J.
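The example above can be reproduced with a short Python sketch. The helper names `build_example`, `equi_join`, and `compare` are hypothetical and chosen only for illustration, relations are modeled as in-memory lists of pairs, and the fixed seed serves only to make runs repeatable.

```python
import random

def build_example(n):
    """Build R1(A,B) and R2(A,C) as in the example: each has n+1 tuples."""
    r1 = [("a1", "b0")] + [("a2", f"b{i}") for i in range(1, n + 1)]
    r2 = [("a2", "c0")] + [("a1", f"c{i}") for i in range(1, n + 1)]
    return r1, r2

def equi_join(r1, r2):
    """Equi-join two lists of (A, x) pairs on attribute A."""
    by_a = {}
    for a, c in r2:
        by_a.setdefault(a, []).append(c)
    return [(a, b, c) for a, b in r1 for c in by_a.get(a, [])]

def compare(n, k, seed=0):
    """Return (J, join of k-tuple samples of R1 and R2, k-tuple sample of J)."""
    rng = random.Random(seed)
    r1, r2 = build_example(n)
    j = equi_join(r1, r2)
    s1 = rng.sample(r1, k)   # sample of R1, without replacement
    s2 = rng.sample(r2, k)   # sample of R2, without replacement
    return j, equi_join(s1, s2), rng.sample(j, k)

j, join_of_samples, sample_of_join = compare(n=1000, k=10)
print(len(j), len(join_of_samples), len(sample_of_join))
```

For large n and small k, the join of the samples is usually empty, because it can contain a tuple only when S1 happens to include (a1,b0) or S2 happens to include (a2,c0), whereas the sample drawn from J always has k tuples.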
One prior sampling strategy for obtaining a sample S of a join of two relations R1 and R2 with respect to a join attribute A is illustrated as a flow diagram 400 in FIG. 4.
For notational purposes, relations R1 and R2 have sizes n1 and n2, respectively. The domain of join attribute A is denoted by D. For each value v of domain D, or v∈D, m1(v) and m2(v) denote the number of distinct tuples in relations R1 and R2, respectively, that contain value v in attribute A. Then, Σ_{v∈D} m1(v)=n1 and Σ_{v∈D} m2(v)=n2. A relation J results from the computation of the join of relations R1 and R2, or J=R1⋈R2, and n is the size of relation J, or n=|J|=|R1⋈R2|. Then, n=Σ_{v∈D} m1(v)m2(v). For each tuple t of relation R1, the set of tuples in relation R2 that join with tuple t is denoted as Jt(R2)={t′∈R2 | t′.A=t.A}; t⋈R2 denotes the set of tuples in R1⋈R2 obtained by joining tuple t with the tuples in Jt(R2); and |t⋈R2|=|Jt(R2)|=m2(t.A). Similarly, for each tuple t of relation R2, Jt(R1)={t′∈R1 | t′.A=t.A}; R1⋈t denotes the set of tuples in R1⋈R2 obtained by joining tuples in Jt(R1) with tuple t; and |R1⋈t|=|Jt(R1)|=m1(t.A).
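The join-size identity n=Σ m1(v)m2(v) can be checked numerically. The following is a minimal sketch, assuming relations are lists of tuples whose first component is the join attribute A; the function names `frequencies` and `join_size` are illustrative and not taken from the text.

```python
from collections import Counter

def frequencies(rel):
    """Return m(v): the number of tuples in rel with join value v."""
    return Counter(t[0] for t in rel)

def join_size(r1, r2):
    """Compute n = sum over v of m1(v) * m2(v) without building the join."""
    m1, m2 = frequencies(r1), frequencies(r2)
    return sum(m1[v] * m2[v] for v in m1)

r1 = [("a1", "b0"), ("a2", "b1"), ("a2", "b2")]
r2 = [("a2", "c0"), ("a1", "c1"), ("a1", "c2")]
print(join_size(r1, r2))  # m1(a1)*m2(a1) + m1(a2)*m2(a2) = 1*2 + 2*1 = 4
```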
For step 402 of FIG. 4, a variable r is initialized to the size of a sample relation S to be obtained from the join of relations R1 and R2. For step 404, a variable M is initialized to an upper bound on m2(v) for all values v of domain D on attribute A. That is, M bounds the number of tuples in relation R2 that share any one join attribute value. A tuple t1 is randomly sampled from relation R1 for step 406. A tuple t2 is then randomly sampled for step 408 from among all tuples of relation R2 having a join attribute value t2.A that matches the join attribute value t1.A of tuple t1. For step 410, a tuple T is computed as T=t1⋈t2 and output for sample relation S with a probability equal to the number of tuples in relation R2 having a join attribute value that matches that of tuple t1, divided by M, or m2(t2.A)/M. If tuple T is not output, the sample tuple t1 is rejected for step 410. If r tuples have not yet been output for sample relation S as determined for step 412, steps 406 through 412 are repeated until r tuples have been output to form sample relation S. Flow diagram 400 then ends for step 414.
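The steps of FIG. 4 can be sketched in Python as follows. This is an illustrative reading only: relations are assumed to be in-memory lists of (A, x) pairs, a dictionary stands in for the index on R2, M is taken to be the exact maximum of m2(v), and the name `sample_join_rejection` is hypothetical.

```python
import random
from collections import defaultdict

def sample_join_rejection(r1, r2, r, rng=None):
    """Draw r tuples (with replacement) from the join of r1 and r2 on A."""
    rng = rng or random.Random()
    index = defaultdict(list)            # stand-in for an index on R2.A
    for t2 in r2:
        index[t2[0]].append(t2)
    if not any(t1[0] in index for t1 in r1):
        raise ValueError("join is empty; no sample can be drawn")
    big_m = max(len(ts) for ts in index.values())   # step 404: M
    sample = []
    while len(sample) < r:                          # step 412
        t1 = rng.choice(r1)                         # step 406
        matches = index.get(t1[0], [])
        if not matches:
            continue                                # m2(t1.A) = 0: reject
        t2 = rng.choice(matches)                    # step 408
        # Step 410: accept with probability m2(t1.A) / M, else reject t1.
        if rng.random() < len(matches) / big_m:
            sample.append((t1[0], t1[1], t2[1]))
    return sample
```

The sketch makes the dependence on random access visible: both `rng.choice(r1)` and the lookup in `index` require the operands to be materialized and indexed.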
The sampling technique of FIG. 4 in practice, however, requires indexes for random access to relations R1 and R2, noting relation R1 must be materialized for proper sampling because the rejection of tuples for step 410 requires that the number of samples from relation R1 be a random variable having a distribution dependent upon the distribution of join attribute values in relation R2. This strategy therefore has limited applicability in commuting sampling with joins involving intermediate relations that are produced as a result of an arbitrary query in a query tree and that are not materialized and indexed.
The ability to sample tuples produced as a stream, that is to perform sequential sampling, is significant not only because intermediate relations produced by a pipeline, such as in a query tree for example, may be sampled without materialization but also because a relation, whether materialized or not, may be sampled in a single pass. How and whether sequential sampling may be performed, however, may depend on the chosen semantics for the sampling.
The tuples of a relation may be sampled, for example, using with replacement (WR), without replacement (WoR), or independent coin flips (CF) semantics.
For WR sampling of an f-fraction of the n tuples in a relation R, each sampled tuple is chosen uniformly and independently from among all tuples in relation R, noting any one tuple could be sampled multiple times. The sample is a bag or multiset of f*n tuples from relation R.
For WoR sampling of an f-fraction of the n tuples in a relation R, f*n distinct tuples are sampled from relation R, noting each successive sampled tuple is chosen uniformly from the set of tuples not yet sampled. The sample is a set of f*n distinct tuples from relation R.
For CF sampling an f-fraction of the n tuples in a relation R, each tuple in relation R is chosen for the sample with probability f, independent of other tuples. Sampling in this manner is analogous to flipping a coin with bias f for each tuple in turn. The sample is a set of X distinct tuples from relation R, where X is a random variable with the binomial distribution B(n,f) and has expectation f*n. The binomial distribution B(n,f) is, in effect, the distribution of a random value generated by counting the total number of heads when flipping n independent coins, each of which has a probability f of being heads. Sampling using independent coin flip semantics is also called binomial sampling.
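The three semantics can be contrasted with a minimal Python sketch over a materialized relation. The function names are hypothetical, the relation is assumed to be a list, and f*n is rounded to the nearest integer, which is an assumption the text does not address.

```python
import random

def sample_wr(rel, f, rng):
    """With replacement: f*n independent uniform draws; duplicates allowed."""
    return [rng.choice(rel) for _ in range(round(f * len(rel)))]

def sample_wor(rel, f, rng):
    """Without replacement: f*n distinct tuples."""
    return rng.sample(rel, round(f * len(rel)))

def sample_cf(rel, f, rng):
    """Coin flips: keep each tuple independently with probability f."""
    return [t for t in rel if rng.random() < f]

rng = random.Random(42)
rel = list(range(100))
print(len(sample_wr(rel, 0.1, rng)), len(sample_wor(rel, 0.1, rng)))
print(len(sample_cf(rel, 0.1, rng)))  # size varies; binomial B(100, 0.1)
```

Only the CF sample has a random size; the WR and WoR samples always contain f*n tuples, with repeats possible only under WR.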
The sampling of a relation may also be weighted or unweighted. For unweighted sampling, each element is sampled uniformly at random. For weighted sampling, each element is sampled with a probability proportional to its weight for some pre-specified set of weights.
One prior sequential sampling technique uses CF semantics by flipping a coin with probability f of heads for each passing tuple of a relation R and adding the tuple to a sample S if the coin comes up heads. Another prior sequential sampling technique uses WoR semantics by initializing a list or reservoir of r tuples with the first r tuples of relation R and repeatedly removing random tuples from the list while adding tuples from relation R to the end of the list to produce a sample S. Neither of these techniques requires the size of relation R in advance, and each may therefore be used for sampling relations that are not materialized. Each of these techniques also preserves sortedness by producing a sample of tuples in the same relative order as in relation R. The reservoir sampling technique, however, does not produce a sequential output of tuples as no tuples are output until the technique has terminated. In the case of scanning a materialized relation on a disk, however, the reservoir sampling technique may be efficient by reading only those tuples to be entered in the reservoir by generating random intervals of tuples to be skipped.
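Both prior techniques can be sketched over a Python iterator standing in for a stream of tuples. This is a minimal illustration: the function names are hypothetical, and the replacement probability r/seen is the standard reservoir choice, which the text above does not spell out. The skip-interval optimization for disk scans is not shown.

```python
import random

def coin_flip_stream(stream, f, rng):
    """CF semantics: emit each passing tuple with probability f."""
    for t in stream:
        if rng.random() < f:
            yield t

def reservoir_wor(stream, r, rng):
    """WoR semantics: keep r tuples, preserving their relative order."""
    reservoir = []
    for seen, t in enumerate(stream, start=1):
        if len(reservoir) < r:
            reservoir.append(t)              # fill with the first r tuples
        elif rng.random() < r / seen:
            # Remove a random resident tuple and append the new one at the
            # end, so the reservoir stays in stream order.
            del reservoir[rng.randrange(r)]
            reservoir.append(t)
    return reservoir                          # available only at the end

rng = random.Random(0)
print(list(coin_flip_stream(iter(range(20)), 0.25, rng)))
print(reservoir_wor(iter(range(1000)), 5, rng))
```

The coin-flip generator yields tuples as they pass, while the reservoir function returns nothing until the stream is exhausted, which mirrors the difference noted above.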
A sample operator for obtaining a sample of a plurality of records in a database system has the plurality of records and sampling semantics as parameters. The sampling semantics may be with replacement, without replacement, or coin flip sampling semantics. The sample operator may also have a size of the sample as a parameter and/or a weight function as a parameter to specify a sampling weight for each record.
Another sample operator for obtaining a sample of a plurality of records in a database system has the plurality of records as a parameter and a weight function as a parameter to specify a sampling weight for each record. The sample operator may also have a size of the sample as a parameter.
A method obtains a sample from a plurality of records in a database system. The method may be implemented by computer-executable instructions of a computer readable medium.
For the method, the plurality of records and sampling semantics are identified from parameters of a sample operator, and a sample is obtained from the identified plurality of records using the identified sampling semantics. The identified sampling semantics may be with replacement, without replacement, or coin flip sampling semantics. A size of the sample to be obtained may be identified from a parameter of the sample operator, and the sample may be obtained from the identified plurality of records based on the identified sample size. A weight function may be identified from a parameter of the sample operator to specify a weight for each record, and the sample may be obtained from the identified plurality of records based on the specified weight of each record.
The sample may be obtained by obtaining one record from the plurality of records, selectively outputting the one record one or more times based on a probability, and repeating these steps for one or more other records of the plurality of records to obtain the sample. The sample may also be obtained by obtaining one record from the plurality of records, selectively resetting one or more records of a reservoir to be the one record based on a probability, and repeating these steps for other records of the plurality of records such that the records of the reservoir form the sample.
Another method obtains a sample from a plurality of records in a database system. The method may be implemented by computer-executable instructions of a computer readable medium.
For the method, the plurality of records and a weight function are identified from parameters of a sample operator, wherein the weight function specifies a weight for each record, and a sample is obtained from the identified plurality of records based on the specified weight of each record. A size of the sample to be obtained may be identified from a parameter of the sample operator, and the sample may be obtained from the identified plurality of records based on the identified sample size.
The sample may be obtained by obtaining one record from the plurality of records and the weight specified for the one record, selectively outputting the one record one or more times based on the weight specified for the one record, and repeating these steps for one or more other records of the plurality of records to obtain the sample. The sample may also be obtained by obtaining one record from the plurality of records and the weight specified for the one record, selectively resetting one or more records of a reservoir to be the one record based on the weight specified for the one record, and repeating these steps for other records of the plurality of records such that the records of the reservoir form the sample.
Another method performs a sequential sampling of records in one pass in a database system. The method may be implemented by computer-executable instructions of a computer readable medium. The database system may perform the method with suitable means.
For the method, one record from a plurality of records is obtained and selectively output one or more times based on a probability. The plurality of records may be a relation produced as a stream of records as a result of a query or may be materialized as a base relation in a database of the database system.
The one record may be selectively output by determining a random number based on the probability such that the random number is greater than or equal to zero and outputting the one record the determined random number of times. The random number may be determined from a binomial distribution based on the probability. The random number may be determined based on a probability based on a number of record(s) of the plurality of records to be evaluated for output, based on a probability based on a weight of the one record divided by a sum of weight(s) of record(s) of the plurality of records to be evaluated for output, or based on a probability based on a fraction of the plurality of records. The random number may be determined such that the random number is less than or equal to a number of record(s) remaining to be output for the sample or such that the random number is less than or equal to a weight of the one record.
The one record may be selectively output one or more times based on a weight specified for the one record. The one record may be evaluated for output a number of times equal to the weight of the one record, each time being selectively output based on a probability, and that probability may be based on a number of record(s) remaining to be output for the sample divided by a number of possible record(s) that may be output.
These steps are repeated for one or more other records of the plurality of records to form a sample of the plurality of records, wherein at least one obtained record may be output more than one time. The plurality of records may form a relation, and the sample may be joined with records of another relation.
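One possible reading of this one-pass method is sketched below in Python. It is an assumption-laden illustration, not the claimed implementation: the total weight of the relation is assumed known in advance (for unweighted sampling, the tuple count n), weights are assumed to be non-negative integers, the binomial draw is simulated by repeated coin flips, and the names `binomial` and `sequential_wr` are hypothetical.

```python
import random

def binomial(trials, p, rng):
    """Number of successes in `trials` independent flips with bias p."""
    return sum(1 for _ in range(trials) if rng.random() < p)

def sequential_wr(stream, total_weight, r, rng, weight=lambda t: 1):
    """Emit a with-replacement sample of size r in a single pass.

    Each tuple is output X times, where X is binomial over the copies still
    to be output, with probability weight(t) / (weight not yet scanned).
    """
    remaining = r                 # copies still to be output
    weight_left = total_weight    # weight of this and all later tuples
    for t in stream:
        if remaining == 0 or weight_left <= 0:
            break
        w = weight(t)
        x = binomial(remaining, min(1.0, w / weight_left), rng)
        for _ in range(x):
            yield t               # X is at most `remaining`
        remaining -= x
        weight_left -= w

rng = random.Random(1)
print(list(sequential_wr(iter(range(50)), 50, 10, rng)))
```

Because tuples are emitted as they are scanned, the output keeps the order of the input and can feed a downstream operator such as a join without materializing the sample. If the stream carries less weight than `total_weight`, fewer than r copies are produced.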
Another method performs a sequential sampling of records in one pass in a database system. The method may be implemented by computer-executable instructions of a computer readable medium. The database system may perform the method with suitable means.
For the method, one record from a plurality of records is obtained, and one or more records of a reservoir are selectively reset to be the one record based on a probability.
Each record of the reservoir may be selectively reset to be the one record based on a probability. One or more records of the reservoir may be selectively reset to be the one record with a probability based on a number of record(s) that have been obtained. One or more records of the reservoir may be selectively reset to be the one record based on a weight of the one record. One or more records of the reservoir may be selectively reset to be the one record with a probability based on a weight of the one record divided by a sum of weight(s) of record(s) that have been obtained.
A random record of the reservoir may be evaluated for reset a number of times equal to the weight of the one record, each time being selectively reset to be the one record based on a probability. A random record of the reservoir may be selectively reset to be the one record with a probability based on a number of records in the reservoir divided by a sum of record(s) evaluated for reset in the reservoir.
These steps are repeated for other records of the plurality of records such that the records of the reservoir form a sample of the plurality of records, wherein at least one obtained record may be used to reset more than one record of the reservoir. The plurality of records may form a relation, and the sample may be joined with records of another relation.
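The reservoir-based method can be sketched in Python under one reading of the description above: each reservoir slot is independently reset to the current tuple with probability equal to the tuple's weight divided by the total weight seen so far. The name `reservoir_wr` and the default unit weight are illustrative assumptions.

```python
import random

def reservoir_wr(stream, r, rng, weight=lambda t: 1):
    """Maintain r slots that form a with-replacement sample of the stream."""
    reservoir = [None] * r        # stays None only if the stream is empty
    seen_weight = 0
    for t in stream:
        w = weight(t)
        if w <= 0:
            continue              # zero-weight tuples are never sampled
        seen_weight += w
        p = w / seen_weight       # equals 1 for the first tuple seen
        for j in range(r):
            if rng.random() < p:
                reservoir[j] = t  # one tuple may reset several slots
    return reservoir

rng = random.Random(5)
print(reservoir_wr(iter(range(100)), 8, rng))
```

Unlike the sequential sketch, this version needs neither the size nor the total weight of the relation in advance, but it yields its sample only after the whole stream has been scanned.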