Databases store large amounts of data in the form of "relations," which may be viewed as tables of data items. The size of a given relation may be on the order of several megabytes to several gigabytes or more, depending on the application. A relation stores the data items, which are also referred to as records or "tuples," in a predetermined format. For example, a relation in a database of bank customers may be structured to hold categories of information such as customer name, account number, account balance and the like. Each category is referred to as an "attribute" of the relation. The particular values entered in the attributes of the relation for a given customer constitute a tuple of that relation. In many applications, relations are continuously updated as tuples are inserted, deleted or modified during normal operation of the database. A database management system (DBMS) is used to manage these updates as well as other database operations, such as query processing, evaluation and optimization.
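By way of illustration, the following minimal Python sketch models the bank-customer relation described above as a collection of tuples. The attribute names and sample values are assumptions chosen only to mirror the example; they are not part of any particular database schema.

from typing import NamedTuple

# Illustrative schema for the bank-customer relation described above.
# The attribute names (name, account_number, balance) are assumed for
# illustration; a real relation may hold many more attributes.
class Customer(NamedTuple):
    name: str
    account_number: int
    balance: float

# Each Customer value is one tuple of the relation; the relation itself
# is the collection of all such tuples.
customers = [
    Customer("A. Smith", 1001, 2500.00),
    Customer("B. Jones", 1002, 130.75),
]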
It is often desirable for a given DBMS to be able to estimate the characteristics of database relations. For example, query optimizers of most relational database systems decide on the most efficient access plan for a given query based on estimates of the intermediate result sizes of queries and the costs of executing various plans, as described in P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie and T. Price, "Access path selection in a relational database management system," Proc. of ACM SIGMOD Conf., pp. 23-34, 1979. Also, quick estimates of query result characteristics can be useful to a user during query design and in providing approximate answers to queries, as described in, for example, J. M. Hellerstein, P. J. Haas and H. J. Wang, "Online aggregation," Proc. of ACM SIGMOD Conf., 1997.
Most DBMSs therefore maintain a variety of statistics on the contents of database relations in order to perform various estimations. These statistics usually take the form of approximations of the distributions of particular data values in attributes of the relations. Since the statistics are used to estimate the values of several parameters of interest to optimizers, the validity of the optimizer's decisions may be affected by the quality of these approximations, as described in S. Christodoulakis, "Implications of certain assumptions in database performance evaluation," ACM TODS, 9(2): 163-186, June 1984. In fact, it has been established in Y. Ioannidis and S. Christodoulakis, "On the propagation of errors in the size of join results," Proc. of ACM SIGMOD Conf., pp. 268-277, 1991, that selectivity estimation errors propagate exponentially with the number of operators in the query and hence will ultimately result in a poor-quality plan being chosen by the optimizer. This is becoming increasingly important in the context of more complex queries, such as decision support and data analysis queries.
A commonly-used technique for selectivity estimation involves maintaining histograms on the frequency distribution of an attribute. A histogram groups attribute values into subsets, referred to as "buckets," and approximates true attribute values and their frequency distributions based on summary statistics maintained in each bucket. Since histograms are only summaries of actual, much larger distributions, any estimations based on histograms may still incur errors. Nonetheless, for most real-world databases, there exist histograms that produce acceptably low-error estimates while occupying reasonably small space, which may be on the order of only hundreds of bytes. Histograms are used in conventional DBMSs such as DB2, Informix, Ingres, and Microsoft SQL Server, as well as in other DBMSs from Oracle, Sybase, and Teradata. Histograms are also being used to provide various estimates in other database management applications, such as parallel join load balancing, as described in V. Poosala and Y. Ioannidis, "Estimation of query-result distribution and its application in parallel-join load balancing," Proc. of the 22nd Int. Conf. on Very Large Databases, September, 1996, which is incorporated by reference herein.
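As a concrete illustration of histogram-based selectivity estimation, the following Python sketch builds a simple equi-width histogram over a numeric attribute and estimates the fraction of tuples falling in a value range by assuming values are spread uniformly within each bucket. The bucketing scheme and function names are illustrative assumptions, not the particular histogram classes used by the systems or references cited above.

# Minimal equi-width histogram sketch, assuming numeric attribute values.
def build_histogram(values, num_buckets):
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets or 1.0
    counts = [0] * num_buckets
    for v in values:
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return lo, width, counts

def estimate_selectivity(hist, low, high, total):
    # Estimate the fraction of tuples with low <= value <= high,
    # assuming values are spread uniformly within each bucket.
    lo, width, counts = hist
    est = 0.0
    for i, count in enumerate(counts):
        b_lo = lo + i * width
        b_hi = b_lo + width
        overlap = max(0.0, min(high, b_hi) - max(low, b_lo))
        est += count * (overlap / width)
    return est / total

values = [5, 7, 7, 8, 12, 15, 18, 21, 21, 30]
hist = build_histogram(values, num_buckets=4)
# Prints roughly 0.42; the true selectivity of the range [6, 14] is 0.4,
# illustrating the small error such a compact summary can incur.
print(estimate_selectivity(hist, 6, 14, len(values)))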
Histogram computation and other types of selectivity estimation for a given database relation can be facilitated through the use of a random sample of the relation. For example, uniformly random samples of the data in a database relation can be precomputed and then maintained in the presence of updates to the relation. This avoids the need to compute samples dynamically at usage time, a significant drawback of conventional sampling-based estimation techniques. In most sampling-based estimation techniques, whenever a sample of size n is needed, either the entire relation is scanned to extract the sample, or several random disk blocks are read. In the latter case, the tuples in a disk block may be highly correlated, and hence n disk blocks must be read to obtain a truly random sample. Sampling-based estimation techniques which rely on dynamic sampling are therefore very inefficient. Although precomputed random samples can provide considerable improvements over dynamic sampling, conventional techniques for generating random samples are unable to accommodate the requirements of database management techniques such as histogram computation and other types of selectivity estimation.
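One common way to maintain a precomputed, uniformly random sample of fixed size in the presence of insertions is reservoir sampling. The following Python sketch illustrates that approach under the assumption of an insert-only workload; it is offered only as an illustration and is not the specific maintenance technique contemplated here. Handling deletions or modifications of sampled tuples requires additional machinery that is omitted.

import random

class ReservoirSample:
    # Maintains a uniformly random sample of up to n tuples as tuples are
    # inserted into the relation (classic reservoir sampling; insert-only).
    def __init__(self, n):
        self.n = n          # target sample size
        self.seen = 0       # tuples inserted into the relation so far
        self.sample = []    # current uniformly random sample

    def insert(self, tup):
        self.seen += 1
        if len(self.sample) < self.n:
            self.sample.append(tup)
        else:
            # Replace a random sample element with probability n / seen,
            # which keeps every inserted tuple equally likely to be sampled.
            j = random.randrange(self.seen)
            if j < self.n:
                self.sample[j] = tup

rs = ReservoirSample(n=100)
for tup in range(10_000):      # stand-in for tuples inserted over time
    rs.insert(tup)
print(len(rs.sample))          # 100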