Databases store large amounts of data in the form of "relations," which may be viewed as tables of data items. The size of a given relation may be on the order of several megabytes to several gigabytes or more, depending on the application. A relation stores the data items, which are also referred to as records or "tuples," in a predetermined format. For example, a relation in a database of bank customers may be structured to hold categories of information such as customer name, account number, account balances and the like. Each category is referred to as an "attribute" of the relation. The particular values entered in the attribute of a relation for a given customer constitute tuples of that relation. In many applications, relations are continuously updated as tuples are inserted, deleted or modified during normal operation of the database. A database management system (DBMS) is used to manage these updates as well as other database operations, such as query processing, evaluation and optimization.
It is often desirable for a given DBMS to be able to estimate the characteristics of database relations. For example, query optimizers of most relational database systems decide on the most efficient access plan for a given query based on estimates of the intermediate result sizes of queries and the costs of executing various plans, as described in P. Selinger, M. Astrahan, D. Chamberlin, R. Lorie and T. Price, "Access path selection in a relational database management system," Proc. of ACM SIGMOD Conf., pp. 23-34, 1979. Also, quick estimates of query result characteristics can be useful to a user during query design and in providing approximate answers to queries, as described in, for example, J. M. Hellerstein, P. J. Haas and H. J. Wang, "Online aggregation," Proc. of ACM SIGMOD Conf., 1997.
Most DBMSs therefore maintain a variety of statistics on the contents of database relations in order to perform various estimations. These statistics usually take the form of approximations of the distributions of particular data values in attributes of the relations. Since the statistics are used to estimate the values of several parameters of interest to optimizers, the validity of the optimizer's decisions may be affected by the quality of these approximations, as described in S. Christodoulakis, "Implications of certain assumptions in database performance evaluation," ACM TODS, 9(2): 163-186, June 1984. In fact, it has been established in Y. Ioannadis and S. Christodoulakis, "On the propagation of errors in the size of join results," Proc. of ACM SIGMOD Conf., pp. 268-277, 1991, that selectivity estimation errors propagate exponentially with the number of operators in the query and hence will ultimately result in a poor-quality plan being chosen by the optimizer. This is becoming very important in the context of increasingly complex queries, such as decision support and data analysis queries.
A commonly-used technique for selectivity estimation involves maintaining histograms on the frequency distribution of an attribute. A histogram groups attribute values into subsets, referred to as "buckets," and approximates true attribute values and their frequency distributions based on summary statistics maintained in each bucket. Since histograms are just summaries of actual, much larger distributions, any estimations based on histograms may still incur errors. Nonetheless, for most real-world databases, there exist histograms that produce acceptable low-error estimates while occupying reasonably small space, which may be on the order of only hundreds of bytes. Histograms are used in conventional DBMSs such as DB2, Informix, Ingres, and Microsoft SQL Server, as well as in other DBMSs from Oracle, Sybase, and Teradata. Histograms are also being used to provide various estimates in other database management applications, such as parallel join load balancing, as described in V. Poosala and Y. Ioannidis, "Estimation of query-result distribution and its application in parallel-join load balancing," Proc. of the 22.sup.nd Int. Conf. on Very Large Databases, September, 1996, which is incorporated by reference herein.
Histograms are usually precomputed on the underlying data and used without much additional overhead inside the query optimizer. A significant drawback of using precomputed histograms is that they may get outdated when the data in the database is modified, and hence introduce significant errors in estimations. On the other hand, it is clearly impractical to compute histograms after every update to the database. Fortunately, it is often not necessary to keep the histograms perfectly up-to-date at all times, because they are used only to provide reasonably accurate estimates. Instead, one needs appropriate techniques for propagating updates to histograms, so that the database performance is not affected. Despite the popularity of histograms, many important issues related to their maintenance have not been adequately addressed.
A typical conventional histogram maintenance approach, which is followed in a substantial number of commercial database systems, involves recomputing histograms periodically (e.g., every night). This periodic recomputation approach has at least two serious disadvantages. First, any significant updates to the database between two successive recomputations could cause poor estimations inside a query optimizer. Second, since the old histograms are discarded and new histograms are recomputed, the recomputation phase for the entire database is computationally very intensive. Other conventional histogram maintenance techniques focus on proper bucketizations of values in order to enhance the accuracy of the histograms, and generally assume that the database is not being modified. Although efficient sampling-based methods are available to construct various histograms, these methods also generally ignore the problem of histogram maintenance. With the increasing use of histograms in multiple components of a DBMS, some of which (e.g., user interfaces) may have even higher accuracy requirements on estimates than query optimizers, it has become very important to identify more efficient techniques for maintaining histograms.