It is often desirable to obtain characteristics of the distribution of data stored in databases. For instance, useful characteristics of a sales database include the number of types of products or services that sold for greater than a particular amount, the number of distinct product or service types that account for a particular percentage of sales, and particular characteristics of subsets/supersets of sales information. One known method for determining such information would be to sort through the database each time such characteristic information is requested. However, databases are often very large. Accordingly, since all database entries would need to be accessed to determine the needed characteristics, such systems would disadvantageously be slow to process the information requests.
As a consequence, more recent database systems have predicted such characteristic information based on high-biased histograms maintained for the respective databases. A high-biased histogram is a representation, such as a graph or list, of a particular number of the most frequently occurring categories of items in a data set. The most frequently occurring categories of items are advantageously determined by a count associated with each item category, wherein the larger the count, the more frequently that category of items occurs. For example, a high-biased histogram can be used to represent a list of the ten top selling types of products for a business and the amount of sales for each of such products. High-biased histograms and their advantages are described in Y. E. Ioannidis and S. Christodoulakis, "Optimal Histograms for limiting Worst-Case Error Propagation in the Size of Join Results," ACM Trans. Database Sys., vol. 18, No. 4, pp. 709-748 (December 1993). Commercially available database systems, such as Dbase II®, have the ability to generate or report high-biased histograms.
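By way of illustration only, a high-biased histogram of the kind described above can be sketched in a few lines of Python. The function name `high_biased_histogram`, the parameter `top`, and the sample sales data are assumptions introduced for this example and do not come from the cited reference or from any particular database product.

```python
from collections import Counter


def high_biased_histogram(items, top):
    """Return the `top` most frequent categories with their counts.

    Hypothetical helper: counts every category, then keeps only the
    categories with the largest counts, in descending order of count.
    """
    return Counter(items).most_common(top)


# Illustrative data: one entry per sale, labelled by product type.
sales = ["pen", "ink", "pen", "pad", "pen", "ink"]
histogram = high_biased_histogram(sales, 2)
print(histogram)
```

Note that this sketch still counts every category before discarding the infrequent ones, so it illustrates what the histogram contains rather than a memory-efficient way of building it.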
The frequency moments of a data set represent important demographic information about the data, and are important features in the context of database applications. In particular, the frequency moment F_0 is the number of distinct elements appearing in a sequence, the frequency moment F_1 (= m) is the length of the sequence, and the frequency moment F_2 is the repeat rate or Gini's index of homogeneity needed in order to compute the surprise index of the sequence (see, e.g., J. Good, Surprise indexes and P-values, J. Statistical Computation and Simulation 32 (1989), 90-92). As described in the reference P. J. Haas, J. F. Naughton, S. Seshadri, and L. Stokes, Sampling-Based Estimation of the Number of Distinct Values of an Attribute, Proc. of the 21st VLDB Conf., 1995, 311-322, the contents and disclosure of which are incorporated by reference as if fully set forth herein, virtually all query optimization methods in relational and object-relational database systems require a mechanism for assessing the number of distinct values of an attribute in a relation, i.e., the frequency moment F_0 for the sequence consisting of the relation attribute.
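The moments named above can be made concrete with a small exact computation. The sketch below assumes the standard definition F_k = sum over distinct values i of (m_i)^k, where m_i is the number of occurrences of value i; the function name `frequency_moment` and the sample sequence are hypothetical and serve only as an illustration.

```python
from collections import Counter


def frequency_moment(sequence, k):
    """Exact k-th frequency moment: sum of m_i ** k over distinct values.

    Assumes the standard definition, where m_i is the number of times
    value i occurs in `sequence`. Values that never occur contribute
    nothing, so F_0 is the number of distinct values.
    """
    counts = Counter(sequence)
    return sum(m ** k for m in counts.values())


# Illustrative sequence: "a" occurs 3 times, "b" twice, "c" once.
seq = ["a", "b", "a", "c", "a", "b"]
f0 = frequency_moment(seq, 0)  # number of distinct elements
f1 = frequency_moment(seq, 1)  # length of the sequence
f2 = frequency_moment(seq, 2)  # repeat rate (sum of squared counts)
```

For this sequence the counts are 3, 2, and 1, so F_0 = 3, F_1 = 6, and F_2 = 9 + 4 + 1 = 14.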
An important attribute is the second frequency moment of a data set, which represents how far from uniform the frequency distribution of the items in the data set is, and which is a useful characteristic for guiding the computation in several applications of modern database systems. Furthermore, the frequency moments F_k for k ≥ 2 indicate the degree of skew of the data, which is a major consideration in many parallel database applications. Thus, for example, as discussed in D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri, Practical skew handling in parallel joins, Proc. 18th Int'l. Conf. On Very Large Data Bases, pp. 27, 1992, the contents and disclosure of which are incorporated by reference as if fully set forth herein, the degree of the skew may determine the selection of algorithms for data partitioning. In particular, as discussed in Y. E. Ioannidis and V. Poosala, Balancing Histogram Optimality and Practicality for Query Result Size Estimation, Proc. ACM-SIGMOD 1995, the contents and disclosure of which are incorporated by reference as if fully set forth herein, the frequency moment F_2 may be used for error estimation in the context of estimating query result sizes and access plan costs. This method is based on selecting appropriate histograms for a small number of values to approximate the frequency distribution of values in the attributes of relations. The selection involves joining a relation with itself (the frequency moment F_2 is the output size of such a join).
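The remark that F_2 equals the output size of a self-join can be checked on a small example. The following sketch is illustrative only: the function names `self_join_size` and `second_moment` and the sample attribute column are assumptions made for this example, and the quadratic-time join is used for clarity rather than efficiency.

```python
from collections import Counter


def self_join_size(values):
    """Count the rows produced by joining a column with itself on equality.

    Each value occurring m times pairs with itself m * m times.
    """
    return sum(1 for left in values for right in values if left == right)


def second_moment(values):
    """Exact F_2: the sum of squared occurrence counts."""
    return sum(m * m for m in Counter(values).values())


# Illustrative attribute column of a relation.
attribute = ["x", "y", "x", "z", "x", "y"]
join_rows = self_join_size(attribute)
f2 = second_moment(attribute)
```

Both quantities evaluate to 14 here, since the counts 3, 2, and 1 contribute 9, 4, and 1 matching pairs respectively.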
The above-mentioned Haas et al. reference considers sampling-based algorithms for estimating the frequency moment F_0, and proposes a hybrid approach in which the algorithm is selected based on the degree of skew of the data, measured essentially by the frequency moment F_2.
Since skew information plays an important role in many applications, it would be beneficial to maintain estimates for frequency moments and, most notably, for the frequency moment F_2. For efficiency purposes, the estimates for frequency moments of a relation should preferably be computed and updated as the records of the relation are inserted into the database. The general approach of maintaining views of the data, such as distribution statistics, has been well studied as the problem of incremental view maintenance. Note that conventionally, it is straightforward to maintain the (exact) frequency moments by maintaining a full histogram on the data, i.e., maintaining a counter m_i for each data value i ∈ {1, 2, ..., n}, which requires memory of size at least on the order of n, i.e., Ω(n), where n is the number of possible values. For very large datasets this may be impractical, as large memory requirements would require storing the data structures in external memory, which would imply an expensive overhead in access time and update time. Thus, it is important that the memory used for computing and maintaining the estimates be limited. The restriction on memory size is further emphasized by the observation that sometimes incoming data records will belong to different relations that are stored in the database, each relation requiring its own separate data structure. Thus, the problem of computing or estimating the frequency moments in one pass under memory constraints arises naturally in the study of databases.
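The conventional full-histogram approach described above can be sketched as follows, to make its memory cost visible. The class name `FullHistogram` and its fields are hypothetical, introduced only for this illustration. The update rule relies on the identity (m + 1)^2 - m^2 = 2m + 1, so each insertion adjusts the exact F_2 in constant time, but one counter is kept per distinct value seen.

```python
from collections import defaultdict


class FullHistogram:
    """Exact F_0, F_1, F_2 maintained incrementally (hypothetical sketch).

    Memory grows with the number of distinct values, which is the
    drawback the passage above describes.
    """

    def __init__(self):
        self.counts = defaultdict(int)  # counter m_i per data value i
        self.f0 = 0  # number of distinct values seen
        self.f1 = 0  # number of records seen
        self.f2 = 0  # sum of squared counters

    def insert(self, value):
        m = self.counts[value]
        if m == 0:
            self.f0 += 1
        self.f1 += 1
        # (m + 1)**2 - m**2 == 2*m + 1
        self.f2 += 2 * m + 1
        self.counts[value] = m + 1


# Illustrative stream of inserted records.
hist = FullHistogram()
for record in [3, 1, 3, 2, 3, 1]:
    hist.insert(record)
```

After these six insertions the structure holds three counters and reports F_0 = 3, F_1 = 6, and F_2 = 14. With n possible values, the dictionary may grow to n entries, which is the memory requirement that motivates estimation under a memory constraint.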
There are several known randomized algorithms that approximate the frequency moments F_0 and F_1 using limited memory. However, there currently exists no randomized algorithm for estimating the frequency moment F_k, where k > 1, that can guarantee high accuracy with high probability using a small amount of memory space.
It would thus be highly desirable to obtain tight bounds on the memory required to approximate the frequency moments F_k, where k > 1, and thereby to reduce that memory requirement, without the need for reading the dataset.