The concept of "relational databases" is well-known. Relational databases are described generally in Date, Database Systems, Addison-Wesley Publishing Co., 1986, which is herein incorporated by reference. Relational databases usually include a plurality of tables that are searched ("queried") using a well-known query language, such as the SQL or SQL/MP query languages.
Very large databases can include many large tables. Each of these tables includes a plurality of rows and columns. Because it can become quite time-consuming for the system to perform queries on a large database having large tables, the process of "query optimization" becomes quite important in processing queries for large databases. Query optimization is usually performed, prior to execution, by the software that will execute the query.
Query optimization in relational databases relies heavily not only on table size, i.e., row count, but also on column cardinalities, i.e., the total number of distinct values in a column. Some conventional systems employ histograms in the optimization process, determining cardinalities for percentile or other partitions. Other conventional systems rely on cost functions derived primarily from such estimates.
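For concreteness, the quantity at issue — the number of distinct values per column — can be computed exactly as follows. This is a minimal sketch, assuming a small in-memory table represented as a list of tuples; the function name and table are hypothetical, and real optimizers cannot afford this kind of full pass with a set per column on very large tables.

```python
def column_cardinalities(rows):
    """Return the exact number of distinct values in each column."""
    if not rows:
        return []
    distinct = [set() for _ in rows[0]]
    for row in rows:
        for col, value in enumerate(row):
            distinct[col].add(value)
    return [len(s) for s in distinct]

table = [
    ("alice", "NY", 30),
    ("bob",   "NY", 30),
    ("carol", "CA", 41),
]
print(column_cardinalities(table))  # [3, 2, 2]
```

The memory cost is proportional to the number of distinct values in every column simultaneously, which is exactly what makes this approach untenable at scale and motivates the estimation techniques discussed below.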
Sorting once per column to obtain exact cardinalities is a rather costly approach; on the other hand, mathematical statistics has not succeeded in applying sampling theory to this problem with its usual efficacy.
Probabilistic counting methods can achieve arbitrary accuracy for multi-column histogram/UEC estimates in a single parallel pass with minimal memory requirements. A prerequisite is a hash function producing uniformly distributed output on all input data sets, where uniform is defined as "yielding a uniform statistical distribution." Conventional systems have not employed a truly uniform hashing function. Thus, probabilistic counting has not yielded sufficiently accurate results based on rigorous tests.
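The uniformity prerequisite can be checked empirically, if not proven. The sketch below is an informal illustration, not a rigorous test: it buckets the hashes of n distinct keys and compares bucket counts against the flat profile a uniform hash should produce, using SHA-256 as a stand-in for whatever hash a given system employs.

```python
import hashlib

def bucket_counts(keys, num_buckets):
    """Bucket each key by its hash and count occupancy per bucket."""
    counts = [0] * num_buckets
    for key in keys:
        digest = hashlib.sha256(str(key).encode()).digest()
        value = int.from_bytes(digest[:8], "big")
        counts[value % num_buckets] += 1
    return counts

counts = bucket_counts(range(10_000), 16)
expected = 10_000 / 16  # 625 per bucket if the hash is uniform
print(max(abs(c - expected) for c in counts))
```

A hash that fails even this coarse check on realistic key sets is certain to bias the probabilistic counting estimates described next.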
For example, a technique described by Flajolet, which involves some non-trivial analytic number theory, is based on the intuition that a uniform hash onto s-bit integers should produce values of the form . . . 10^k with probability approximately 2^-k, for k < s (where 0^k represents a string of k consecutive 0's). The largest such k observed can be used to estimate log2(n). (See Flajolet et al., "Probabilistic Counting Algorithms for Database Applications," Journal of Computer and System Sciences, Vol. 31, No. 2, October 1985, pp. 182-209; Flajolet et al., "Probabilistic Counting," Proc. 24th Annual Symposium on the Foundations of Computer Science (IEEE), November 1983, pp. 76-82; and Flajolet, "On Adaptive Sampling," Computing 34, pp. 391-408, each of which is herein incorporated by reference.)
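The intuition above can be made concrete with a small sketch (the function name is arbitrary): k is simply the position of the rightmost 1-bit, i.e., the number of trailing zeros, and among the s-bit integers roughly half end in . . . 1 (k = 0), a quarter in . . . 10 (k = 1), and so on.

```python
from collections import Counter

def rho(x):
    """Position of the rightmost 1-bit of x (number of trailing 0's)."""
    if x == 0:
        return None
    k = 0
    while x & 1 == 0:
        x >>= 1
        k += 1
    return k

print(rho(0b10100))  # 2
print(rho(0b1))      # 0

# Over all 12-bit integers the pattern frequencies halve with each k.
freq = Counter(rho(x) for x in range(1, 2**12))
print(freq[0], freq[1], freq[2])  # 2048 1024 512
```

Because these frequencies halve geometrically, the largest k seen among n distinct uniform hash values grows like log2(n), which is the quantity the estimator recovers.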
Each successive hash value is accordingly examined to determine the rightmost 1-bit, whose position is recorded by OR'ing it into a bitmap of size m=2^s. At the conclusion of the run, the bitmap will have the form . . . 01^R, and the expected value of R is proven to be

E(R) = log2(φn),

where φ = 0.77351 is analytically determined, with a standard deviation σ_R = 1.12 (bits).
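The single-bitmap procedure can be sketched as follows. This is an illustrative implementation under stated assumptions, not the patented method: SHA-256 stands in for a uniform hash, and inverting E(R) = log2(φn) gives the estimate n ≈ 2^R/φ. Note that duplicates cannot perturb the result, since OR'ing the same bit twice is idempotent — this is what allows distinct-value counting in a single pass.

```python
import hashlib

PHI = 0.77351  # analytically determined correction factor

def rho(x):
    """Position of the rightmost 1-bit of x (number of trailing 0's)."""
    k = 0
    while x & 1 == 0:
        x >>= 1
        k += 1
    return k

def fm_estimate(values):
    """Single-bitmap Flajolet-Martin-style cardinality estimate,
    with sha256 as a stand-in for a uniform hash."""
    bitmap = 0
    for v in values:
        digest = hashlib.sha256(str(v).encode()).digest()
        h = int.from_bytes(digest[:4], "big")
        if h:
            bitmap |= 1 << rho(h)
    # After the run the bitmap has the form ...01^R; R is the index
    # of the lowest 0-bit.
    R = 0
    while bitmap & (1 << R):
        R += 1
    return 2**R / PHI

est = fm_estimate(range(100_000))
print(est)
```

With σ_R = 1.12 bits, a single bitmap is only accurate to within a small power of two, which is exactly the deficiency the next paragraph addresses.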
To reduce the expected error (which would otherwise exceed a factor of 2), an appeal is made to the central limit theorem which, as the basis of sampling theory, asserts that for estimators based on summation, sampling variance decreases by a factor of N as the sample size increases by the same factor, and that sampling distributions tend toward normality. (Without such guarantees, the association of a probability of error with a standard deviation would be practically meaningless.)
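One standard way to exploit the central limit theorem here is stochastic averaging in the style of Flajolet's PCSA: a few hash bits select one of m bitmaps, the remaining bits feed it, and the m values of R are averaged. The sketch below illustrates that idea under the same SHA-256 stand-in assumption; the function names and the choice m = 64 are illustrative, not taken from the source.

```python
import hashlib

PHI = 0.77351  # analytically determined correction factor

def rho(x):
    """Position of the rightmost 1-bit of x (number of trailing 0's)."""
    k = 0
    while x & 1 == 0:
        x >>= 1
        k += 1
    return k

def pcsa_estimate(values, m=64):
    """Stochastic averaging over m bitmaps: averaging R over m
    independent sketches cuts the variance by a factor of m, as the
    central limit theorem predicts."""
    bitmaps = [0] * m
    for v in values:
        digest = hashlib.sha256(str(v).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        bucket, rest = h % m, h // m
        if rest:
            bitmaps[bucket] |= 1 << rho(rest)
    total_R = 0
    for bm in bitmaps:
        R = 0
        while bm & (1 << R):
            R += 1
        total_R += R
    # Invert E(R) = log2(phi * n/m) using the mean of the m values of R.
    return (m / PHI) * 2 ** (total_R / m)

est = pcsa_estimate(range(100_000))
print(est)
```

The relative standard error shrinks roughly as 1/sqrt(m), so m = 64 bitmaps brings the factor-of-2 single-bitmap error down to the order of ten percent, at a memory cost that is still only m small bitmaps.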