In relational databases, a histogram provides important data distribution statistics that a query optimizer uses to estimate the selectivity of a query predicate or the cardinality of a join. The selectivity indicates whether a portion of the query is more efficiently processed by using an index or by iteratively scanning rows of the database. The higher the selectivity, the better it is for the query optimizer to use the index to process that portion of the query; the lower the selectivity, the better it is for the query optimizer to scan rows of the database iteratively. The data distribution captured by a histogram gives the query optimizer a mechanism for estimating selectivity, or the cardinality of a join, from the contents of each histogram bucket.
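As an illustration of how bucket counts can yield a selectivity estimate, the following is a minimal sketch of a one-dimensional equal-width histogram. The function names (`build_histogram`, `estimate_selectivity`), the equal-width bucketing, and the assumption that values are uniformly spread within each bucket are choices made for this example only; they are not taken from any particular database system.

```python
def build_histogram(values, num_buckets, lo, hi):
    """Count values into num_buckets equal-width buckets over [lo, hi)."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that a value equal to hi falls into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts, width


def estimate_selectivity(counts, width, lo, a, b):
    """Estimate the fraction of rows satisfying a <= value < b.

    Assumes (for illustration) that values are uniform within a bucket,
    so a partially covered bucket contributes proportionally.
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    estimated_rows = 0.0
    for i, count in enumerate(counts):
        bucket_lo = lo + i * width
        bucket_hi = bucket_lo + width
        overlap = max(0.0, min(b, bucket_hi) - max(a, bucket_lo))
        estimated_rows += count * overlap / width
    return estimated_rows / total


# Hypothetical column holding the integers 0..99, summarized in 10 buckets.
counts, width = build_histogram(range(100), num_buckets=10, lo=0, hi=100)
selectivity = estimate_selectivity(counts, width, lo=0, a=20, b=45)
```

An optimizer could compare such an estimate against a cost threshold when choosing between an index lookup and a row scan; the threshold itself is outside the scope of this sketch.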
However, traditional techniques use a low-dimension histogram approach that usually does not work well for high-dimensional databases. For example, a traditional histogram built for 5 dimensions of data (e.g., 5 columns in a table), where each dimension has 128 (2^7) buckets, has 2^7*2^7*2^7*2^7*2^7 = 2^35, or 32 G, buckets, requiring 128 GB (gigabytes) of memory and/or storage. Even though some compression techniques can alleviate this memory/storage utilization problem, the overall space and computation costs required for processing a query at run time are still often unacceptable for even the most advanced database systems. Moreover, very few database tables are limited to just 5 columns; an average commercial database table may have tens and, in some situations, hundreds of columns (dimensions). So the example presented grossly understates the issue of memory, storage, and processing efficiency, which is orders of magnitude larger in the average database deployment scenario associated with the industry.
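The arithmetic in this example can be checked with a short calculation. The sketch below assumes 4 bytes per bucket, a figure inferred from the ratio of 128 GB to 32 G buckets rather than stated in the text; the function name `dense_histogram_cost` is likewise hypothetical.

```python
def dense_histogram_cost(dimensions, buckets_per_dim, bytes_per_bucket=4):
    """Return (total buckets, total bytes) for a dense multi-dimensional histogram.

    bytes_per_bucket=4 is an assumption inferred from the 128 GB figure.
    """
    total_buckets = buckets_per_dim ** dimensions
    return total_buckets, total_buckets * bytes_per_bucket


GIB = 2 ** 30  # binary giga, matching the 2^35 = 32 G figure

buckets, size_bytes = dense_histogram_cost(dimensions=5, buckets_per_dim=128)
print(buckets // GIB, "G buckets,", size_bytes // GIB, "GB")

# Growth with dimensionality under the same assumptions.
for d in (5, 6, 10):
    _, size = dense_histogram_cost(d, 128)
    print(d, "dimensions:", size // GIB, "GB")
```

Each additional dimension multiplies the bucket count by 128, which is why a dense histogram becomes impractical long before a table reaches tens of columns.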
In fact, in both the academic literature and the industrial literature, there are very few, if any, published research projects or documented industry practices that address how to handle query optimization using high-dimension histograms, due to the technical obstacles that such an approach presents.
Therefore, there is a need to provide the benefits of selectivity and cardinality estimation to query optimizers when processing queries that can use high-dimension histograms, without the technical problems that have heretofore remained unsolved or been purposefully avoided due to the perceived complexity of the issue.