Histograms are used in databases as lossy, compressed representations of the statistics of the data resident in a table. The statistics obtained from the histograms are used for query optimization and, in some cases, for approximate query processing.
Database modules that perform query optimization often rely on estimates of query result sizes. For example, a query optimizer selects the most efficient access plan for a query based on estimated costs, which can in turn be based on estimates of intermediate result sizes. Sophisticated user interfaces also present approximate result sizes as feedback to a user before a query is actually executed. Such feedback helps to detect errors in queries or misconceptions about the database. However, the statistics behind these estimates merely approximate the distribution of data values in the attributes of the relations, and they frequently rest on assumptions, such as a uniform distribution of attribute values, that do not hold in practice. The statistics can therefore present an inaccurate picture of the actual contents of the database.
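As a rough illustration of how the uniformity assumption can mislead a result-size estimate, the following minimal sketch builds an equi-width histogram and estimates a range predicate by assuming values are spread evenly within each bucket. The function names, the two-bucket layout, and the sample data are all hypothetical and chosen only for illustration; real optimizers may use other histogram types and estimation rules.

```python
def build_equi_width_histogram(values, lo, hi, num_buckets):
    """Count values into num_buckets equal-width buckets over [lo, hi)."""
    width = (hi - lo) / num_buckets
    counts = [0] * num_buckets
    for v in values:
        # Clamp so that a value equal to hi falls into the last bucket.
        idx = min(int((v - lo) / width), num_buckets - 1)
        counts[idx] += 1
    return counts, width


def estimate_range_count(counts, lo, width, a, b):
    """Estimate how many values satisfy a <= v < b.

    Assumes values are uniformly distributed inside each bucket, so a
    bucket contributes its count scaled by the fraction of its width
    that overlaps the query range.
    """
    total = 0.0
    for i, count in enumerate(counts):
        bucket_lo = lo + i * width
        bucket_hi = bucket_lo + width
        overlap = max(0.0, min(b, bucket_hi) - max(a, bucket_lo))
        total += count * overlap / width
    return total


# Hypothetical skewed attribute: 90 rows hold the value 1, 10 rows hold 50.
values = [1] * 90 + [50] * 10
counts, width = build_equi_width_histogram(values, lo=0, hi=100, num_buckets=2)

estimated = estimate_range_count(counts, 0, width, a=0, b=10)
actual = sum(1 for v in values if 0 <= v < 10)

print(counts)     # [90, 10]
print(estimated)  # 18.0
print(actual)     # 90
```

In this sketch the histogram reports 90 rows in the first bucket but cannot tell that they all share one value, so the estimate for the range [0, 10) is 18 rows while the true count is 90.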