A database is a collection of information. A relational database is a database that is perceived by its users as a collection of tables. Each table arranges items and attributes of items in rows and columns respectively. Each table row corresponds to an item (also referred to as a record or tuple), and each table column corresponds to an attribute of the item (referred to as a field, and attribute type, or field type).
To retrieve information from a database, the user of a database system constructs a query. A query contains one or more operations that specify information to retrieve from the database. The system scans tables in the database to execute the query.
A database system often optimizes a query by arranging the order of query operations. The number of unique values for an attribute is one statistic that the database system uses to optimize queries. When the actual number of unique values is unknown, a database system can use an estimate of the number of unique attribute values. An accurate estimate of the number of unique values for an attribute is useful and methods for optimizing a query involving multiple join operations. The database system often uses the estimate in methods that determine the order in which to join tables. An accurate estimate of the number of unique values for an attribute is also useful in methods that reorder and group items. And estimate computed from a sample is typically used for large tables, rather than an exact count of the unique values, because computing the exact count is too time consuming for large tables.
Several types of estimators for estimating the number of unique values of an attribute in a database have been proposed in the database and statistics literature. The proposed estimators perform well depending on the degree of “skewness” in the data. The term skewness refers the variations in the frequencies of the attribute values. Uniform data, or data with “low skewness”, has nonexistent or small variations.
It would thus be beneficial to have an estimator that provides relatively accurate estimates of the number of unique values of an attribute in a database, regardless of the skewness in the data.