1. Technical Field
The present disclosure relates generally to histograms in database system query processing, and more particularly to rebuilding histograms based on histogram content.
2. Related Art
A database is a collection of stored data that is logically related and that is accessible by one or more users or applications. A popular type of database is the relational database management system (RDBMS), which includes relational tables, also referred to as relations, made up of rows and columns (also referred to as tuples and attributes). Each row represents an occurrence of an entity defined by a table, with an entity being a person, place, thing, or other object about which the table contains information.
In identifying an optimal plan for responding to a database query, knowing the value frequencies of a column greatly helps in choosing the optimal plan for queries that refer to the column. However, keeping the frequencies of all values in the column requires a prohibitively large amount of space. Most database systems therefore support histograms on one or more columns, where a histogram is a set of intervals that group adjacent column values together. Each interval of a histogram consists of a minimum value, a maximum value, and an average frequency. Instead of using the actual frequencies of the values in an interval, a database system typically uses the average frequency for its query planning. The average frequency of an interval works well when the value frequencies are similar to one another. But when significantly different frequencies are grouped into the same interval, the average frequency can mislead the database system into picking a non-optimal plan. In general, the more intervals a histogram has, the less likely it is that values with significantly different frequencies are grouped into the same interval, but the histogram may require more space and time to store, maintain, and use.
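The effect of grouping dissimilar frequencies into one interval can be illustrated with a minimal sketch. It is not taken from any particular database system: the names `Interval`, `build_histogram`, and `estimate_frequency` are hypothetical, the sample column is made up for illustration, and the sketch assumes the average frequency of an interval is the number of rows in the interval divided by the number of distinct values it covers.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Interval:
    # Hypothetical interval layout: minimum value, maximum value, average frequency.
    min_value: int
    max_value: int
    avg_frequency: float


def build_histogram(values, num_intervals):
    """Group adjacent distinct values into roughly num_intervals intervals.

    Assumption: the average frequency is rows-in-interval divided by the
    number of distinct values in the interval.
    """
    counts = Counter(values)
    distinct = sorted(counts)
    size = -(-len(distinct) // num_intervals)  # ceiling division
    intervals = []
    for start in range(0, len(distinct), size):
        group = distinct[start:start + size]
        total_rows = sum(counts[v] for v in group)
        intervals.append(Interval(group[0], group[-1], total_rows / len(group)))
    return intervals


def estimate_frequency(histogram, value):
    """Return the average frequency of the interval containing value, or 0.0."""
    for interval in histogram:
        if interval.min_value <= value <= interval.max_value:
            return interval.avg_frequency
    return 0.0


# Illustrative column: values 1 and 2 are rare, values 3 and 4 are common.
column = [1] * 10 + [2] * 12 + [3] * 1000 + [4] * 990

coarse = build_histogram(column, 1)  # one interval covering values 1..4
fine = build_histogram(column, 2)    # intervals for values 1..2 and 3..4

# The actual frequency of value 1 is 10.
print(estimate_frequency(coarse, 1))  # 503.0
print(estimate_frequency(fine, 1))    # 11.0
```

With a single interval, the estimate for the rare value 1 is about fifty times its actual frequency, which is the kind of error that could lead a planner to a non-optimal plan. Splitting the column into two intervals brings the estimate close to the actual frequency, at the cost of storing one more interval.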