Most relational database management systems (RDBMSs), as part of their functionality, collect statistics about tables, either automatically or under instruction from database administrators (DBAs). These statistics are used by the query optimizer to select the most appropriate execution plan when these tables are being queried. Given the cost based nature of modern optimizers, up to date table statistics are vital for the selection of the most appropriate plan.
One of the most important statistics gathered are data distributions. Data distributions describe to the optimizer the distribution of rows across distinct values of a specific column. When data distributions are not available, the optimizer typically assumes uniform distribution of distinct values across all rows. However data is seldom uniformly distributed, and when query predicates contain specific constant values for a given column, optimizers can select more appropriate query plans if the knowledge is available that a particular value appears in a larger or smaller proportion of the rows.
Data distributions are generally represented as a list of “bins”, each bin tracking a fixed fraction of rows in a table. The number of bins and the amount of rows tracked depends on the resolution used to create the distributions. When arranged contiguously, the bins form a histogram of the distribution of the data. Collecting distributions can be a very costly operation, as it implies a whole scan of the table and subsequent sorting of data values, as a result of which, DBAs are challenged to strike the right balance between collecting statistics often enough for the optimizer to select the correct plans at all times, and not so often as to overload the system. For example, for very large tables, such as those with a terabyte or more, histogram gathering may last several hours.
Once the distributions are acquired, depending on the pattern of Data Manipulation Language (DML) operations altering the contents of the tables, a substantial amount of DML operations may be needed before the difference in the actual data and the data distribution known to the optimizer is such that the optimizer chooses a suboptimal execution plan. Assuming a random distribution of values being modified, each individual distribution bin will have a slightly skewed distribution of data as compared to the actual data. However, the differences are likely not significant, and the distributions will convey a good enough representation of the data presently in the table for the optimizer to still select an appropriate execution plan. Thus, for a random distribution of values being modified, DBA's can gauge the type of interval between statistics collections that would affect the optimizer's plan selection and set the statistics collections seldom enough to not burden the system.
However, when the DML operation distribution is not random, i.e., the pattern of insertion and deletion of data is predictable, the differences in the distributions will affect the optimizer's plan selections. Example scenarios include journals or queues, where typically the data grow with time, and in the case of queues, the data is removed when expired. These tables normally have columns with ever increasing individual ID's and/or timestamps. Other non-random data manipulation scenarios are possible, such as data being inserted at either end of the table, or data being deleted from the higher end and inserted at the lower end of the table. In these circumstances, distribution information for the individual ID's or timestamps columns is almost certainly correct for a specific fraction of the data. When asked to select an execution plan within this fraction of the data, the optimizer will choose an appropriate plan.
However, the optimizer has no knowledge of any data inserted after the last time statistics were collected, and, conversely, it may incorrectly assume that older data still exists. Over time, new data values may be significantly higher than the highest distribution bin and old data values may carve well into the lower distribution bins. For any query that tries to access data for which there is no distribution information, or for values that fall within the stale distribution bins, the optimizer may select inefficient plans.