Statistics are frequently accumulated to describe data in a database, to facilitate accesses made to the data. For example, when a query seeks records meeting multiple selection criteria, the process of assembling the results may be made substantially more efficient by applying the selection criteria in an appropriate order. Ordering is important because the process of scanning a database for matching records is time consuming.
For example, consider a database table (otherwise known as a relation) including columns (otherwise known as attributes) identifying vehicle owners by name and address, and the make, model, model year and other information about their vehicles. An exemplary query into such a relation may seek rows (otherwise known as tuples) identifying the following attribute values: surname “Smith”, city name “New York”, and vehicle manufacturer “Packard”; that is, seeking New Yorkers named Smith who own Packard vehicles. This query involves forming the intersection or “AND” of the results of three selection criteria, “Smith”, “New York”, and “Packard”. However, two of these three criteria are likely to produce a large number of intermediate results—specifically, the surname “Smith” is popular in the United States, and a large number of individuals reside in “New York”. However, “Packard” is a relatively rare vehicle manufacturer. Therefore, this query would be most efficiently performed by selecting those rows having the value “Packard” for the vehicle manufacturer attribute, which will be a relatively small set, and then identifying those rows that also have the surname “Smith” and city “New York”. A far less efficient approach would be to identify all persons with the surname “Smith” or addresses in the city “New York”, and then selecting from either large set those rows meeting the remaining selection criteria.
To attempt to optimize query processing, modern database software often generates statistics prior to executing a query, to estimate the likely size of the solution sets that will be generated from each selection criterion in a query. One way of forming these statistics is from indexes such as the encoded vector index (EVI), disclosed U.S. Pat. No. 5,706,495, Chadha et al., Jan. 6, 1998, “Encoded-Vector Indices For Decision Support and Warehousing”, which is incorporated by reference. Other forms of indexes are used in other circumstances, as is found to be efficient for the particular type of data in use. Typically, statistics are generated using an index and the specific selection criterion of the query being processed. For example, in the above case a statistic would be generated indicating the approximate number of rows having a surname of “Smith”, a city of “New York”, and a manufacturer of “Packard”. These statistics would ideally show that “Packard” is the most selective criterion and should be used first.
After collecting the needed statistics for a given query, those statistics may be cached for later re-use. For example, subsequent queries reusing the selection criterion of the city of “New York”, could re-use the statistic previously generated for that same criterion. However, in order to re-use statistics, those statistics must be validated for the criterion with which they are to be used. The number of rows having a city name of “Brainerd” might be substantially less than the number having a city name of “New York”. Thus, statistics generated for a first selection criterion on a given attribute, cannot be reused for a second selection criterion but rather must be re-validated by re-accessing the associated index. Unfortunately, the time required to access indexes to generate statistics can be a substantial fraction of total query optimization time; thus, re-validation of statistics represents a substantial loss of efficiency in database processing.
It will be appreciated that indexes may be single-dimensional or multi-dimensional. Thus, in the above example, there may be an index formed over both the surname and city attributes, in which case the query optimizer can generate an accurate statistic for the number of rows that will meet the criteria for both the surname and city set forth in a query. In the absence of a multi-dimensional index of this kind, the query optimizer will need to estimate, from the number of records with the surname “Smith”, and the number of records with the city “New York”, the number of records that will meet both criteria, which is typically done by assuming that the same proportion of persons in “New York” are named “Smith” as there are persons named “Smith” in the database as a whole. The resulting statistic is only as accurate as the underlying assumption that the distribution of persons named “Smith” is roughly similar for all cities. Another way to state this assumption, is that there is no correlation between the surname and city attributes, or that these attributes are “independent”. Unfortunately, an assumption of independence of attributes is often incorrect. For example, surname and city are not likely to be independent (because surnames are often reflective of ethnic heritage, and cities have varied ethnic backgrounds). When there is a correlation between attributes of a database, statistics generated using an assumption of independence of those attributes will not be accurate. In such cases, a multi-dimensional index will prove useful, in that statistics formed using a multi-dimensional index for those attributes, will be more accurate than those formed using separate one-dimensional indexes.
Highly correlated attributes raise particular difficulties. Specifically, records in the exemplary database discussed above, may include a vehicle manufacturer and vehicle model attribute. These fields will be highly correlated, in that vehicle model names are typically used by a single manufacturer. For example, the model name “Escort” has been used by the manufacturer “Ford” whereas the model name “Camry” has been used by the manufacturer “Toyota”. Such strong correlation will lead to dramatic over- or under-estimation of statistics by a query optimizer. A query optimizer generating statistics for a query seeking vehicles manufactured by “Toyota”, and named “Camry”, will dramatically underestimate the number of results if those attributes are assumed to be uncorrelated. A query optimizer generating statistics for a query seeking vehicles manufactured by “Toyota” or vehicles named “Camry”, will dramatically overestimate the number of results if those attributes are assumed to be uncorrelated. Further information on these issues can be found in U.S. Pat. No. 5,995,957, Beavin et al., Nov. 30, 1999, “Query Optimization Through the Use of Multi-column Statistics to Avoid the Problems of Column Correlation”, which is incorporated by reference.
While the above demonstrates the usefulness of multi-dimensional indexes, it is often not practical to form multi-dimensional indexes for every possible combination of attributes that might be used in a query, because the resources consumed in generating and storing the indexes will exceed the efficiencies achieved through their use. For this reason, in the past methods have been used to identify attribute sets that are likely candidates for inclusion in a multi-dimensional index. For example, one known method is to monitor the queries that use multiple attributes from a single relation, to identify attribute pairs that are frequently used together, so that multi-dimensional indexes may be formed on these attributes. For further details on this method, see U.S. Pat. No. 5,899,986, Ziauddin, May 4, 1999, “Methods for Collecting Query Workload Based Statistics on Column Groups Identified by RDBMS Optimizer”, which is incorporated herein by reference. Unfortunately, this method may form multi-dimensional indexes on attributes that are independent merely because they are often used together in queries, while not forming indexes on attribute pairs that have high correlation simply because they are used infrequently together. Thus, this method has limited value in increasing the efficiency of queries.
Accordingly, new ways to use statistics and to identify appropriate attributes for multi-dimensional indexes, are needed in order to continue to provide significant improvements in query performance; otherwise, database users will be hampered in their ability to maximize intelligent information retrieval.