Statistics are frequently accumulated to describe data in a database, to facilitate accesses made to the data. Statistics are typically used by an optimizer to create an access plan for accessing the data. For example, when a query seeks records meeting multiple selection criteria, the optimizer may determine that the results may assembled most efficiently by applying the selection criteria in an appropriate order. Ordering is important because the process of scanning a database for matching records is time consuming.
Among the factors considered in formulating an access plan, are the number of swaps between memory and disk that are required to satisfy a query for data. To estimate this typically requires a prediction of memory buffer usage. For example, consider a database table (otherwise known as a relation) including columns (otherwise known as attributes) identifying vehicle owners by name and city, and the make, model, model year and other information about their vehicles. A database of this kind is shown in FIG. 1A of the drawings, which shows just eight selected rows of a much larger relation of over one thousand rows, as can be seen from the row number which is included to the left of each row. It should also be noted that the rows of the relation are ordered by the owner's surname. An exemplary query into such a relation may seek rows (otherwise known as tuples) identifying the following attribute values: surname “Smith”, city name “Atlanta”, and vehicle manufacturer “VW”; that is, seeking Atlanta named Smith who own VW vehicles. This query involves forming the intersection or “AND” of the results of three selection criteria, “Smith”, “Atlanta”, and “VW”.
If the data is arranged on disk storage in order according to the surname of the vehicle owners, as it is in FIG. 1A, the tuples meeting criteria on the surname attribute, are likely to be found by reading only a relatively small number of records from a relatively small number of pages of disk storage. For example, the row number for Beth Smith is 1011 and the row number for Harold Smith is 1014, so these are likely to be in a common page that can be swapped from disk to memory. In this case, the described query would be most efficiently performed by selecting rows based upon that ordered criterion first, i.e., selecting those rows having the surname “Smith”, and then identifying those rows within that initial result in the city “Atlanta”, and then identifying those rows within that intermediate result with the value “VW” for the vehicle manufacturer attribute. A far less efficient approach would be to identify all persons with “VW” vehicles, and then locating those with the surname “Smith” or addresses in the city “Atlanta”.
However, in many cases, the data will be somewhat or completely randomly organized on disk with regard to an attribute of interest. FIG. 1B is a table with the same data as in FIG. 1A, but not ordered by the owner's surname. Rather, in the illustrated example, the eight rows illustrated in FIG. 1A are randomly scattered on disk. In such a situation, retrieving tuples meeting the first criterion, i.e., owner surname “Smith”, is likely to require an extensive number of reads from dispersed areas in disk storage. (The two Smiths noted above, appear in rows 51 and 1496.) Indeed, the data may be closer to correctly ordered on another attribute, such as the city attribute, than it is in the owner surname attribute. Thus, with the data ordered as shown in FIG. 1B, it would be best to perform the criterion on another attribute first, and then perform the criterion on the owner's surname and vehicle manufacturer.
To attempt to optimize query processing, modern database software often estimates the clustering of data prior to attempting to execute a query. U.S. Pat. No. 5,043,872 described a method for calculating a clustering coefficient from an index for a relation.
As discussed in the '872 patent, most relational data base systems maintain indexes for their relations. An index is a list stored separately from the relation, and used to select tuples in the relation for access. An index typically is much smaller than the relation, because it usually indexes only one or a few attributes of the relation, so that tuples of interest in the relation can be identified and retrieved without scanning the relation itself. In a typical single-attribute index, such as is shown in FIG. 2, each entry in the index corresponds to one tuple in the relation, and contains the value of the attribute for that tuple, and a pointer to the location of the tuple in disk storage. The index of FIG. 2 is an index for the owner attribute of the table organized according to FIG. 1B. The owner surnames in the index are in a sorted order, and thus appear at the same locations (row numbers) as the corresponding rows in the sorted table of FIG. 1A; however, the pointers in the index identify the actual locations of those rows in the actual table ordering shown in FIG. 1B.
The clustering factor described in the '872 patent is computed from an index such as that of FIG. 2. The clustering factor is proportional to the number of rows which, when in index order, are in a sequence which is the same as the sequence in which they are stored on the data pages. This clustering coefficient is used to estimate the number of pages to be accessed during the index scan. The number of page accesses may then be used to select an access path for a query, join, or other indexed data base operation.
U.S. Pat. No. 5,778,353 describes a refinement of the '872 patent, that accounts for variations in the size of the buffer pool available, at the time access strategy is being developed (i.e. when the optimizer is selecting the best index). Both of these U.S. Patents are hereby incorporated herein by reference.
While the '872 patent describes an effective scheme for improving access planning by computing a clustering factor, the method of computation is complex and consumes substantial processing resources as part of performing an optimization of an access plan. A less computationally expensive computation of clustering factor would therefore have an important performance advantage.