Statistics are frequently accumulated to describe data in a database, to facilitate accesses made to the data. Statistics are typically used by an optimizer to create an access plan for accessing the data. For example, when a query seeks records meeting multiple selection criteria, the optimizer may determine that the results may assembled most efficiently by applying the selection criteria in an appropriate order. Ordering is important because the process of scanning a database for matching records is time consuming.
Among the types of selection criteria usable on database, are pattern matching criteria such as the SQL “LIKE” predicate. To understand the use of this predicate, consider a database table (otherwise known as a relation) named “Owners”, including columns (otherwise known as attributes) identifying vehicle owners by name and city, and the make, model, model year and other information about their vehicles. A table of this kind is shown in FIG. 1A of the drawings, which shows just eight selected rows of a much larger relation. It should be noted that the rows of the relation are completely randomly organized on disk with regard to the attributes.
An exemplary query into such a relation may seek rows (otherwise known as tuples) having a surname beginning with “Jo . . . ”. This query in SQL could take the form:
SELECT * from Table
WHERE Table.Surname LIKE “Jo . . . ”
This query involves finding all rows having a surname that begins with “Jo”, which would include at least the rows with “Johnson” and “Jones” of the example of FIG. 1.
Most relational data base systems maintain indexes for their relations. An index is a list stored separately from the relation, and used to select tuples in the relation for access. An index typically is much smaller than the relation, because it usually indexes only one or a few attributes of the relation, so that tuples of interest in the relation can be identified and retrieved without scanning the relation itself. For example, an index for an attribute is shown in FIG. 2; this index includes entries containing each distinct value of the Surname attribute, and a pointer to the location of the tuples having that attribute value in disk storage.
Often, a hash lookup is used to find the index entry that corresponds to a given attribute value, so as to avoid the computationally expensive process of scanning each index entry to find the desired value. Unfortunately, this lookup method is not usable when dealing with pattern matching predicates such as LIKE, since all possible strings matching the pattern matching predicate would need to be passed through the hash function to identify all matching entries in the index, which would be computationally prohibitive in most cases. Thus, in most cases the only way to find the relevant index entries for a search string such as “Jo . . . ” would be to review every index entry to determine whether it matches the pattern, or not, which is computationally prohibitive.
Performing access planning without statistics can lead to gross inefficiency. Therefore, there is a need to provide a computationally inexpensive approach to statistically estimating the number of rows of a table that will be returned by a pattern matching predicate.