Databases are used to store information for an innumerable number of applications, including various commercial, industrial, technical, scientific and educational applications. As the reliance on information increases, both the volume of information stored in most databases, as well as the number of users wishing to access that information, likewise increases. Moreover, as the volume of information in a database, and the number of users wishing to access the database increases, the amount of computing resources required to manage such a database increases as well.
Database management systems (DBMS's), which are the computer programs that are used to access the information stored in databases, therefore often require tremendous resources to handle the heavy workloads placed on such systems. As such, significant resources have been devoted to increasing the performance of database management systems with respect to processing searches, or queries, to databases.
Improvements to both computer hardware and software have improved the capacities of conventional database management systems. For example, in the hardware realm, increases in microprocessor performance, coupled with improved memory management systems, have improved the number of queries that a particular microprocessor can perform in a given unit of time. Furthermore, the use of multiple microprocessors and/or multiple networked computers has further increased the capacities of many database management systems.
From a software standpoint, the use of relational databases, which organize information into formally-defined tables consisting of rows and columns, and which are typically accessed using a standardized language such as Structured Query Language (SQL), has substantially improved processing efficiency, as well as substantially simplified the creation, organization, and extension of information within a database.
Furthermore, significant development efforts have been directed toward query “optimization,” whereby the execution of particular searches, or queries, is optimized in an automated manner to minimize the amount of resources required to execute each query. A query optimizer typically generates, for each submitted query, an access plan. In generating an access plan, a query optimizer typically creates multiple potential access plans, and selects the best among those potential access plans based upon the “cost” of each plan. The cost of each plan represents the amount of resources expected to be utilized by the plan (typically expressed in the expected amount of time to execute), and as such, selecting the plan with the lowest cost typically results in the most efficient (and quickest) execution of a query.
One area in which query optimization has been problematic is in connection with fuzzy logic searching. In general, the introduction of fuzzy logic principles to query databases has produced query results that better reflect the intentions of users conducting the queries. Generally, before fuzzy logic, a user had to arbitrarily quantify and/or characterize their intentions via their query, often leading to query results that did not adequately reflect their intentions. For instance, a user may intend to identify “good” employees in a department, e.g., employees with good sales record and a good complaint record. Generally, the higher the sales and the lower the number of unsubstantiated complaints against the employee the better the employee. Nonetheless, without fuzzy logic, the user may have to indicate the minimum sales that the user considers to be a good sales amount, e.g., a sale amount of $50,000 or above, and a number of unsubstantiated complaints the user considers to be a good number of unsubstantiated complaints, e.g., less than or equal to five complaints. The search might be represented as    query1: SELECT * FROM EMPLOYEES WHERE SALES>=50000 AND NUM_UNSUBSTANTIATED_COMPLAINTS<=5 ORDER BY SALES DESC, NUM_UNSUBSTANTIATED_COMPLAINTS ASC
With the query above, which does not incorporate fuzzy logic principles, the user will generally receive exactly what was asked for, mainly a list of employees whose sales are $50,000 or above and the number of their unsubstantiated complaints is less than or equal to five. However, such a search may omit employees with sales of $49,999.99 with zero unsubstantiated complaints or an employee with six unsubstantiated complaints but sales of $100,000, both of whom the user may consider to be good employees. Furthermore, within the results, the user may consider an employee with sales of $50,000 and zero unsubstantiated complaints to be a better employee than one with sales of $51,000 and four unsubstantiated complaints, but such many not be reflected. Instead, the principles of fuzzy logic may be used to process database queries to limit the user from the need to make arbitrary determinations about what constitutes a good employee via the sales and the number of unsubstantiated complaints conditions.
Generally, fuzzy logic allows the sales condition to be expressed along a continuous function of sales and the number of unsubstantiated complaints condition to be expressed along a continuous function of number of unsubstantiated complaints. Group membership along each corresponding continuous function may measured by values in the range 0 to 1. Typically, a value of “1” indicates 100% membership in the group meaning the condition is completely satisfied and a value of “0” indicates 0% membership in the group meaning the condition is not generally satisfied. Values falling between 0 and 1 indicate some membership in the group and may be determined by using algorithms. A value of “0.9876”, for example, may indicate a higher membership in the group than a value of “0.1234”.
Thus, with regard to the previous example, membership in the sales group may be represented with a continuous function where sales more than $100,000, for example, indicates membership of 1 and sales of less than $40,000, for example, indicate a membership of 0. Similarly, membership in the unsubstantiated complaints group may be represented with a continuous function where if an employee has less than two unsubstantiated complaint, for example, then his or her membership in the group is 1. However, if the number of unsubstantiated complaints is more than ten, for example, then the membership is 0. For queries with more than one condition such as this, each group membership may be determined and the weighted average of the group membership values may be used, or the minimum and/or maximum group membership value may be used depending on the operation in a query, e.g., and operation, or operation, not operation, etc, to generate the user's list of good employees. Generally, the list of employees may better proximate the user's intentions of what is a good employee. For instance, employees that would typically have been omitted with the usual query illustrated above may be included, and those employees included are typically ordered based upon how they relatively meet both condition.
As noted above, the introduction of fuzzy logic principles into a database environment can complicate query optimization. In particular, it may be helpful during optimization to estimate the number of records that may be returned when a given condition in a predicate of a database query is processed. Based upon the estimate, the optimizer may be able to better compare the access plans and select the best plan under the runtime conditions.
With that end in mind, the data in a table may be sorted and placed into buckets forming a histogram and descriptive information about the buckets of the histogram may be stored. For instance, the high and low values of the buckets, the number of records in each bucket, and the high and low values of the table may be stored. Thus, optimizers may rely on formulas incorporating the stored data to estimate the number of records that may be returned for a given criteria. Some optimizers also use the formulas to make additional estimates. For example, a formula to estimate the number of records that may be returned that are less than x (e.g., the value of x may be in a predicate of a database query such as table1.field1<x) may be rewritten to estimate the number of records that may be returned that are less than or equal to x.
Generally, these techniques may be appropriate to estimate the number of records that may be returned from a single table for a given criteria; however, it has been found that estimating the number of records that may be returned after a join operation between two tables, also known as a join size, may introduce inaccuracies and/or inconsistencies. As a result, an optimizer may select a suboptimal access plan under the runtime conditions based upon the inaccuracies and inconsistencies. A join is typically an operation used to combine data from two or more different tables according to some criteria. Join operations are common in database queries and are routinely used to access data from relational databases. In particular, the mathematical properties, e.g., (A<B):=1−(A>=B) or (A<=B):=(A<B)+(A=B), used by optimizers via the formulas to estimate join sizes for additional relationships such as these inequalities are applied to values that are generally “fuzzy” in nature. Thus, estimating join sizes for these inequalities is akin to query1 hereinabove, generally leading to inadequate results.
In particular, in modern cost based query optimization, join size estimates (and access plans) used in connection with previous database queries are stored and reused for faster access and recovery. However, reusing estimates in this context becomes problematic in conventional designs. For instance, if a result was already available for A<B as well as for A=B, it would not be sound in conventional designs to combine these results to estimate a result for A>B, because using conventional calculations |A<=B|+|A>B|< >|Cross Product|. Instead, it is possible that |Cross Product|−|A<=B| may result in a value less than, or very close to zero. Negative numbers and numbers close to zero often cause radically different behavior than slightly larger numbers, and are thus undesirable. Furthermore, even when the formulas are not rewritten by optimizers to estimate additional relationships, inconsistencies may result if a user reformulates the database query.
A need therefore exists in the art for an improved approach of estimating a join size, and in particular, an improved approach for estimating join sizes that results in more consistent estimates.