This invention relates to the filed of database query optimizers, and more particularly, to an improved method and apparatus for using and manipulating histogram statistics to more accurately estimate the number of rows and unique entry counts where a "UECs" is the number of unique values represented within any particular interval of a histogram, in each histogram interval that will be produced by relational operators and predicates in a database query system. Where relational operators are operators that receive one or more tables as input and produce a new table as an output. Join, Union and Union All are examples of operators that receive two tables as inputs. Group-by and Sort are examples of relational operators that receive only one table as input. Relational operators contain or specify the predicates applied during their execution. In addition a "predicate" is an operation that specifies a comparison between two values, e.g., equal to, greater than, not equal to, greater than or equal to, less than or equal to, less than or is null. The method and apparatus can accurately model run time data flow through the nodes of a query tree, thereby enabling the associated optimizer to accurately select the best plan for a particular performance goal.
Computers have the capability of storing vast amounts of data. For example, computers can store and retain data related to thousands of employees of large multi-national corporations, including the departments in which they work, their employee numbers, salaries, job descriptions, geographical locations, etc. This data is often stored in the form of tables in a relational database. In order to extract selected portions of that data from such large computerized databases, users can present a query to the database system in the form of an SQL statement. For example, an SQL statement may be used to ask the database system to list the names of all employees having employee numbers 1001 to 2000. A properly structured SQL statement will result in a list of records that satisfies the question or "query." In this example, the query would produce the names of 1000 employees, assuming that the employees had sequential employee numbers.
Once a user inputs an SQL query into the computer, an SQL compiler operates on the SQL query to develop an efficient way to extract the desired information from the database. Typically, the SQL compiler converts the SQL statement into a number of relational operators stored in computer memory in the form of a query tree. Each node of the tree represents a relational operator, such as a "sort" or "merge" operator. The optimizer portion of the compiler explores a large number of different logically equivalent forms of the query tree, called "plans", for executing the same query. The optimizer program selects, for example, the plan with the lowest estimated cost to respond to the query, and that plan is then executed. In database parlance, "cost" is usually measured in terms of the amount of computer resources utilized by the computer in executing the SQL statement, for example, the number of I/O's or CPU instructions.
The prior art has focused on various techniques, such as the use of histograms, for developing statistics to describe the distribution of data in the database tables upon which the database programs operate. For example, it has been recognized that gathering accurate statistics about the data in the tables is important to the estimate of predicate selectivity. However, both predicate and relational operators can affect the number of rows and UECs that are returned by an operator as the associated algorithm processes the query. The ability to accurately predict the number of rows and UECs returned by both relational operators and predicates is fundamental to computing the cost of an execution plan. The estimated cost, of course, drives the optimizer's ability to select the best plan. Accordingly, there is a need for a method and apparatus that, not only accurately assembles statistics about the tables of raw data to be processed by the database software, but also for a method and apparatus that can accurately predict the number of rows and UECs for each histogram interval that will be returned by any predicate or relational operator in a query tree.