Computers have the capability of storing vast amounts of data. For example, computers can store and retain data related to thousands of employees of large multi-national corporations, including the departments in which they work, their employee numbers, salaries, job descriptions, geographical locations, etc. This data is often stored in the form of tables in a relational database. In order to extract selected portions of that data from such large computerized databases, users can present a query to the database system in the form of a Structured Query Language (SQL) statement. For example, an SQL statement may be used to ask the database system to list the names of all employees having employee numbers 1001 to 2000. A properly structured SQL statement will result in a list of records that satisfies the question or “query.” SQL uses the terms table, row, and column for relation, tuple, and attribute, respectively. Those of skill in the art understand that the terms can be used interchangeably without loss of clarity.
Once a user inputs an SQL query into the computer, an SQL compiler operates on the SQL query to develop an efficient way to extract the desired information from the database. Typically, the SQL compiler converts the SQL statement into a number of relational operators stored in computer memory in the form of a query tree. Each node of the tree represents a relational operator, such as a “sort” or “merge” operator. The optimizer portion of the compiler explores a large number of different logically equivalent forms of the query tree, called “plans”, for executing the same query. The optimizer program selects, for example, the plan with the lowest estimated cost to respond to the query, and that plan is then executed. In database parlance, “cost” is usually measured in terms of the amount of computer resources utilized by the computer in executing the SQL statement, for example, the number of I/O's or CPU instructions.
The prior art has focused on various techniques, such as the use of histograms, for developing statistics to describe the distribution of data in the database tables upon which the database programs operate. Generally, a histogram in this context is a collection of statistics describing a frequency distribution for values in a database over various intervals. It has been recognized that gathering accurate statistics about the data in the tables is important for estimating row selectivity. However, both predicate and relational operators can affect row selectivity and unique entry counts (UEC, Uec, or uec, to be used interchangeably) that are returned by an operator as the associated algorithm processes the query. A unique entry count is a number of the unique entries in an interval or column. As known in the art, predicates in SQL can be, for example, a “basic” predicate which compares two values (e.g., x=y, x is equal to y; x< >y, x is not equal to y; x<y, x is less than y; x>y, x is greater than y; x>=y, x is greater than or equal to y; and, x<=y, x is less than or equal to y); a “quantified” predicate which compares a value or values with a collection of values; a “between” predicate which compares a value with a range of values; an “exists” predicate which tests for the existence of certain rows; or an “in” predicate which compares a value or values with a collection of values. Also as known in the art, relational operators in SQL can be, for example, a “selection” which is used to find all tuples that satisfy a specified selection condition; a “projection” which is used to return, for each tuple, the values under specified attributes; a “cartesian product” which returns a new relation, R3, that contains tuples that can be obtained by concatenating every tuple in one relation, R1, with every tuple in another relation, R2; a “union” which returns all tuples that belong to either one relation, R1, or another relation, R2; a “set difference” which returns all tuples that belong to one relation, R1, but not another relation, R2; a “set intersection” which returns all tuples that belong to both one relation, R1, and another relation, R2; a “join” which returns all tuples in a cartesian product of two relations, R1×R2, satisfying a specified join condition, C, consisting of one or more basic join conditions connected by logical operators (each basic join condition compares attribute values of tuples from different relations); and, a “division” which returns tuples whose concatenation with every tuple of one relation, R1, is in another relation, R2.
The ability to accurately predict the number of rows and UECs returned by both relational operators and predicates is fundamental to computing the cost of an execution plan. The estimated cost, of course, drives the optimizer's ability to select the best plan. Accordingly, there is a need for a method and apparatus that, not only accurately assembles statistics about the tables of raw data to be processed by the database software, but also for a method and apparatus that can accurately predict the number of rows and UECs for each histogram interval that will be returned by any predicate or relational operator in a query tree.