The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely sophisticated devices, and computer systems may be found in many different settings. Computer systems typically include a combination of hardware, such as semiconductors and circuit boards, and software, also known as computer programs.
Fundamentally, computer systems are used for the storage, manipulation, and analysis of data, which may be anything from complicated financial information to simple baking recipes. It is no surprise, then, that the overall value or worth of a computer system depends largely upon how well the computer system stores, manipulates, and analyzes data. One mechanism for managing data is called a database management system (DBMS), which may also be called a database system or simply a database.
Many different types of databases are known, but the most common is usually called a relational database (RDB), which organizes data in tables. Each table has rows, which represent individual entries or records in the database, and columns, which define what is stored in each entry or record. Each table has a unique name within the database and each column has a unique name within the particular table. The database may also have an index, which is a data structure that informs the database management system of the location of a certain row in a table given an indexed column value, analogous to a book index informing the reader on which page a given word appears.
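By way of illustration only, the table, row, column, and index concepts described above may be sketched using Python's built-in sqlite3 module. The table name `recipe`, its columns, the index name `recipe_name_idx`, and the sample data are assumptions introduced for this sketch and do not come from any particular database.

```python
import sqlite3

# Hypothetical in-memory database; all names and data are illustrative.
conn = sqlite3.connect(":memory:")

# A table has a unique name; each column has a unique name within the table.
conn.execute(
    "CREATE TABLE recipe (id INTEGER PRIMARY KEY, name TEXT, minutes INTEGER)"
)

# Each inserted tuple becomes one row (one record) in the table.
conn.executemany(
    "INSERT INTO recipe (name, minutes) VALUES (?, ?)",
    [("bread", 180), ("scones", 25), ("brownies", 40)],
)

# An index on the name column lets the DBMS locate rows by name
# without reading every row, much like a book index.
conn.execute("CREATE INDEX recipe_name_idx ON recipe (name)")

rows = conn.execute(
    "SELECT name, minutes FROM recipe WHERE name = ?", ("scones",)
).fetchall()
print(rows)
```

Here the query names only the data wanted; whether the index is actually used to find the row is left to the database management system.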
To be useful, the data stored in databases must be capable of being retrieved in an efficient manner. The most common way to retrieve data from a database is through statements called database queries, which may originate from user interfaces, application programs, or remote systems, such as clients or peers. A query is an expression evaluated by the database management system. As one might imagine, queries range from being very simple to very complex. Although the query requires the return of a particular data set in response, the method of query execution is typically not specified by the query. Thus, after the database management system receives a query, the database management system interprets the query and determines what internal steps are necessary to satisfy the query. These internal steps may include an identification of the table or tables specified in the query, the row or rows selected in the query, and other information such as whether to use an existing index, whether to build a temporary index, whether to use a temporary file to execute a sort, and/or the order in which the tables are to be joined together to satisfy the query.
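As a minimal sketch of the point that a query states what data is required but not how to obtain it, SQLite's `EXPLAIN QUERY PLAN` statement may be used to ask the database management system which internal steps it has chosen. The `employee` table and the index name `employee_dept_idx` are hypothetical, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

# Hypothetical schema; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.execute("CREATE INDEX employee_dept_idx ON employee (dept)")

# The query describes the required rows but says nothing about indexes,
# temporary files, or join order.
query = "SELECT name FROM employee WHERE dept = ?"

# Ask the DBMS to describe the steps it would take for this query.
plan_rows = conn.execute("EXPLAIN QUERY PLAN " + query, ("sales",)).fetchall()

# The last column of each result row is the human-readable step description.
plan_details = [row[-1] for row in plan_rows]
for detail in plan_details:
    print(detail)
```

In this sketch the reported step is expected to mention the existing index, which is one of the decisions listed above that the database management system makes on the query's behalf.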
When taken together, these internal steps are referred to as an execution plan, an access plan, or just a plan. The access plan is typically created by a software component of the database management system that is often called a query optimizer. The query optimizer may be part of the database management system or separate from, but in communication with, the database management system. When a query optimizer creates an access plan for a given query, the access plan is often saved by the database management system in the program object, e.g., the application program, that requested the query. The access plan may also be saved in an SQL (Structured Query Language) package or an access plan cache. Then, when the user or program object repeats the query, which is a common occurrence, the database management system can find and reutilize the associated saved access plan instead of undergoing the expensive and time-consuming process of recreating the access plan. Thus, reusing access plans increases the performance of queries when performed by the database management system.
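The reuse of saved access plans described above may be sketched, under simplifying assumptions, as a lookup table keyed by the query text. The class `PlanCache`, the function `toy_optimizer`, and the string form of a plan are all hypothetical; an actual access plan cache would hold far richer plan objects and would also need to invalidate plans when tables or indexes change.

```python
from typing import Callable, Dict


class PlanCache:
    """Toy access plan cache keyed by query text (illustrative only)."""

    def __init__(self, build_plan: Callable[[str], str]) -> None:
        self._build_plan = build_plan      # stands in for the query optimizer
        self._plans: Dict[str, str] = {}   # saved access plans
        self.builds = 0                    # how many times a plan was created

    def get_plan(self, query_text: str) -> str:
        plan = self._plans.get(query_text)
        if plan is None:
            # Cache miss: pay the cost of creating the plan once, then save it.
            plan = self._build_plan(query_text)
            self._plans[query_text] = plan
            self.builds += 1
        return plan


def toy_optimizer(query_text: str) -> str:
    # Hypothetical optimizer that always proposes a full table scan.
    return "full scan for: " + query_text


cache = PlanCache(toy_optimizer)
first = cache.get_plan("SELECT * FROM t WHERE c = 1")
second = cache.get_plan("SELECT * FROM t WHERE c = 1")  # reuses the saved plan
print(cache.builds)
```

The second request for the same query text returns the saved plan without invoking the optimizer again, which is the performance benefit described above.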
Many different access plans may be created for any one query, each of which returns the required data set, yet the different access plans may provide widely different performance. Thus, especially for large databases, the access plan selected by the database management system needs to provide the required data at a reasonable cost in terms of time and hardware resources. Hence, the query optimizer often creates multiple prospective access plans and then chooses the best, or least expensive, one to execute.
One factor that contributes to the cost of a particular access plan is the number of rows that a query using that access plan returns from a database table. A query that returns a large number of rows may run most efficiently with one access plan, while a query that returns only a small number of rows may run most efficiently with a different access plan. Hence, in an attempt to choose the best access plan for a particular query, current query optimizers estimate the number of rows that the query will return when executed based on the number of unique values in a column of the table to which the query is directed. This number of unique values is called the cardinality of the column.
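A minimal sketch of cardinality-based estimation and plan selection follows. It assumes an equality predicate on one column and a uniform distribution of values, so that the estimated row count is the number of rows divided by the cardinality; that formula, the function names `estimate_rows` and `choose_plan`, and the cost constants are assumptions made for illustration and are not taken from any particular query optimizer.

```python
from typing import List


def estimate_rows(column_values: List[str]) -> float:
    """Estimate rows matched by `column = value`, assuming uniform values."""
    cardinality = len(set(column_values))  # number of unique values
    if cardinality == 0:
        return 0.0
    return len(column_values) / cardinality


def choose_plan(total_rows: int, est_rows: float) -> str:
    """Pick the least expensive of two hypothetical access plans."""
    costs = {
        "table_scan": float(total_rows),  # read every row once
        "index_probe": 4.0 * est_rows,    # assumed cost per matching row
    }
    return min(costs, key=costs.get)


# Low cardinality: 2 unique values in 100 rows, so many rows match.
dept = ["sales"] * 50 + ["ops"] * 50
est = estimate_rows(dept)
plan_low_card = choose_plan(len(dept), est)

# High cardinality: 100 unique values in 100 rows, so about one row matches.
ids = [str(i) for i in range(100)]
plan_high_card = choose_plan(len(ids), estimate_rows(ids))

print(est, plan_low_card, plan_high_card)
```

With these assumed costs, the query expected to return many rows is given the table scan, while the query expected to return about one row is given the index probe, which illustrates why the row estimate influences the choice of access plan.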
While using the cardinality of a column as an estimate for the number of rows returned by a query may work well for conventional queries, it does not work well for recursive queries. A recursive query returns rows that have relationships to an arbitrary depth in a table and provides an easier way of traversing tables that represent tree or graph data structures. For example, given a table that represents the reporting relationships within a company, a recursive query may return all employees that report, directly or indirectly, to one particular person. Recursive queries typically contain an initial sub-query, called the seed, and a recursive sub-query that, during each iteration, appends additional rows to the result set. An example of a recursive query is the SQL recursive common table expression (RCTE). Unfortunately, the cardinality function merely calculates the number of unique values in a column and ignores both the recursive nature of the query and the relationships of the data within the column. Hence, conventional query optimizers experience difficulty choosing the most efficient access plan for recursive queries.
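The reporting-relationship example above may be sketched as a recursive common table expression run through Python's sqlite3 module. The `employee` table, the employee names, and the "rows divided by cardinality" estimate are assumptions chosen for illustration; the sketch is intended only to show how far such an estimate can fall from the number of rows a recursive query actually returns.

```python
import sqlite3

# Hypothetical reporting hierarchy: ann manages bob and cat,
# bob manages dan, dan manages eve, and cat manages fay.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT PRIMARY KEY, manager TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("ann", None), ("bob", "ann"), ("cat", "ann"),
     ("dan", "bob"), ("eve", "dan"), ("fay", "cat")],
)

reports_sql = """
WITH RECURSIVE reports(name) AS (
    SELECT name FROM employee WHERE manager = ?       -- seed: direct reports
    UNION ALL
    SELECT e.name                                     -- recursive sub-query:
    FROM employee AS e                                -- reports of rows found
    JOIN reports AS r ON e.manager = r.name           -- in earlier iterations
)
SELECT name FROM reports ORDER BY name
"""
reports = [row[0] for row in conn.execute(reports_sql, ("ann",))]

# Assumed conventional estimate: total rows / unique values in `manager`.
total, distinct_managers = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT manager) FROM employee"
).fetchone()
naive_estimate = total / distinct_managers

print(reports, naive_estimate)
```

In this small data set the recursive query returns five employees, whereas the cardinality-based figure is 1.5, because the cardinality of the `manager` column says nothing about how deep the reporting chains run.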
Thus, there is a need for a technique configured to estimate the number of rows returned by a recursive query in a database environment.