Database Query Optimizer
In a database system, data is stored in one or more data containers. Each container contains records, and the data within each record is organized into one or more fields. In relational database systems, the data containers are referred to as tables, the records are referred to as rows, and the fields are referred to as columns. In object-oriented databases, the data containers are referred to as object classes, the records are referred to as objects, and the fields are referred to as attributes. Other database architectures may use other terminology.
A database management system (DBMS) retrieves and manipulates data in response to receiving a database statement. Typically, the database statement conforms to a database language, such as the Structured Query Language (SQL). A database statement can specify a query operation, a data manipulation operation, or a combination thereof. A database statement that specifies a query operation is referred to herein as a query.
When a DBMS receives a query, the DBMS may generate an execution plan. An execution plan defines the steps and operations that the DBMS performs to service the request. DBMSs often include an optimizer for generating execution plans that are optimized for efficiency. When determining which steps to include in an execution plan, and the order in which the steps are performed, a DBMS accounts for many factors that affect efficiency. One important factor is the computational cost of executing a query according to a given execution plan. A cost-based optimizer (CBO) evaluates the candidate data access paths for a query and selects the execution plan with the lowest estimated cost across those access paths.
For example, a query with two ANDed predicates will request rows that satisfy both predicates. If the column or columns referenced in the first predicate are indexed, then a DBMS may generate an execution plan that uses the index to access the data more efficiently.
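By way of illustration only, the effect of an index on the chosen plan can be observed with a small sketch. The sketch uses SQLite through Python's standard sqlite3 module as a stand-in DBMS; the table emp, its columns, and the index name emp_dept_idx are hypothetical, and the exact wording of the plan text may differ between SQLite versions.

```python
import sqlite3


def plan_for(conn, sql):
    """Return the plan detail strings that SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
# Hypothetical table used only for this illustration.
conn.execute("CREATE TABLE emp (id INTEGER, dept INTEGER, salary INTEGER)")
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(i, i % 10, 1000 + i) for i in range(1000)],
)

# A query with two ANDed predicates.
query = "SELECT id FROM emp WHERE dept = 3 AND salary > 1500"

# Without an index, the only access path is a scan of the table.
before = plan_for(conn, query)

# Index the column referenced in the first predicate.
conn.execute("CREATE INDEX emp_dept_idx ON emp (dept)")
after = plan_for(conn, query)

print("before index:", before)
print("after index: ", after)
```

In this sketch the plan reported after the index is created is expected to reference emp_dept_idx, with the second predicate applied to the rows that the index returns.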
To determine an efficient execution plan for a query, the query optimizer relies on persistently stored statistics to estimate the costs of alternative execution plans, and chooses the plan with the lowest overall estimated cost. The statistics are computed and stored before the query is received. Statistics are used to estimate important optimizer cost parameters such as the selectivity of various predicates and predicate clauses (e.g., the fraction or percentage of rows in a table that match some condition represented in a predicate or predicate clause). Examples of statistics include table cardinalities (the number of rows in a table), the number of distinct values for a column, the minimum and maximum values in the column, and histograms, which are data that specify the distribution of values in a column, e.g., the number of rows that have a particular column value or the number of rows that have a column value that falls within a range. However, for some database statements, statistics needed by the query optimizer may not be available, such as statistics for certain repositories managed by the DBMS.
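The use of such statistics to estimate selectivity can be sketched as follows. This is a minimal illustration, not the estimator of any particular DBMS: the class ColumnStats and the function names are hypothetical, and the formulas rest on the common textbook assumptions that values are uniformly distributed and that ANDed predicates are independent. A histogram would replace the uniformity assumption with observed frequencies.

```python
from dataclasses import dataclass


@dataclass
class ColumnStats:
    """Hypothetical container for precomputed statistics on one column."""
    num_rows: int        # table cardinality
    num_distinct: int    # number of distinct values in the column
    min_value: float
    max_value: float


def equality_selectivity(stats):
    """Fraction of rows matching `col = v`, assuming uniform values."""
    return 1.0 / stats.num_distinct


def less_than_selectivity(stats, value):
    """Fraction of rows matching `col < value`, assuming uniform values."""
    if value <= stats.min_value:
        return 0.0
    if value >= stats.max_value:
        return 1.0
    return (value - stats.min_value) / (stats.max_value - stats.min_value)


def and_selectivity(sel_a, sel_b):
    """Combined selectivity of two ANDed predicates, assuming independence."""
    return sel_a * sel_b


def estimated_rows(stats, selectivity):
    """Estimated number of rows that survive a predicate."""
    return round(stats.num_rows * selectivity)


stats = ColumnStats(num_rows=10_000, num_distinct=50, min_value=0, max_value=100)
eq = equality_selectivity(stats)
lt = less_than_selectivity(stats, 25)
print(eq, lt, estimated_rows(stats, and_selectivity(eq, lt)))
```

When no statistics exist for a column or operator, an estimator of this kind has nothing to compute from and can only substitute fixed default values.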
XML Database
With support for XML type data as a native data type in information management systems, such as a relational database system (RDBMS) or object-relational database system (ORDBMS), the contents of XML documents can be stored in such systems. For example, in the context of a relational database, XML data may be stored in columns of a relational table and users can query the XML data via a SQL query.
One known implementation of an XML data repository, which provides the mechanisms for the storage of XML data in a RDBMS and access thereto, is referred to herein as an XML database (“XDB”). The key XDB-enabling technologies can be grouped into two major classes: (1) XML data type, which provides a native XML storage and retrieval capability strongly integrated with SQL; and (2) XDB repository, which provides foldering, access control, versioning, and the like, for XML resources.
The XML data type can be used as the data type of a column of a relational table, and includes a number of useful methods to operate on XML data. XML type data can be stored, for example, as a LOB (large object) or according to object-relational storage. If stored as a LOB, XML data may be accessed via a text index, and if stored object-relationally, XML data may be accessed via a btree index, for example. Some benefits that result from the XML data type include support for XML schemas, XPath searches, XML indexes, XML operators, XSL transformations, and XDB repository views (e.g., RESOURCE_VIEW and PATH_VIEW, described hereafter).
The XDB repository provides a repository for managing XML data. The XDB repository provides important functionality with respect to the XML data, for example, access control lists (ACL), foldering, WebDAV (Web-based Distributed Authoring and Versioning), FTP (File Transfer Protocol) and JNDI (Java Naming and Directory Interface) access, SQL repository search, hierarchical indexing, and the like.
XDB repository views provide a mechanism for SQL access to data that is stored in the XDB. Data stored in the XDB repository via protocols such as FTP, WebDAV, or JNDI can be accessed in SQL via these views. XDB provides two repository views to enable SQL access to the repository: RESOURCE_VIEW and PATH_VIEW. Both views contain the resource properties, the path names, and the resource IDs. The PATH_VIEW has an additional column for the link properties.
With prior approaches to cost-based optimizers, the optimizers were unable to retrieve the real cost of a query on XDB repository views, so the optimizer relied on default statistics to choose a query execution plan. Because the CBO is not aware of the implementation of XDB repository views or of the user-defined operators associated with the views, the CBO can only fall back on default statistics, which are far from accurate. The result is suboptimal query execution plans. For example, in the absence of an optimizer mechanism for an XDB repository, the CBO may choose a suboptimal query plan involving both a hierarchical index scan and a btree index scan, even where the selectivity of the predicate with the XDB operator is very high (the predicate matches a large fraction of rows) while the selectivity of the predicate on the btree-indexed column is very low (the predicate matches few rows). In such a scenario, the optimal query plan would be a btree index scan followed by functional evaluation of the repository view operators.
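The scenario above can be made concrete with a toy cost model. Everything in this sketch is an assumption made for illustration: the row count, the per-row cost constants, the default selectivity guess, and the plan names are hypothetical and do not describe the cost model of any actual optimizer.

```python
from dataclasses import dataclass


@dataclass
class Selectivities:
    xdb_operator: float     # fraction of rows matched by the XDB operator predicate
    btree_predicate: float  # fraction of rows matched by the btree-indexed predicate


# Hypothetical table size and per-row costs, in arbitrary units.
NUM_ROWS = 100_000
HIER_INDEX_COST_PER_ROW = 1.0
BTREE_INDEX_COST_PER_ROW = 1.0
FUNCTIONAL_EVAL_COST_PER_ROW = 5.0


def plan_costs(sel):
    """Estimate the cost of the two candidate plans for given selectivities."""
    hier_rows = NUM_ROWS * sel.xdb_operator
    btree_rows = NUM_ROWS * sel.btree_predicate
    return {
        # Scan both indexes and combine the results.
        "both_index_scans": hier_rows * HIER_INDEX_COST_PER_ROW
        + btree_rows * BTREE_INDEX_COST_PER_ROW,
        # Scan the btree index, then evaluate the operator on surviving rows.
        "btree_then_functional_eval": btree_rows * BTREE_INDEX_COST_PER_ROW
        + btree_rows * FUNCTIONAL_EVAL_COST_PER_ROW,
    }


def choose_plan(sel):
    """Pick the plan with the lowest estimated cost."""
    costs = plan_costs(sel)
    return min(costs, key=costs.get)


# Default guess used when no statistics exist for the operator (hypothetical).
default_guess = Selectivities(xdb_operator=0.001, btree_predicate=0.001)
# Actual selectivities in the scenario described above.
actual = Selectivities(xdb_operator=0.9, btree_predicate=0.001)

chosen_with_defaults = choose_plan(default_guess)
chosen_with_actuals = choose_plan(actual)
print(chosen_with_defaults, plan_costs(actual)[chosen_with_defaults])
print(chosen_with_actuals, plan_costs(actual)[chosen_with_actuals])
```

Under these assumed numbers, the default guess makes the plan with two index scans look cheaper, yet its cost under the actual selectivities is far higher than that of the btree index scan followed by functional evaluation.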
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.