1. Field of the Invention
The present invention relates generally to data processing environments and, more particularly, to a database system with a path based query engine providing methodology for path based query processing.
2. Description of the Background Art
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as “records” having “fields” of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about the underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information retrieved from or updated in such files, and so forth, all without user knowledge of the underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of database management systems is well known in the art. See e.g., Date, C., “An Introduction to Database Systems, Seventh Edition”, Part I (especially Chapters 1-4), Addison Wesley, 2000.
In recent years, applications running on database systems frequently provide for business-to-business or business-to-consumer interaction via the Internet between the organization hosting the application and its business partners and customers. Today, many organizations receive and transmit considerable quantities of information to business partners and customers through the Internet. A considerable portion of the information received or exchanged is in Extensible Markup Language or “XML” format. XML is a pared-down version of SGML (Standard Generalized Markup Language), designed especially for Web documents, which allows designers to create their own customized tags, enabling the definition, transmission, validation, and interpretation of data between applications and between organizations. For further description of XML, see e.g., “Extensible Markup Language (XML) 1.0” (Second Edition, Oct. 6, 2000) a recommended specification from the W3C, the disclosure of which is hereby incorporated by reference. A copy of this specification is available via the Internet (e.g., currently at www.w3.org/TR/2000/REC-xml-20001006). Many organizations utilize XML to exchange data with other remote users over the Internet.
Given the increasing use of XML in recent years, many organizations now have considerable quantities of data in XML format, including Web documents, newspaper articles, product catalogs, purchase orders, invoices, and product plans. As a result, these organizations need to be able to efficiently store, maintain, and use this XML information. However, this XML data is not in a format that can be easily stored and searched in current database systems. Most XML data is sent and stored in plain text format. This data is not formatted in tables and rows like information stored in a relational DBMS. To search this semi-structured data, users typically utilize keyword searches similar to those utilized by many current Internet search engines. These keyword searches are resource-intensive and are not as efficient as relational DBMS searches of structured data.
Organizations with data in XML format also typically have other enterprise data stored in a structured format in database management systems. Increasingly, database system users are demanding that database systems provide the ability to access and use both structured data stored in these databases as well as XML and other unstructured or semi-structured data. In addition, users desire flexible tools and facilities for performing searches of this data.
One of the key roles of a database management system (DBMS) is to retrieve data stored in a database based on specified selection criteria. This typically involves retrieving data in response to a query that is specified in a query language. One particular need is for a solution that will enable efficient searches of information in XML documents. For instance, it would be desirable to have an XML version of SQL (Structured Query Language) that would enable a user to easily retrieve all nodes of type X that have descendants of type Y from an XML document.
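The kind of request described above can be illustrated with a short sketch. It uses Python's standard xml.etree.ElementTree module; the sample document and the helper name nodes_with_descendant are hypothetical and serve only to make the "nodes of type X that have descendants of type Y" example concrete. It is not a description of how any particular database system evaluates such a query.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the tag names X and Y mirror the
# placeholders used in the text above.
SAMPLE = """
<root>
  <X id="1"><a><Y/></a></X>
  <X id="2"><a/></X>
  <Z><X id="3"><Y/></X></Z>
</root>
"""


def nodes_with_descendant(root, node_tag, descendant_tag):
    """Return elements named node_tag that contain a descendant named descendant_tag."""
    matches = []
    for node in root.iter(node_tag):
        # Element.iter() also yields the starting element itself when its
        # tag matches, so that element is excluded explicitly.
        if any(d is not node for d in node.iter(descendant_tag)):
            matches.append(node)
    return matches


root = ET.fromstring(SAMPLE)
result_ids = [x.get("id") for x in nodes_with_descendant(root, "X", "Y")]
print(result_ids)
```

Note that this naive approach walks the subtree of every candidate node, which is the sort of cost a declarative query language and an optimizing query engine are meant to avoid.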
One current solution used in XML-based applications to query the contents of an XML document is the XPath query language. XPath is commonly used in Extensible Stylesheet Language Transformations (XSLT) to locate and to apply XSLT templates to specific nodes in an XML document. XPath queries are also commonly used to locate and to process nodes in an XML document that match specified criteria. XPath provides basic facilities for manipulation of strings, numbers and booleans. It uses a compact, non-XML syntax to facilitate use of XPath within URIs and XML attribute values. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document. For further description of XPath, see e.g., “XML Path Language (XPath) Version 1.0” (Nov. 16, 1999), a recommended specification from the W3C, the disclosure of which is hereby incorporated by reference. A copy of this specification is available via the Internet (e.g., currently at www.w3c.org/TR/xpath).
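As a brief illustration of the path notation, the following sketch evaluates a few path expressions against a small document. It uses Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath 1.0; the catalog document and its element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog used only to demonstrate the path notation.
CATALOG = """
<catalog>
  <book lang="en"><title>Databases</title><price>30</price></book>
  <book lang="fr"><title>XML</title><price>20</price></book>
  <magazine><title>Queries</title></magazine>
</catalog>
"""

root = ET.fromstring(CATALOG)

# Steps separated by "/" walk down the hierarchy one level at a time.
book_titles = [t.text for t in root.findall("./book/title")]

# "//" selects matching elements at any depth below the context node.
all_titles = [t.text for t in root.findall(".//title")]

# A predicate in square brackets filters nodes, here by attribute value.
english_titles = [b.find("title").text for b in root.findall("./book[@lang='en']")]

print(book_titles)
print(all_titles)
print(english_titles)
```

Each expression addresses nodes by their position in the logical tree rather than by the textual layout of the document, which is the property the paragraph above describes.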
Although XPath provides a mechanism for locating nodes in an XML document that match specified criteria, problems remain in the processing of queries written in the XPath query language in current systems. One problem is in generating correct and efficient query plans from XPath expressions. In SQL query processing in current database systems, normalization and preprocessing components or layers are typically used to perform important tasks of semantic validation and tree transformations of SQL queries. In SQL query processing, these layers typically process a raw query tree and transform it into a correct and efficient tree structure for input into the optimizer/code generator of the database system. The optimizer/code generator can then translate this tree structure into more efficient query plans. Existing XPath query engines currently lack these normalization and preprocessing components, which can result in incorrect processing of some queries and/or extremely large query plans in some instances.
XPath query parsers of current XPath query solutions generally construct trees assuming that the basic component of the tree is an element or attribute. However, XML storage and access in such systems is based on distinct paths in the XML document under consideration. In current systems, XML documents are frequently stored as a collection of paths (a path index) and a value index. As a result, documents or fragments of documents are accessed using path based scans. However, in current XPath query engines, paths are constructed at the last moment; namely, at code generation time. Existing systems separately deal with components of paths; namely, elements, attributes and wildcards. Other XPath operators such as descendants, parentheses, and filters (predicates) are also processed without an understanding of semantic relationships between paths and the other XPath operators. The element-by-element processing in the query engine and path based access in the storage layer represent an inherent mismatch (referred to as an “impedance mismatch”) in the query processing model of current systems. This impedance mismatch between the output of the parser (which is input to the code generator) and the output of the code generator results in an inefficient and sometimes error-prone code generation process in processing path based queries in current systems. The problems which can result include large plans which may cause stack overflows in some cases and incorrect plans causing stack traces in certain other cases.
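To make the notion of path based storage concrete, the following is a minimal sketch of a path index and a value index built from a small document, together with a path based scan. The layout shown (a mapping from each distinct path to node identifiers, plus a mapping from node identifiers to values) and the names build_indexes and path_scan are assumptions made for illustration; the actual storage format of any given system may differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical document used only for illustration.
DOC = '<order id="7"><item><name>pen</name></item><item><name>ink</name></item></order>'


def build_indexes(root):
    """Build an illustrative path index and value index for one document."""
    path_index = defaultdict(list)  # distinct path -> node ids in document order
    value_index = {}                # node id -> text or attribute value
    counter = 0

    def visit(elem, parent_path):
        nonlocal counter
        path = f"{parent_path}/{elem.tag}"
        node_id = counter
        counter += 1
        path_index[path].append(node_id)
        text = (elem.text or "").strip()
        if text:
            value_index[node_id] = text
        # Attributes are recorded as their own paths, e.g. /order/@id.
        for name, value in elem.attrib.items():
            attr_id = counter
            counter += 1
            path_index[f"{path}/@{name}"].append(attr_id)
            value_index[attr_id] = value
        for child in elem:
            visit(child, path)

    visit(root, "")
    return dict(path_index), value_index


def path_scan(path_index, value_index, path):
    """Return the values of all nodes stored under one complete path."""
    return [value_index.get(node_id) for node_id in path_index.get(path, [])]


path_index, value_index = build_indexes(ET.fromstring(DOC))
distinct_paths = sorted(path_index)
names = path_scan(path_index, value_index, "/order/item/name")
print(distinct_paths)
print(names)
```

In this layout a single lookup on the complete path /order/item/name retrieves the matching nodes, whereas a query tree built from the separate steps order, item, and name must be reassembled into that path before the storage layer can be consulted.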
What is needed is a solution which provides improved processing of path based queries. Ideally, the solution should transform element-based parse trees into path-based parse trees so as to enable improved processing of XPath queries. The present invention provides a solution for these and other needs.