The Extensible Markup Language (XML) is a meta-language developed and standardized by the World Wide Web Consortium (W3C) that permits the use and creation of customized markup languages for different types of documents. XML is a variant of, and is based on, the Standard Generalized Markup Language (SGML), the international standard meta-language for text markup systems, which is also the parent meta-language of the HyperText Markup Language (HTML).
Since its adoption as a standard language, XML has become widely used to describe and implement many kinds of document types. Increasing amounts of content are being created and stored as XML documents in modern computing systems, and these XML documents are often stored in database management systems. Therefore, there is a growing demand for database systems that can store, manage, and query XML content natively. As such, mechanisms for efficient storage and querying of arbitrary XML data are becoming important in building a scalable and robust content management platform.
The content of XML documents may be structured or unstructured. Structured data will conform to an XML schema. Unstructured data may not be associated with any specifically identifiable schema. For example, unstructured XML documents may be created as a result of ad hoc editing. As another example, an unstructured XML document may be created by combining multiple structured documents together into an unstructured collection. There are many scenarios in which users need to store and query XML documents that do not conform to any pre-defined XML schemas.
One of the severe limitations of conventional databases that work with XML data is the lack of efficient processing for schema-less XML documents, particularly when attempting to perform XPath processing on these documents. XPath is a language, defined by the W3C, for addressing parts of an XML document, in which the parts of the document are modeled as a tree of nodes. Further information about the XPath language can be found at the W3C website at http://www.w3.org/TR/xpath, the contents of which are incorporated herein by reference in their entirety. Queries involving XPath predicates are often used to filter XML documents and to extract fragments from within documents.
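The two uses of XPath mentioned above, filtering and fragment extraction, can be illustrated with a minimal sketch. It uses the limited XPath subset supported by Python's standard xml.etree.ElementTree module. The sample document, its element names, and the variable names are hypothetical and are not taken from any embodiment described herein.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema-less document, invented purely for illustration.
DOC = """
<orders>
  <order id="1"><customer>Ann</customer><total>40</total></order>
  <order id="2"><customer>Bob</customer><total>250</total></order>
  <order id="3"><customer>Ann</customer><total>975</total></order>
</orders>
"""

# The document is parsed into a tree of nodes, which XPath then addresses.
root = ET.fromstring(DOC)

# Filtering: select the orders whose <customer> child has the text "Ann".
ann_orders = root.findall("./order[customer='Ann']")
ann_ids = [order.get("id") for order in ann_orders]

# Fragment extraction: locate one order by attribute and serialize the subtree.
fragment = root.find("./order[@id='2']")
fragment_xml = ET.tostring(fragment, encoding="unicode").strip()

print(ann_ids)
print(fragment_xml)
```

In this sketch the predicate in square brackets acts as the filter, and serializing the matched node yields the extracted fragment.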
In many cases, documents that do not conform to an XML schema can only be stored in character large object (CLOB) columns. However, this mode of storage degrades the performance of XPath-based searches. Inverted indexes and functional indexes can be used to improve certain types of filter queries. However, the more general form of filter queries, which involve range predicates and collection traversals, cannot be satisfied by such indexes and hence require inefficient DOM-based evaluation. Moreover, a functional index can be built only on an XPath expression that returns a single value. If the XPath expression returns more than one value, a functional index cannot be created. An inverted list index serves as a primary filter but requires an expensive functional evaluation of the XPath as a post-filter operation. The post-filter step is a significant bottleneck, especially for large documents. Finally, neither of the two indexing options is effective in extracting fragments based on user-specified XPaths.
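The cost of DOM-based evaluation can be illustrated with a minimal sketch, which assumes that the document is available only as text, as it would be when read from a CLOB column. The sample document and the helper functions books_in_price_range and authors_per_book are hypothetical and serve only to show why a range predicate forces a full parse and traversal, and how a single path can return more than one value.

```python
import xml.etree.ElementTree as ET

# Hypothetical document text, as it might be read back from a CLOB column.
DOC = """<library>
  <book><title>A</title><price>12.50</price><author>X</author><author>Y</author></book>
  <book><title>B</title><price>48.00</price><author>Z</author></book>
</library>"""

def books_in_price_range(xml_text, low, high):
    """Roughly /library/book[price >= low and price <= high]/title,
    evaluated by parsing the whole document and walking the tree."""
    root = ET.fromstring(xml_text)  # full parse; cost grows with document size
    titles = []
    for book in root.iter("book"):
        price = book.findtext("price")
        if price is not None and low <= float(price) <= high:
            titles.append(book.findtext("title"))
    return titles

def authors_per_book(xml_text):
    """The path book/author can yield several values for a single book,
    which is the multi-valued case a single-value index cannot cover."""
    root = ET.fromstring(xml_text)
    return {
        book.findtext("title"): [a.text for a in book.findall("author")]
        for book in root.iter("book")
    }

print(books_in_price_range(DOC, 10, 20))
print(authors_per_book(DOC))
```

Every call re-parses the full text before any predicate can be tested, so the work is proportional to the size of the document rather than to the size of the result.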
Embodiments of the present invention disclose a new approach for storing, accessing, and managing data, such as XML data. Also disclosed are embodiments of new storage formats for storing XML data. The approach supports efficient evaluation of XPath queries, improves the performance of data and fragment extraction, and can be applied to schema-less documents. The invention is applicable to all database systems and other servers that support storing and managing XML content. In addition, the approach can be applied to store, manage, and retrieve other types of unstructured or semi-structured data in a database system.
Further details of aspects, objects, and advantages of the invention are described below in the detailed description, drawings, and claims. Both the foregoing general description and the following detailed description are exemplary and explanatory, and are not intended to be limiting as to the scope of the invention.