Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as “records” having “fields” of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system or DBMS is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about the underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added or removed from data files, information retrieved from or updated in such files, and so forth, all without user knowledge of the underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of database management systems is well known in the art. See e.g., Date, C., “An Introduction to Database Systems, Seventh Edition”, Part I (especially, Chapters 1-4), Addison Wesley, 2000.
In recent years, applications running on database systems frequently provide for business-to-business or business-to-consumer interaction via the Internet between the organization hosting the application and its business partners and customers. Today, many organizations receive and transmit considerable quantities of information to business partners and customers through the Internet. A considerable portion of the information received or exchanged is in Extensible Markup Language or “XML” format. XML is a pared-down version of SGML (Standard Generalized Markup Language), designed especially for Web documents, which allows designers to create their own customized tags, enabling the definition, transmission, validation, and interpretation of data between applications and between organizations. For further description of XML, see e.g., “Extensible Markup Language (XML) 1.0” (Second Edition, Oct. 6, 2000) a recommended specification from the W3C, the disclosure of which is hereby incorporated by reference. A copy of this specification is available via the Internet (e.g., currently at www.w3.org/TR/2000/REC-xml-20001006). Many organizations utilize XML to exchange data with other remote users over the Internet.
Given the increasing use of XML in recent years, many organizations now have considerable quantities of data in XML format, including Web documents, newspaper articles, product catalogs, purchase orders, invoices, and product plans. As a result, these organizations need to be able to efficiently store, maintain, and use this XML information in an efficient manner. However, this XML data is not in a format that can be easily stored and searched in current database systems. Most XML data is sent and stored in plain text format. This data is not formatted in tables and rows like information stored in a relational DBMS. To search this semi-structured data, users typically utilize keyword searches similar to those utilized by many current Internet search engines. These keyword searches are resource-intensive and are not as efficient as relational DBMS searches of structured data.
Organizations with data in XML format also typically have other enterprise data stored in a structured format in database management systems. Increasingly, database system users are demanding that database systems provide the ability to access and use both structured data stored in these databases as well as XML and other unstructured or semi-structured data. In addition, users desire flexible tools and facilities for performing searches of this data.
One of the key roles of a database management system (DBMS) is to retrieve data stored in a database based on specified selection criterion. This typically involves retrieving data in response to a query that is specified in a query language. One particular need is for a solution that will enable efficient searches of information in XML documents. For instance, it would be desirable to have an XML version of SQL (Structured Query Language) that would enable a user to easily retrieve all nodes of type X that have descendants of type Y from an XML document.
One current solution used in XML-based applications to query the contents of an XML document is XPath. The XPath query language is commonly used in Extensible Stylesheet Language Transformations (XSLT) to locate and to apply XSLT templates to specific nodes in an XML document. XPath queries are also commonly used to locate and to process nodes in an XML document that match a specified criteria. XPath provides basic facilities for manipulation of strings, numbers and Booleans. It uses a compact, non-XML syntax to facilitate use of XPath within URLs and XML attribute values. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document. For further description of XPath, see e.g., “XML Path Language (XPath) Version 2.0” (Jan. 23, 2007), a recommended specification from the W3C, the disclosure of which is hereby incorporated by reference. A copy of this specification is available via the Internet (e.g., currently at http://www.w3.org/TR/xpath20/).
To extract information embedded in XML data efficiently, native binary store of the XML data in database system tables has been used. However, with the size of the XML file becoming bigger and bigger, a more concise format is needed to reduce the size of the native XML binary. The present invention addresses this need.