A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
A Computer Program Listing Appendix, containing fourteen (14) total files on compact disc, is included with this application.
1. Field of the Invention
The present invention relates generally to information processing environments and, more particularly, to a database system providing methods that handle, manage, and store information in Extensible Markup Language (XML) format and that support queries of XML documents and data.
2. Description of the Background Art
Computers are very powerful tools for storing and providing access to vast amounts of information. Computer databases are a common mechanism for storing information on computer systems while providing easy access to users. A typical database is an organized collection of related information stored as "records" having "fields" of information. As an example, a database of employees may have a record for each employee where each record contains fields designating specifics about the employee, such as name, home address, salary, and the like.
Between the actual physical database itself (i.e., the data actually stored on a storage device) and the users of the system, a database management system (DBMS) is typically provided as a software cushion or layer. In essence, the DBMS shields the database user from knowing or even caring about underlying hardware-level details. Typically, all requests from users for access to the data are processed by the DBMS. For example, information may be added to or removed from data files, information retrieved from or updated in such files, and so forth, all without user knowledge of the underlying system implementation. In this manner, the DBMS provides users with a conceptual view of the database that is removed from the hardware level. The general construction and operation of a database management system is known in the art. See e.g., Date, C., "An Introduction to Database Systems, Volume I and II," Addison Wesley, 1990, the disclosure of which is hereby incorporated by reference.
DBMS systems have long since moved from a centralized mainframe environment to a de-centralized or distributed environment. One or more PC "client" systems, for instance, may be connected via a network to one or more server-based database systems. Commercial examples of these "client/server" systems include Powersoft® clients connected to one or more Sybase® Adaptive Server® database servers. Both Powersoft® and Sybase® Adaptive Server® (formerly Sybase® SQL Server®) are available from Sybase, Inc. of Emeryville, Calif.
In recent years, this distributed environment has shifted from a standard two-tier client/server environment to a three-tier client/server architecture. This newer client/server architecture introduces three well-defined and separate processes, each typically running on a different platform. A "first tier" provides the user interface, which runs on the user's computer (i.e., the client). Next, a "second tier" provides the functional modules that actually process data. This middle tier typically runs on a server, often called an "application server." A "third tier" furnishes a database management system (DBMS) that stores the data required by the middle tier. This tier may run on a second server called the "database server." Three-tier database systems are well documented in the patent and trade literature, see e.g., commonly-owned U.S. Pat. No. 6,266,666, entitled "Component transaction server for developing and deploying transaction-intensive business applications," the disclosure of which is hereby incorporated by reference.
More recently, the first tier (or client) for many three-tier systems is accessing the second-tier application server through the Internet, typically using a Web browser, such as Netscape Navigator or Microsoft Internet Explorer. Increasingly, applications running on these systems provide for business-to-business or business-to-consumer interaction via the Internet between the organization hosting the application and its business partners and customers. Many organizations receive and transmit information to business partners and customers through the Internet.
A considerable portion of the information received or exchanged is in Extensible Markup Language or "XML" format. XML is a pared-down version of SGML, designed especially for Web documents, which allows designers to create their own customized tags, enabling the definition, transmission, validation, and interpretation of data between applications and between organizations. For further description of XML, see e.g., "Extensible Markup Language (XML) 1.0," (2nd Edition, Oct. 6, 2000) a recommended specification from the W3C, the disclosure of which is hereby incorporated by reference. A copy of this specification is currently available on the Internet at http://www.w3.org/TR/2000/REC-xml-20001006. Many organizations utilize XML to exchange data with other remote users over the Internet.
As an increasing amount of information is in XML format, users want to be able to search this XML information in an efficient manner. However, this XML data is not in a format that can be easily stored and searched in current database systems. Most XML data is sent and stored in plain text format. This data is not formatted in tables and rows like information stored in a relational DBMS. To search this semi-structured data, users typically utilize keyword searches similar to those utilized by many current Internet search engines. These keyword searches are resource-intensive and are not as efficient as relational DBMS searches of structured data. For example, a user may perform a keyword search for "Harrison Ford." This keyword search would not only return information regarding the actor with this name, but would also typically return information on Ford automobiles or dealerships owned by Harrison.
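The difference between a keyword search and a structure-aware search can be illustrated with a minimal sketch. The two sample documents, their tag names, and the helper functions keyword_search and element_search are hypothetical and chosen only to mirror the "Harrison Ford" example; they are not part of any described system.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents; names and tags are illustrative only.
DOCS = {
    "film.xml": "<film><title>Witness</title><actor>Harrison Ford</actor></film>",
    "dealer.xml": "<dealership><owner>Harrison</owner><brand>Ford</brand></dealership>",
}

def keyword_search(docs, *keywords):
    """Return names of documents whose raw text contains every keyword."""
    return sorted(name for name, text in docs.items()
                  if all(k in text for k in keywords))

def element_search(docs, tag, value):
    """Return names of documents with an element <tag> whose text equals value."""
    hits = []
    for name, text in docs.items():
        root = ET.fromstring(text)
        if any(el.text == value for el in root.iter(tag)):
            hits.append(name)
    return sorted(hits)

# The keyword search matches both documents, including the dealership.
keyword_hits = keyword_search(DOCS, "Harrison", "Ford")
# The structure-aware search matches only the document about the actor.
element_hits = element_search(DOCS, "actor", "Harrison Ford")
```

Because the second search uses the element name as context, the dealership document is excluded even though it contains both words.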
Given the increasing use of XML in recent years, many organizations now have considerable quantities of data in XML format, including Web documents, newspaper articles, product catalogs, purchase orders, invoices, and product plans. To extract and store this XML data, developers currently utilize Document Object Model ("DOM") processors. DOM itself represents a specification of how objects in HTML and XML documents (text, images, headers, links, and the like) are represented. These DOM processors are very application-specific as developers typically write specific drivers for specific applications and types of data. For example, one particular processor may deal primarily with large static documents and catalogs, while another processor may handle smaller and more dynamic data such as orders, invoices, messages, and other types of data.
Documents and catalogs can be quite large and contain a lot of information, but are generally static. Searching large catalogs using these existing DOM processors is inefficient, as it requires reparsing the document and searching by text keyword. Both of these processes are very resource-intensive. Orders, invoices, and messages are usually smaller, but more frequent in number and more dynamic. In the case of these types of data, there is a need to be able to efficiently locate the appropriate item from a large group of similar items. For example, a user may wish to find invoices sent to a particular customer without having to search through all of the invoices in the system.
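As a rough illustration of locating one customer's invoices without scanning every invoice at query time, the following sketch builds a lookup structure once and reuses it. The invoice layout, the customer names, and the functions build_customer_index and invoices_for are assumptions made for this example only.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical invoices; the element and attribute names are illustrative.
INVOICES = [
    '<invoice id="1"><customer>Acme</customer><total>120.00</total></invoice>',
    '<invoice id="2"><customer>Globex</customer><total>75.50</total></invoice>',
    '<invoice id="3"><customer>Acme</customer><total>19.99</total></invoice>',
]

def build_customer_index(invoices):
    """Parse each invoice once and map customer name -> invoice positions."""
    index = defaultdict(list)
    for position, text in enumerate(invoices):
        customer = ET.fromstring(text).findtext("customer")
        index[customer].append(position)
    return index

def invoices_for(index, invoices, customer):
    """Return invoice ids for one customer, touching only the indexed entries."""
    return [ET.fromstring(invoices[i]).get("id") for i in index.get(customer, [])]

customer_index = build_customer_index(INVOICES)
acme_ids = invoices_for(customer_index, INVOICES, "Acme")
```

The parsing cost is paid once when the index is built; later lookups read only the invoices recorded for the requested customer.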
Another category of semi-structured information currently handled by these DOM processors is "transitional RDBMS data." Transitional RDBMS data is information that may be stored by an organization in a relational DBMS, but that is exchanged with another company (or another group within the same company) in XML format. For example, Company A sends certain information from its product database to Company B in XML format. Existing systems provide no effective mechanisms for Company B to store this XML data in Company B's relational DBMS or to search this information in the same fashion as other data in Company B's relational DBMS.
Organizations with data in XML format also typically have other enterprise data stored in a structured format in a relational DBMS. Another current problem is that the many existing relational DBMS applications used by these organizations cannot easily access both the structured data stored in these databases and XML or other unstructured or semi-structured data. In addition, current systems do not enable searches of XML data using established Structured Query Language (SQL) queries and search methodologies.
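To make the goal concrete, the sketch below loads product data received as XML into a relational table and then searches it with an ordinary SQL query. It uses Python's built-in sqlite3 module purely as a stand-in relational database; the table layout, the sample data, and the load_products helper are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical product data as it might arrive from another company.
PRODUCTS_XML = """
<products>
  <product sku="A-100"><name>Widget</name><price>9.50</price></product>
  <product sku="B-200"><name>Gadget</name><price>24.00</price></product>
  <product sku="C-300"><name>Gizmo</name><price>3.25</price></product>
</products>
"""

def load_products(conn, xml_text):
    """Create a product table and insert one row per <product> element."""
    conn.execute("CREATE TABLE product (sku TEXT PRIMARY KEY, name TEXT, price REAL)")
    rows = [(p.get("sku"), p.findtext("name"), float(p.findtext("price")))
            for p in ET.fromstring(xml_text).iter("product")]
    conn.executemany("INSERT INTO product VALUES (?, ?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
loaded = load_products(conn, PRODUCTS_XML)

# Once stored relationally, the data can be searched like any other table.
cheap = [row[0] for row in conn.execute(
    "SELECT name FROM product WHERE price < ? ORDER BY name", (10.0,))]
```

This hand-written mapping is exactly the kind of application-specific code the background describes; it is shown only to clarify what storing and querying XML data relationally means.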
As yet another problem, current DOM processing systems are also cumbersome to maintain. With these existing DOM processors, new code typically must be written for every new addition to the XML definition. There is currently no flexible, repeatable solution for efficiently extracting, storing, and searching XML data. As a result, the advantages of XML have yet to be realized in the enterprise.
What is needed is an improved database system with built-in support for performing several key tasks in handling and managing XML content. Such a system should be able to decompose and extract data in XML format, and do so in a manner that permits full utilization of such data within a business's enterprise software systems. For example, with a purchase order, a user should have access to a database system that is able to separate and process product information, customer information (such as the customer name and shipment address) and pricing. The solution should enable users to search this transformed XML data using well-known database search tools and methodologies, rather than requiring the use of less efficient keyword searches. Additionally, the solution should also enable users to recompose the XML data in order to recreate the original document, message or object and its context. The present invention fulfills these and other needs.
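The decompose and recompose steps described for a purchase order can be sketched as follows. The order layout, the field names, and the decompose and recompose functions are illustrative assumptions, not the mechanism of the invention.

```python
import xml.etree.ElementTree as ET

# Hypothetical purchase order; element and attribute names are illustrative.
ORDER_XML = (
    '<order id="PO-7">'
    '<customer><name>Acme</name><shipTo>1 Main St</shipTo></customer>'
    '<item sku="A-100" qty="2" price="9.50"/>'
    '<item sku="C-300" qty="4" price="3.25"/>'
    '</order>'
)

def decompose(xml_text):
    """Split an order into customer information and line items."""
    root = ET.fromstring(xml_text)
    customer = {child.tag: child.text for child in root.find("customer")}
    items = [dict(item.attrib) for item in root.iter("item")]
    return {"id": root.get("id"), "customer": customer, "items": items}

def recompose(parts):
    """Rebuild an equivalent order document from the decomposed parts."""
    root = ET.Element("order", {"id": parts["id"]})
    customer = ET.SubElement(root, "customer")
    for tag, text in parts["customer"].items():
        ET.SubElement(customer, tag).text = text
    for item in parts["items"]:
        ET.SubElement(root, "item", item)
    return ET.tostring(root, encoding="unicode")

parts = decompose(ORDER_XML)
# Pricing can now be processed separately from customer information.
order_total = sum(int(i["qty"]) * float(i["price"]) for i in parts["items"])
round_trip = recompose(parts)
```

The recomposed text may differ from the original in insignificant serialization details, but decomposing it again yields the same parts.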
The following definitions are offered for purposes of illustration, not limitation, in order to assist with understanding the discussion that follows.
DOM: DOM is short for Document Object Model, the specification for how objects in a Web page (text, images, headers, links, etc.) are represented. The Document Object Model defines what attributes are associated with each object, and how the objects and attributes can be manipulated. Dynamic HTML (DHTML) relies on the DOM to dynamically change the appearance of Web pages after they have been downloaded to a user's browser. For further information on DOM, see e.g., "Document Object Model (DOM) Level 3 Core Specification, Version 1.0," World Wide Web Consortium Working Draft (Sep. 13, 2001), the disclosure of which is hereby incorporated by reference. A copy of this draft specification is currently available from the World Wide Web Consortium (W3C) via the Internet at http://www.w3.org/DOM.
HTML: HTML stands for HyperText Markup Language. Every HTML document requires certain standard HTML tags in order to be correctly interpreted by Web browsers. Each document consists of head and body text. The head contains the title, and the body contains the actual text that is made up of paragraphs, lists, and other elements. Browsers expect specific information because they are programmed according to HTML and SGML specifications. Further description of HTML documents is available in the technical and trade literature, see e.g., Duncan, R. "Power Programming: An HTML Primer," PC Magazine, Jun. 13, 1995, the disclosure of which is hereby incorporated by reference.
HTTP: HTTP is the acronym for HyperText Transfer Protocol, which is the underlying communication protocol used by the World Wide Web on the Internet. HTTP defines how messages are formatted and transmitted, and what actions Web servers and browsers should take in response to various commands. For example, when a user enters a URL in his or her browser, this actually sends an HTTP command to the Web server directing it to fetch and transmit the requested Web page. Further description of HTTP is available in "RFC 2616: Hypertext Transfer Protocol -- HTTP/1.1," the disclosure of which is hereby incorporated by reference. RFC 2616 is available from the W3C, and is currently available via the Internet at http://www.w3.org/Protocols/. Additional description of HTTP is available in the technical and trade literature, see e.g., Stallings, W. "The Backbone of the Web," BYTE, October 1996, the disclosure of which is hereby incorporated by reference.
Java: Java is a general purpose programming language developed by Sun Microsystems. Java is an object-oriented language similar to C++, but simplified to eliminate language features that cause common programming errors. Java source code files (files with a .java extension) are compiled into a format called bytecode (files with a .class extension), which can then be executed by a Java interpreter. Compiled Java code can run on most computers because Java interpreters and runtime environments, known as Java Virtual Machines (JVMs), exist for most operating systems, including UNIX, the Macintosh OS, and Windows. Bytecode can also be converted directly into machine language instructions by a just-in-time (JIT) compiler. Further description of the Java Language environment can be found in the technical, trade, and patent literature; see e.g., Gosling, J. et al., "The Java Language Environment: A White Paper," Sun Microsystems Computer Company, October 1995, the disclosure of which is hereby incorporated by reference.
Meta data: Meta data is data about data. Meta data describes how a particular set of data was collected, and how the data is formatted. Meta data may also describe when data was collected and by whom it was collected. Meta data is very useful for understanding information stored in data warehouses.
SGML: SGML stands for Standard Generalized Markup Language, a system for organizing and tagging elements of a document. SGML was developed and standardized by the International Organization for Standardization (ISO), see e.g., International Organization for Standardization, ISO 8879: "Information processing -- Text and office systems -- Standard Generalized Markup Language (SGML)," ([Geneva]: ISO, 1986), the disclosure of which is hereby incorporated by reference. SGML itself does not specify any particular formatting; rather, it specifies the rules for tagging elements. These tags can then be interpreted to format elements in different ways. For an introduction to SGML, see e.g., "A Gentle Introduction to SGML," 1995, chapter 2 of "Guidelines for Electronic Text Encoding and Interchange (TEI)" produced by the Text Encoding Initiative, the disclosure of which is hereby incorporated by reference. A copy of "A Gentle Introduction to SGML" is currently available via the Internet at http://www.uic.edu/orgs/tei/sgml/teip3sg/SG.htm.
SQL: SQL stands for Structured Query Language, which has become the standard for relational database access, see e.g., Melton, J. (ed.), "American National Standard ANSI/ISO/IEC 9075-2: 1999, Information Systems -- Database Language -- SQL Part 2: Foundation," the disclosure of which is hereby incorporated by reference. For additional information regarding SQL in database systems, see e.g., Date, C., "An Introduction to Database Systems, Volume I and II," Addison Wesley, 1990, the disclosure of which is hereby incorporated by reference.
TCP: TCP stands for Transmission Control Protocol. TCP is one of the main protocols in TCP/IP networks. Whereas the IP protocol deals only with packets, TCP enables two hosts to establish a connection and exchange streams of data. TCP guarantees delivery of data and also guarantees that packets will be delivered in the same order in which they were sent. For an introduction to TCP, see e.g., RFC 793, the disclosure of which is hereby incorporated by reference. A copy of RFC 793 is currently available at http://www.ietf.org.
TCP/IP: TCP/IP stands for Transmission Control Protocol/Internet Protocol, the suite of communications protocols used to connect hosts on the Internet. TCP/IP uses several protocols, the two main ones being TCP and IP. TCP/IP is built into the UNIX operating system and is used by the Internet, making it the de facto standard for transmitting data over networks. For an introduction to TCP/IP, see e.g., "RFC 1180: A TCP/IP Tutorial," the disclosure of which is hereby incorporated by reference. A copy of RFC 1180 is currently available at ftp://ftp.isi.edu/in-notes/rfc1180.txt.
URL: URL is an abbreviation of Uniform Resource Locator, the global address of documents and other resources on the World Wide Web. The first part of the address indicates what protocol to use, and the second part specifies the IP address or the domain name where the resource is located.
XML: XML stands for Extensible Markup Language, a specification developed by the W3C. XML is a pared-down version of SGML, designed especially for Web documents. It allows designers to create their own customized tags, enabling the definition, transmission, validation, and interpretation of data between applications and between organizations. For further description of XML, see e.g., "Extensible Markup Language (XML) 1.0," (2nd Edition, Oct. 6, 2000) a recommended specification from the W3C, the disclosure of which is hereby incorporated by reference. A copy of this specification is currently available on the Internet at http://www.w3.org/TR/2000/REC-xml-20001006.
XQL: XQL refers to a standard XML Query Language proposed to the XSL working group of the W3C in 1998. For further description of the proposal, see e.g., "XML Query Language (XQL)," a W3C working draft (Jun. 7, 2001), the disclosure of which is hereby incorporated by reference. This draft specification is available from the W3C and is currently available via the Internet at http://www.w3.org/TandS/QL/QL98/pp/xql.html. Currently, XQL is the most commonly used language for querying XML documents.
The present invention provides a system including methods enabling data in Extensible Markup Language ("XML") format to be extracted, transformed, and persistently stored in a relational database. This extraction and transformation process is generalized, and can be used on various types of data from various sources. During the process of extraction and transformation of XML data, the present invention creates and uses meta data structures to enable faster access to the XML data.
The XML Query Support Engine of the present invention includes an XML Store Engine, a Path Processor and an XQL Engine. The XML Store Engine includes parse time functionality that transforms each XML document into a collection of bytes, called "SybXMLData," that can be stored in a database or file system. Furthermore, a streaming interface over this SybXMLData, called "SybXMLStream," is defined to provide fast, random access to the structures within it. In this document the terms "SybXMLData" and "SybXMLStream" are used interchangeably to refer to this data and streaming interface. The SybXMLStream includes a fast access structure, which is a flexible, persistent interface that enables free movement amongst, and efficient access to, the underlying XML data. The XML Store Engine also has query execution-time functionality for retrieving data in response to query plans specified by the XQL Engine. It enables greater efficiency as only the relevant portions of the underlying XML data are brought into memory in response to a query. The system also enables the original XML document (or a portion thereof) to be recomposed when required.
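The general idea of parsing a document once into a byte collection with an accompanying access structure can be sketched as follows. This is not the SybXMLData format, whose layout is not described in this section; the flat buffer, the path-to-offset index, and the functions build_store and read_values are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET

def build_store(xml_text):
    """Flatten element text into one byte buffer plus a path -> (offset, length) index."""
    data = bytearray()
    index = {}

    def visit(element, path):
        text = (element.text or "").strip()
        if text:
            encoded = text.encode("utf-8")
            index.setdefault(path, []).append((len(data), len(encoded)))
            data.extend(encoded)
        for child in element:
            visit(child, path + "/" + child.tag)

    root = ET.fromstring(xml_text)
    visit(root, "/" + root.tag)
    return bytes(data), index

def read_values(data, index, path):
    """Decode only the byte ranges recorded for the given path."""
    return [data[off:off + n].decode("utf-8") for off, n in index.get(path, [])]

# Hypothetical catalog used only to exercise the sketch.
CATALOG = ("<catalog><book><title>Dune</title><price>7.99</price></book>"
           "<book><title>Emma</title><price>5.49</price></book></catalog>")
data, index = build_store(CATALOG)
titles = read_values(data, index, "/catalog/book/title")
```

Answering the title query reads two short byte ranges and never revisits the price values, which loosely mirrors the stated benefit of bringing only the relevant portions of the data into memory.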
The Path Processor serves as an interface between the XML Store Engine and the XQL Engine. The Path Processor abstracts the interactions with the XML Store Engine to a higher level, enabling the XQL Engine (as well as other different query engines) to more easily access data from the XML Store Engine.
The XQL Engine of the present invention uses a query language known as XQL, which enables querying of XML data without the need to write custom application-specific navigation code to search different types of XML data. The XQL Engine parses and translates queries into a structure that can be executed against the XML Store Engine.
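The division of labor between translating a query into an executable structure and running that structure against stored data can be sketched with a deliberately tiny path language. The syntax below (slash-separated steps with an optional [child='value'] filter) is a simplification invented for this example and is not the XQL grammar; parse_query, execute, and the sample catalog are likewise hypothetical. Filter values containing a slash are not supported.

```python
import re
import xml.etree.ElementTree as ET

# One step: a tag name, optionally followed by [child='value'].
STEP = re.compile(r"^(\w+)(?:\[(\w+)='([^']*)'\])?$")

def parse_query(query):
    """Translate "a/b[c='v']/d" into a list of (tag, filter) steps."""
    plan = []
    for part in query.strip("/").split("/"):
        match = STEP.match(part)
        if match is None:
            raise ValueError(f"unsupported step: {part!r}")
        tag, child, value = match.groups()
        plan.append((tag, (child, value) if child else None))
    return plan

def _keep(element, tag, flt):
    """True if the element has the tag and satisfies the optional filter."""
    if element.tag != tag:
        return False
    return flt is None or element.findtext(flt[0]) == flt[1]

def execute(plan, root):
    """Run the plan step by step; the first step must match the root element."""
    first_tag, first_filter = plan[0]
    current = [root] if _keep(root, first_tag, first_filter) else []
    for tag, flt in plan[1:]:
        current = [child for node in current for child in node
                   if _keep(child, tag, flt)]
    return [node.text for node in current]

CATALOG = ET.fromstring(
    "<catalog>"
    "<book><author>Herbert</author><title>Dune</title></book>"
    "<book><author>Austen</author><title>Emma</title></book>"
    "</catalog>"
)
plan = parse_query("catalog/book[author='Austen']/title")
result = execute(plan, CATALOG)
```

Keeping the plan as plain data separates query translation from data access, so the same plan format could in principle be executed by a different storage layer.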
Utilizing the present invention, XML data can be stored once and queried many times. The XML Query Support Engine enables XML data to be searched using standard database query methodologies, rather than inefficient keyword text searches. The system is scalable and is also easier to maintain, as it does not need to be rewritten each time the XML definition is changed. In the currently preferred embodiment, the XML Query Support Engine is written in Java, enabling the system to be deployed on a number of different platforms. Moreover, the system provides access to data that may reside on different machines in multiple locations, as the invocation of the XML Query Support Engine is independent of document location.