1. Field of the Invention
The present invention relates in general to computer-implemented database systems, and, in particular, to an extender for a computer-implemented relational database system for storing, querying, and retrieving structured documents. The present invention further relates to other features of an extender for a computer-implemented relational database system, including indexing of structured documents with general and rich data types, querying structured documents using a novel conditional select function, and creating structure indexes using a novel tag counting system.
2. Description of the Related Art
HyperText Markup Language (HTML) has been the standard format for delivering information on the World Wide Web (WWW). HTML documents are very well suited for Web browsing, but automated processing of the information they contain can be difficult because few semantics are associated with the documents. In a programming language, program semantics are defined by a standardized set of keywords. HTML likewise has a limited set of keywords (i.e., tags), but they are mainly for presentation purposes, not for semantics associated with document contents. For example, without human understanding or a sophisticated program, it is difficult to know what the number "1991" means in an HTML document; it could be a year, a quantity, or a word with some other meaning.
In response to growing concerns about HTML's versatility, Extensible Markup Language (XML), which is a subset of Standard Generalized Markup Language (SGML), has been proposed to the World Wide Web Consortium (W3C) as the next standard format. XML is a meta-language, allowing a user to design a customized markup language for many classes of structured documents. XML supports user-defined tags for better description of nested document structures and associated semantics, and encourages separation of document contents from browser presentation. For interoperability, domain-specific tags, called a vocabulary, can be standardized, so that applications in that particular domain understand the meaning of the tags. Various vocabularies for different domains have been proposed in the SGML community, such as Electronic Data Interchange (EDI) for banking exchange, Standard Music Description Language (SMDL) for music, or Chemical Markup Language (CML) for chemistry. Recently, vocabularies have also been proposed in the XML community, for example the Channel Definition Format (CDF) for channels. XML removes the dependence on a single, inflexible document type (i.e., HTML), while avoiding the complexity of full SGML.
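As a minimal sketch of how user-defined tags attach meaning to a bare value such as "1991", the following Python fragment parses a small XML document with the standard library. The tag names (`book`, `year`, `copies`) are an illustrative, hypothetical vocabulary and are not taken from any standardized one.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: these tag names are illustrative assumptions only.
DOC = """
<book>
  <title>Example Title</title>
  <year>1991</year>
  <copies>1991</copies>
</book>
"""

def values_by_tag(xml_text):
    """Map each child element's tag to its text, so the tag supplies the meaning."""
    root = ET.fromstring(xml_text.strip())
    return {child.tag: child.text for child in root}

fields = values_by_tag(DOC)
# The same characters "1991" are a year in one element and a quantity in another.
print(fields["year"], fields["copies"])
```

In HTML both values would typically appear as undifferentiated text inside presentation tags; here a program can tell them apart by element name alone.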
Structured documents are documents which have nested structures. XML documents are structured documents. The challenge has been to store, search, and retrieve these documents using the existing business database systems. Assuming a need to manage an abundance of XML documents, in particular within intranets (within a business) and extranets (between businesses), where documents are more likely to be regularly structured, there is clearly a need for a product that understands document structures and allows a user to store, search using structure queries, and retrieve XML documents within the database system.
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. Relational databases are organized into tables, which consist of rows and columns of data. The rows are formally called tuples. A database will typically have many tables and each table will typically have multiple tuples and multiple columns. The tables are typically stored on direct access storage devices (DASD), such as magnetic or optical disk drives for semi-permanent storage.
Among existing products providing storage and retrieval capabilities are those distributed by Oracle® Corporation and POET Software Corporation. The Oracle 8i XML support system stores each element of an XML document in a different table within the database system. The software product marketed by POET Software Corporation breaks down XML documents into objects and stores them in an object-oriented database added to the existing database that a business uses. Both products create a burden on the management and maintenance of the database system. An application is needed that would efficiently use the existing resources of a business to store and retrieve XML documents.
With respect to search capabilities, current search engines either flatten out the structure of a document (i.e., remove the nested structures), or have limited, predefined structures (such as paragraphs and sentences, according to some predefined punctuation marks). Therefore, there also is a need for an application capable of evaluating general ad hoc structure queries.
To add even richer semantics to XML documents, proposals to the W3C have suggested adding data types to XML documents and associating these data types with XML elements and attributes. This would allow users to ask "range queries" requiring numeric value comparisons among elements in an XML document. These queries normally require B+ tree index structures residing in databases. However, processing such queries is beyond the capabilities of most information retrieval systems and search engines based on inverted indices, and providing support for B+ tree index structures in these systems is very expensive. Therefore, there is a further need for an application that uses existing B+ tree index structures, already implemented in the database system, to support indexes for structured documents with rich data types.
Several approaches have been adopted to solve this problem and to perform searches on structured documents with rich data types. For example, one proposed alternative is to implement the B+ tree index structures inside the text search engine and then to perform the search. However, this approach is very expensive to implement. Another approach involves the creation of actual tables having columns that store attributes of XML documents. An index can be created on the columns, and this index can support searches. This approach wastes space, and the extra table cannot be maintained efficiently. In yet another approach, the user creates an additional table, called a summary table, storing all attributes existing in the XML documents. Although the problem of maintaining the table is somewhat solved because the database manager usually maintains the summary table, the waste of space is still burdensome.
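The side-table approach described above can be sketched as follows. This is an illustration under stated assumptions, not any vendor's implementation: SQLite (through Python's `sqlite3` module) stands in for the relational database, its built-in ordered index stands in for a B+ tree index, and the table names, column names, and sample documents are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample documents keyed by document id.
DOCS = {
    1: "<book><title>A</title><year>1989</year></book>",
    2: "<book><title>B</title><year>1991</year></book>",
    3: "<book><title>C</title><year>1995</year></book>",
}

def build_side_table(docs):
    """Store each document, then copy one typed value into an indexed side table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
    # Extra table duplicating a value that already exists inside the document.
    conn.execute("CREATE TABLE book_year (doc_id INTEGER, year INTEGER)")
    conn.execute("CREATE INDEX idx_book_year ON book_year (year)")
    for doc_id, body in docs.items():
        conn.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, body))
        year = int(ET.fromstring(body).findtext("year"))
        conn.execute("INSERT INTO book_year VALUES (?, ?)", (doc_id, year))
    return conn

def docs_in_year_range(conn, low, high):
    """Range query answered from the side table rather than from the documents."""
    rows = conn.execute(
        "SELECT doc_id FROM book_year WHERE year BETWEEN ? AND ? ORDER BY doc_id",
        (low, high),
    )
    return [row[0] for row in rows]

conn = build_side_table(DOCS)
print(docs_in_year_range(conn, 1990, 1996))  # prints [2, 3]
```

The sketch makes the two drawbacks concrete: every indexed value is stored twice (once in the document body, once in `book_year`), and the application itself must keep the side table in step with the documents on every insert, update, or delete.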
To overcome the limitations in the prior art described above, and to solve various problems that will become apparent upon reading and understanding of the present specification, it is one object of the present invention to provide a method, apparatus and article of manufacture for computer-implemented storage, searching, and retrieval of structured documents in a relational database system.
The present invention is directed to relational extenders for a computer-implemented relational database system. These relational extenders are entities created to help relational database users handle complex data types. Relational extenders define and implement new complex data types, storing the attributes, structure, and behavior of the data types in a column of a relational database table. The complex data types stored in relational databases allow new applications to be run and/or extend existing business applications. Within the relational database system, these data types can be manipulated through the standard Structured Query Language (SQL). As a result, relational extenders provide good management solutions for handling any type of data.
In accordance with the present invention, an XML extender for a computer-implemented relational database system is disclosed for storing, querying, and retrieving structured documents. Generally, relational extenders define and implement complex data types and extend the tables within the relational database with the new data types. The XML extender provides a new Abstract Data Type (ADT) DB2XML, which can be specified as a column data type, and includes several User Defined Functions (UDFs) for storing, searching, and retrieving XML documents internally, as DB2® Character Based Large Objects (CLOB), or externally, in flat files or Uniform Resource Locators (URLs), for example.
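The general idea of storing XML documents in a table column and searching them through a user-defined function called from SQL can be sketched as follows. This is only an analogy under stated assumptions: SQLite (via Python's `sqlite3` module) stands in for the relational database, a plain `TEXT` column stands in for a CLOB, and the function name `xml_path_equals`, the table `orders`, and the sample documents are hypothetical. None of this is the DB2XML data type or the extender's actual UDFs.

```python
import sqlite3
import xml.etree.ElementTree as ET

def xml_path_equals(doc, path, value):
    """Hypothetical UDF: return 1 if any element at `path` has text equal to `value`."""
    try:
        root = ET.fromstring(doc)
    except ET.ParseError:
        return 0  # Treat unparseable column content as a non-match.
    return int(any((el.text or "").strip() == value for el in root.findall(path)))

conn = sqlite3.connect(":memory:")
# Register the Python function so that SQL statements can call it by name.
conn.create_function("xml_path_equals", 3, xml_path_equals)
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, doc TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "<order><customer><name>Ada</name></customer></order>"),
        (2, "<order><customer><name>Bob</name></customer></order>"),
    ],
)

# A structure query expressed inside an ordinary SELECT statement.
matches = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM orders "
        "WHERE xml_path_equals(doc, 'customer/name', 'Ada') = 1"
    )
]
print(matches)  # prints [1]
```

Because the function is evaluated row by row, this sketch scans every document; it shows only how a structural predicate can be integrated into SQL, not how such a search would be indexed.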
Another object of the present invention is to provide an application for storing XML documents in existing or newly created columns of a relational database table or in external files.
Yet another object of the invention is to provide an application for searching XML documents using SQL structure queries.
Still another object of the invention is to use such an application for searching the content and attribute values of elements of an XML document based on a specified sequence of such elements.
A further object of the invention is to use such an application for searching XML documents stored in external files or URLs as if they were stored in DB2®.
Yet another object of the invention is to use such an application for retrieving XML documents by integrating structural search capabilities into DB2®'s SELECT queries.
Yet another object of the invention is to provide an application for creating and supporting an index for structured documents with rich data types using index structures residing in the DB2® database.
A further object of the invention is to provide an application for creating and supporting structure indexes for the XML extender using a tag counting system for counting and storing occurrences of elements of an XML document.
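The text at this point states only that the tag counting system counts and stores occurrences of elements of an XML document. As a loose illustration of counting element occurrences, and not a description of the structure index itself, the following sketch tallies how often each tag appears in one document. The function name `count_tags` and the sample document are hypothetical.

```python
from collections import Counter
import xml.etree.ElementTree as ET

def count_tags(xml_text):
    """Count how many times each element tag occurs in a document (assumed scheme)."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(el.tag for el in root.iter())

SAMPLE = (
    "<book>"
    "<chapter><para/><para/></chapter>"
    "<chapter><para/></chapter>"
    "</book>"
)

counts = count_tags(SAMPLE)
print(counts["book"], counts["chapter"], counts["para"])  # prints 1 2 3
```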