1. Field of the Invention
This invention relates in general to computer-implemented database systems, and, in particular, to a method and apparatus for creating an index in a database system for efficient execution of structured queries.
2. Description of the Related Art
HyperText Markup Language (HTML) has been the standard format for delivering information on the World Wide Web (WWW). However, HTML has only a limited set of tags for specifying document structures, and these tags are mainly for the purposes of browser presentation. Automated information processing on these documents for data exchange and interoperability has been difficult. Extensible Markup Language (XML), which is a subset of Standard Generalized Markup Language (SGML), has been proposed to be the next standard format that allows user-defined tags for better describing nested document structures and associated semantics.
People are still learning how to make effective use of the flood of information available on the Internet and intranets. Thanks to current search engines, queries may yield very extensive results that may contain the needed information from sites all over the world. A search engine has several main functions: information gathering, indexing, categorization, and searching. Information gathering usually uses Web crawlers to send visited pages to the index engine. The index mechanism normally uses some form of inverted file and, when given a word, returns a list of references to the pages that contain the word. Categorization, or clustering, tries to categorize the pages according to some attributes, such as topics. The searching function allows the user to pose content-based queries and receive ranked result sets.
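The inverted-file mechanism described above can be illustrated with a minimal sketch. This is not the indexing scheme of any particular search engine; the function name build_inverted_index, the page identifiers, and the sample text are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each lowercased word to the sorted list of page ids that contain it.

    `pages` is a hypothetical dict of page id -> page text.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for raw_word in text.lower().split():
            # Trim common punctuation so "documents." and "documents" match.
            word = raw_word.strip('.,;:!?"()')
            if word:
                index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample pages, standing in for pages sent by a Web crawler.
pages = {
    "page1": "Indexing structured documents",
    "page2": "Searching documents on the Web",
}
index = build_inverted_index(pages)

# Given a word, the index returns the references (page ids) that contain it.
print(index.get("documents", []))
```

Note that such an index records only which pages contain a word; it retains no information about where the word sits within the structure of a page.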
While HTML documents serve very well for Web browsing, automated information processing on them can be difficult, because few semantics are associated with the documents. For example, without human understanding or a sophisticated program, it is difficult to know what the number "1991" means in an HTML document; it could be a year, a quantity, or anything else. In a programming language, program semantics are defined by a standardized set of keywords. HTML likewise has a limited set of keywords (i.e., tags), but they are mainly for presentation purposes, not for conveying the semantics of document contents.
To automate Web information processing, in particular for data exchange and interoperability, XML has been proposed to the World Wide Web Consortium (W3C) as a new markup language that supports user-defined tags and encourages the separation of document contents from presentation. XML is a metalanguage that allows the user to define a language for composing structured documents. With XML, the user can define any desired tags for better structuring of documents (although adding misleading tags is also possible). For interoperability, a set of domain-specific tags, called a vocabulary, can be standardized, so that applications in that domain understand the meaning of the tags. Various vocabularies for different domains have been proposed in the SGML community, such as Electronic Data Interchange (EDI) for banking exchange, Standard Music Description Language (SMDL) for music, or Chemical Markup Language (CML) for chemistry. Recently, vocabularies have been proposed in the XML community, for example the Channel Definition Format (CDF) for channels.
Structured documents are documents that can have nested structures. Assuming structured documents will be abundant, in particular within intranets and extranets (between businesses), where documents are more likely to be regularly structured, there is clearly a need for a search engine that understands document structures and allows a user to pose structured queries. Current search engines either flatten out the structure of a document (i.e., remove nested structures), or have limited, predefined structures (such as paragraphs and sentences, according to some predefined punctuation marks), and thus are not capable of evaluating general ad hoc structured queries. Structured documents also enable comparisons among numeric values, for example, to get the references published after the year 1991 from a structured paper (which is not possible with an inverted-file-based search engine).
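The numeric comparison mentioned above can be sketched as a structured query over a small XML document. The tag names (paper, references, reference, author, year), the sample data, and the function references_after are all hypothetical; the sketch scans the document directly using Python's standard xml.etree.ElementTree module and is not an index-based evaluation.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured paper. "1991" also appears in the title, where it
# is plain text rather than a publication year.
PAPER = """
<paper>
  <title>Indexing in 1991</title>
  <references>
    <reference><author>Smith</author><year>1989</year></reference>
    <reference><author>Jones</author><year>1991</year></reference>
    <reference><author>Lee</author><year>1995</year></reference>
  </references>
</paper>
"""


def references_after(xml_text, year):
    """Return the authors of references whose <year> is greater than `year`."""
    root = ET.fromstring(xml_text)
    authors = []
    for ref in root.iter("reference"):
        # The <year> tag tells us this number is a year, so it can be
        # compared numerically instead of being matched as a word.
        if int(ref.findtext("year")) > year:
            authors.append(ref.findtext("author"))
    return authors


recent = references_after(PAPER, 1991)
print(recent)
```

Because the query is scoped to the year element nested inside each reference, the occurrence of "1991" in the title does not affect the result, which a flat word lookup could not guarantee.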
A successful search engine for a large repository of structured documents relies on good indexing schemes. Therefore, there is a need in the art for indexes that support structured queries and allow the queries to be executed without resorting to the structured documents themselves.