The Extensible Markup Language (XML) is a universally accepted format for representing structured data in textual form. It is widely adopted in enterprise databases, and particularly in databases and applications connected to the World Wide Web. The manipulation and exchange of structured data, e.g., spreadsheets, address books, financial transactions, technical drawings, etc., is often challenging as the data is traditionally represented in platform or program dependent document formats. XML provides a set of rules and guidelines for designing text formats for such data; these XML text formats are unambiguous, platform-independent, and extensible.
An example of a simple XML document is provided as follows:
<record><name><first_name>John</first_name><last_name>Smith</last_name></name><address><street>123 Smith Drive</street><city>New York</city><state>New York</state></address></record>
Basic XML format includes tags with brackets, e.g., <city> begins a field and </city> ends a field. Thus, <city> New York </city> represents a field named “city” that contains the content “New York.” Fields can be nested, e.g., “city” is an element in the field “address,” as shown above. More complex syntax can be used for various types of data.
A key practical issue in realizing advantages afforded by XML is the need for an efficient search method. Easy data manipulation and exchange requires an effective method to handle computational intensive search operations for complex and concurrent queries, which are becoming common place in the use of networked enterprise databases and databases connected to the Internet.
Existing database management systems, such as relational database and object-oriented database systems, are generally equipped with mechanisms or facilities for rapidly retrieving selected records based on key fields in the database. Such facilities or mechanisms often depend upon the data and the schema, and therefore are specific to each database. A variety of complex data structures are implemented in databases to facilitate fast retrieval of data based on key fields; examples include binary trees, B-trees, and red-black trees. Additionally, various types of indices are built for certain key words or fields that are frequently queried in a database to enable fast searching on those words and fields.
Existing full-text indices allow rapid searches on any word in a body of text. They are commonly used by Internet search engines such as Hotbot and Alta Vista to enable a user to quickly identify a particular Web site. Although they vary considerably in their implementation, full-text indices essentially consist of a table of words in alphabetical order, with pointers or links to the corresponding locations of the words in a database or a file. Generally a full-text index also supports wildcard (represented by “*”) searches that locate words based on a partial match. For example, a search for “appl*” will find “apply,” “appliance,” etc.
Neither of these existing technologies provides an efficient way to search XML. Since XML represents structural data in a textual format, it lends itself only to a slow, sequential scan of the text in a search of a particular record. Standard full-text indexing provides only an incomplete solution because the field context of each word is not preserved. For example, a standard full-text index of the sample XML document above supports a search for “Smith,” but not for “Smith” only in the “address” field. That is, one cannot locate an address with “Smith” in it using a full-index search; such a search will find all records in any field that has “Smith” in it. Some full-text indexing systems have the ability to search for a word associated with a particular property or field of a document (such as “Author is John Smith”), but this still does not provide a way to search based on the structural context of a word in an XML file, which involves several nested field qualifiers.
Therefore, much needed is an improved full text indexing mechanism for searching XML data, which is capable of distinguishing between “Smith” in the last_name field and “Smith” in the street field, or between “New York” in the city field and “New York” in the state field. Such a mechanism should also preserve information on nested fields, so that the street field is recognized as an element within the address field, and the last_name field is recognized as an element of the name field. The queries such as “address contains New York” (search for any record that contains New York in the address field or any field under the address field) and “address/city contains New York” (search for any record that contains New York in the city field that is part of an address field) should rapidly retrieve the qualified records using such an improved indexing and searching mechanism. To make fast and effective searches possible, certain external data structures need to be constructed to preserve the inherent structure information in the XML data and to provide a short cut to locate particular items.
However, the current state of the art only provides limited alternatives for indexing and searching XML data. One approach is to create separate indices for each sub-fields, which preserves the structural information of the data but drastically increases the overhead and therefore is not desirable. Another approach is to use a directed graph to represent the nested fields. (Goldman R. et al., Lore: a database management system for XML, 2000) The search through a directed graph can be extremely computationally intensive and costly as the complexity of the data, hence complexity of the graph, grows. Both approaches result in an index structure whose complexity is comparable with that of the XML data itself. A more efficient and cost-saving indexing and searching method is desired.