The present invention relates to methods and systems for indexing and searching records stored in a language that uses nested fields, particularly the Extensible Markup Language (XML). In particular, the present invention relates to an improved method and a computerized system for indexing and searching documents and data in languages such as XML that utilize nested fields.
The Extensible Markup Language (XML) is a universally accepted format for representing structured data in textual form. It is widely adopted in enterprise databases, and particularly in databases and applications connected to the World Wide Web. The manipulation and exchange of structured data, e.g., spreadsheets, address books, financial transactions, technical drawings, etc., is often challenging because the data is traditionally represented in platform-dependent or program-dependent document formats. XML provides a set of rules and guidelines for designing text formats for such data; these XML text formats are unambiguous, platform-independent, and extensible.
An example of a simple XML document is provided as follows:
Basic XML format includes tags with brackets, e.g., <city> begins a field and </city> ends a field. Thus, <city>New York</city> represents a field named “city” that contains the content “New York.” Fields can be nested, e.g., “city” is an element in the field “address,” as shown above. More complex syntax can be used for various types of data.
A key practical issue in realizing the advantages afforded by XML is the need for an efficient search method. Easy data manipulation and exchange requires an effective method to handle computationally intensive search operations for complex and concurrent queries, which are becoming commonplace in the use of networked enterprise databases and databases connected to the Internet.
Existing database management systems, such as relational database and object-oriented database systems, are generally equipped with mechanisms or facilities for rapidly retrieving selected records based on key fields in the database. Such facilities or mechanisms often depend upon the data and the schema, and therefore are specific to each database. A variety of complex data structures are implemented in databases to facilitate fast retrieval of data based on key fields; examples include binary trees, B-trees, and red-black trees. Additionally, various types of indices are built for certain key words or fields that are frequently queried in a database to enable fast searching on those words and fields.
Existing full-text indices allow rapid searches on any word in a body of text. They are commonly used by Internet search engines such as Hotbot and Alta Vista to enable a user to quickly identify a particular Web site. Although they vary considerably in their implementation, full-text indices essentially consist of a table of words in alphabetical order, with pointers or links to the corresponding locations of the words in a database or a file. Generally a full-text index also supports wildcard (represented by “*”) searches that locate words based on a partial match. For example, a search for “appl*” will find “apply,” “appliance,” etc.
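By way of illustration only, the alphabetical word table and trailing-wildcard lookup described above might be sketched as follows. The function names, the whitespace tokenization, and the sample documents are assumptions made for this sketch and are not taken from any particular search engine.

```python
import bisect
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the sorted list of document ids containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for raw in text.lower().split():
            word = raw.strip('.,;:"')
            if word:
                postings[word].add(doc_id)
    words = sorted(postings)  # the alphabetical table of words
    return words, {w: sorted(ids) for w, ids in postings.items()}

def search(words, postings, pattern):
    """Exact lookup, or prefix lookup when the pattern ends with '*'."""
    pattern = pattern.lower()
    if not pattern.endswith("*"):
        return postings.get(pattern, [])
    prefix = pattern[:-1]
    # Binary search finds the first word that could carry the prefix.
    start = bisect.bisect_left(words, prefix)
    hits = set()
    for word in words[start:]:
        if not word.startswith(prefix):
            break  # sorted order: no later word can match
        hits.update(postings[word])
    return sorted(hits)

# Hypothetical documents, for illustration only.
documents = ["Apply within", "A kitchen appliance", "An apple a day", "Nothing here"]
words, postings = build_index(documents)
print(search(words, postings, "appl*"))  # [0, 1, 2]
```

Because the word table is sorted, all words sharing a prefix are adjacent, which is why a trailing wildcard can be resolved with one binary search followed by a short scan.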
Neither of these existing technologies provides an efficient way to search XML. Since XML represents structured data in a textual format, it lends itself only to a slow, sequential scan of the text in a search for a particular record. Standard full-text indexing provides only an incomplete solution because the field context of each word is not preserved. For example, a standard full-text index of the sample XML document above supports a search for “Smith,” but not for “Smith” only in the “address” field. That is, one cannot locate an address with “Smith” in it using a full-text index search; such a search will find all records that have “Smith” in any field. Some full-text indexing systems have the ability to search for a word associated with a particular property or field of a document (such as “Author is John Smith”), but this still does not provide a way to search based on the structural context of a word in an XML file, which involves several nested field qualifiers.
Therefore, much needed is an improved full-text indexing mechanism for searching XML data, which is capable of distinguishing between “Smith” in the last_name field and “Smith” in the street field, or between “New York” in the city field and “New York” in the state field. Such a mechanism should also preserve information on nested fields, so that the street field is recognized as an element within the address field, and the last_name field is recognized as an element of the name field. Queries such as “address contains New York” (search for any record that contains New York in the address field or any field under the address field) and “address/city contains New York” (search for any record that contains New York in the city field that is part of an address field) should rapidly retrieve the qualified records using such an improved indexing and searching mechanism. To make fast and effective searches possible, certain external data structures need to be constructed to preserve the inherent structure information in the XML data and to provide a shortcut to locate particular items.
However, the current state of the art provides only limited alternatives for indexing and searching XML data. One approach is to create a separate index for each sub-field, which preserves the structural information of the data but drastically increases the overhead and therefore is not desirable. Another approach is to use a directed graph to represent the nested fields (Goldman R. et al., Lore: a database management system for XML, 2000). The search through a directed graph can be extremely computationally intensive and costly as the complexity of the data, and hence the complexity of the graph, grows. Both approaches result in an index structure whose complexity is comparable with that of the XML data itself. A more efficient and cost-saving indexing and searching method is desired.
To resolve the above problems, the present invention is directed to an improved method and a computer system for indexing and searching records in a language utilizing nested fields, such as XML. The present invention discloses an indexing and searching engine that constructs an improved full-text search index on the input XML data and then performs searches using the index. The indexing and searching engine according to the preferred embodiment of this invention supports exact matches and partial matches using a wildcard character.
In accordance with one aspect of the present invention, the method transforms the problem of indexing and searching nested field records, including XML data, into the problem of full-text indexing and searching of plain text documents. The input XML data is changed into a form that encodes the field structural information by suffixing each word with its corresponding field qualifiers in their nested entirety, or alternatively, by suffixing each word with a numerical code pattern that represents the word's corresponding field qualifiers in their nested entirety. The resulting encoded words are then stored in a full-text index structure.
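The suffixing step may be illustrated by the following minimal sketch. The concrete encoding is an assumption made here for illustration: the word and its qualifiers are joined by “@”, the nested qualifiers are written outermost-first and separated by “/”, and the root element is treated as a record wrapper that is left out of the path. The sample record and the function name encode_words are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical record, for illustration only.
SAMPLE = """
<record>
  <name><first_name>John</first_name><last_name>Smith</last_name></name>
  <address>
    <street>12 Smith Road</street>
    <city>New York</city>
    <state>New York</state>
  </address>
</record>
"""

def encode_words(xml_text, sep="@"):
    """Return one encoded token per word: word + sep + full nested field path."""
    root = ET.fromstring(xml_text)
    tokens = []

    def walk(element, path):
        # Only text directly inside the element is encoded; tail text is ignored.
        text = (element.text or "").strip()
        for word in text.lower().split():
            tokens.append(word + sep + "/".join(path))
        for child in element:
            walk(child, path + [child.tag])

    # Assumption: the root element is only a wrapper and is omitted from paths.
    for child in root:
        walk(child, [child.tag])
    return tokens

tokens = encode_words(SAMPLE)
print(tokens)
```

Under this assumed encoding, “Smith” in the last_name field becomes smith@name/last_name while “Smith” in the street field becomes smith@address/street, so the two occurrences are distinct entries when stored in an ordinary full-text index.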
In accordance with another aspect of the present invention, wildcard matching may be used to perform searches with or without field qualifiers. Searching with a wildcard and without field qualifiers identifies a record that includes a particular word regardless of the field in which it appears, whereas searching with a wildcard and with field qualifiers identifies a record that includes a particular word in a designated field or in fields that share a certain level of similarly nested structure.
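A minimal sketch of such wildcard searches follows, assuming for illustration that each indexed token has the form word@path with path components separated by “/”. That token format, the TOKENS list, and the function search are hypothetical. The linear scan stands in for the lookup that a real full-text index would perform against its sorted word table.

```python
from fnmatch import fnmatchcase

# Encoded tokens for hypothetical records, in the assumed "word@path" form.
TOKENS = [
    "smith@name/last_name",
    "smith@address/street",
    "new@address/city", "york@address/city",
    "new@address/state", "york@address/state",
]

def search(tokens, word, field=None):
    """Return tokens matching a word, optionally restricted to a field subtree."""
    word = word.lower()
    if field is None:
        patterns = [word + "@*"]                  # the word in any field
    else:
        patterns = [word + "@" + field,           # the designated field itself
                    word + "@" + field + "/*"]    # any field nested under it
    return [t for t in tokens if any(fnmatchcase(t, p) for p in patterns)]

print(search(TOKENS, "Smith"))                 # both occurrences of "smith"
print(search(TOKENS, "Smith", "address"))      # only the street occurrence
print(search(TOKENS, "York", "address/city"))  # only the city occurrence
```

Using two patterns for a qualified search, rather than a single pattern ending in “*”, keeps a query on “address” from also matching an unrelated field whose name merely begins with “address”.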
In accordance with yet another aspect of the present invention, a combination of string matching and integer pattern matching is used in the search of a particular word. The portion of the word without field qualifiers is first matched against the words in the index, and then the word's field qualifiers are transformed into a pattern of numerals, e.g., integers, to be matched against the integer patterns of the words in the index that correspond to their respective field qualifiers. Therefore, evaluation of complex field criteria is reduced to simpler and faster numeric matching.
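The two-stage match may be sketched as follows, under stated assumptions: each field name is assigned a small integer on first use, each index entry keeps the field path of a word as a tuple of those integers, and a query's qualifiers match when their integer tuple is a prefix of the stored tuple. The code assignment scheme, the prefix rule, and all names below are hypothetical choices made for this sketch.

```python
from collections import defaultdict

FIELD_CODES = {}  # field name -> small integer, assigned on first use

def code_for(field):
    return FIELD_CODES.setdefault(field, len(FIELD_CODES) + 1)

def add_entry(index, word, path, record_id):
    """Store the bare word with its field path kept as a tuple of integers."""
    pattern = tuple(code_for(f) for f in path.split("/"))
    index[word.lower()].append((pattern, record_id))

def search(index, word, field=None):
    """Stage 1: string match on the bare word. Stage 2: integer prefix match."""
    entries = index.get(word.lower(), [])
    if field is None:
        return sorted({rec for _, rec in entries})
    names = field.split("/")
    if any(n not in FIELD_CODES for n in names):
        return []  # an unknown field name cannot match any stored pattern
    wanted = tuple(FIELD_CODES[n] for n in names)
    return sorted({rec for pattern, rec in entries
                   if pattern[:len(wanted)] == wanted})

# Hypothetical index contents, for illustration only.
index = defaultdict(list)
add_entry(index, "Smith", "name/last_name", 1)
add_entry(index, "Smith", "address/street", 2)
add_entry(index, "York", "address/city", 2)
add_entry(index, "York", "address/state", 3)

print(search(index, "Smith", "address"))      # [2]
print(search(index, "York", "address/city"))  # [2]
```

In this sketch the field criteria are evaluated by comparing short integer tuples after a single string lookup, instead of comparing field-name strings for every candidate entry.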
The present invention, in all its aspects as a method and computer system for indexing and searching nested field records such as XML data and documents, significantly improves the effectiveness and speed of searches, and hence facilitates full realization of the advantages of XML as an extensible, portable data exchange format.
Further features, objects, and advantages of the present invention are apparent in the examples and in the detailed description that follows.