XML, or eXtensible Markup Language, provides additional features over HTML (HyperText Markup Language). XML allows data to be placed in various contexts by letting specific markup commands and descriptors be created for specific data. In contrast, HTML uses only a limited set of markup commands, which are primarily used to affect the look and positioning of text in a document. With XML, the idea is to make data self-identifying by associating descriptive markup commands, also known as meta tags or meta words, with the data. For example, an entry of <patient-name>NAME</patient-name> in a record would be recognizable as a patient's name. A medical computer receiving a document could be programmed to search for a patient's name by looking for the meta word <patient-name>, and then, for example, adding or updating the associated patient's name in its database as appropriate.
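The patient-name lookup described above can be sketched in a few lines of Python using the standard library's XML parser. This is only an illustration: the record contents, the <record> and <diagnosis> elements, and the function name find_patient_names are assumptions made for the example, not part of any particular medical system.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; only <patient-name> comes from the text above.
RECORD = """<record>
  <patient-name>Jane Doe</patient-name>
  <diagnosis>Sprained ankle</diagnosis>
</record>"""


def find_patient_names(xml_text):
    """Return the text of every <patient-name> element in the document."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the element is found at any depth.
    return [element.text for element in root.iter("patient-name")]


names = find_patient_names(RECORD)
print(names)
```

A receiving program could then add or update each returned name in its own database.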
One of the basic uses of this type of functionality is to associate text with the type of structure, or field, in which it is found, such as a title, abstract, body, paragraph, table, or list. By allowing these associations, complex text structures having multiple levels can be achieved. For example, portions of text can be associated with meta words indicating that the text is found within a paragraph that is within a list that is within another paragraph. One advantage of associating text with fields is that searches for terms within specific types of fields can be done easily and quickly. Because the beginning and ending of each paragraph, title, abstract, and the like are stored as meta words in each document, it is easy to quickly find instances of terms that fall within, for example, a title. This is a powerful tool for conducting searches, and can be expected to play a larger role in future search engines for the World Wide Web. In general, commercially available search engines for the Web do not currently search on meta words stored in documents written in XML.
A typical search engine is AltaVista's ni2, described in part by U.S. Pat. No. 5,832,500, the contents of which are hereby incorporated by reference. The ni2 search engine searches an index created from a database of records, and has the ability to search for words and meta words. A typical index has entries for each indexable word and meta word in a document, together with the associated locations for each word and meta word.
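As a rough illustration of such an index, the following Python sketch maps each word and meta word to the locations at which it occurs. It is a simplified assumption for explanatory purposes and does not reflect the actual data layout of ni2; the token list and the function name build_index are hypothetical.

```python
from collections import defaultdict


def build_index(tokens):
    """Map each word or meta word to the list of locations where it occurs."""
    index = defaultdict(list)
    for location, token in enumerate(tokens):
        index[token].append(location)
    return dict(index)


# Hypothetical tokenized document; meta words are written in angle brackets.
DOC = [
    "<TitleBegin>", "blue", "sky", "<TitleEnd>",
    "<ParBegin>", "the", "sky", "is", "blue", "<ParEnd>",
]

index = build_index(DOC)
print(index["blue"])        # locations of the word "blue"
print(index["<ParBegin>"])  # locations of the paragraph-begin meta word
```

With locations stored this way, a query for a word within a field reduces to comparing the word's locations against those of the enclosing begin and end meta words.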
One particular problem that arises from searching for text associated with certain meta words occurs when text contains multiple fields of the same type that overlap or enclose each other. For example, consider a search conducted on a document that has the following structure, where meta words are in angle brackets and “Par” means paragraph:
<ParBegin><ParBegin><ParEnd>Blue<ParEnd>
If a search engine were queried as to whether this document has an instance of the word “Blue” within a paragraph, it would first find an example of a ParEnd meta word just past the location of the word Blue, and assume this is a meta word representing the end of a paragraph field. It would then search the locations immediately preceding the last ParEnd, revealing another ParEnd meta word just before the word Blue. Continuing toward the front of the document, the search engine would next identify an instance of a ParBegin meta word. The search engine might associate this ParBegin meta word with the ParEnd meta word adjacent to it, on the left side of Blue, and thereby inaccurately report back that Blue is not within a paragraph. Or, the search engine might not know which of the two ParEnd meta words correlates to the ParBegin meta word, and return an error. Or, it might assume that one of the two ParEnd meta words is a mistake and, not knowing which is which, also report an error back under these circumstances.
The problem in the example is that the search engine does not know which ParEnd meta word is associated with the ParBegin meta word closest to the word Blue. In reality, the two ParEnd meta words are at different “nesting levels.” The first ParBegin and the last ParEnd in the document constitute one paragraph field, which is at a predetermined nesting level—for example, nesting level zero. Then, nested within this first paragraph field at nesting level zero is another paragraph field at nesting level one, bounded by the ParBegin and ParEnd meta words in the middle of the document. In any given document, there may be many different nesting levels of many different types of fields.
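To make the notion of nesting levels concrete, the following Python sketch assigns a level to each ParBegin and ParEnd meta word in the example document using a simple depth counter. The function name assign_nesting_levels and the token representation are assumptions chosen for illustration; this is not presented as the indexing method itself.

```python
def assign_nesting_levels(tokens, begin="<ParBegin>", end="<ParEnd>"):
    """Return (location, meta word, nesting level) for each begin/end meta word."""
    levels = []
    depth = 0
    for location, token in enumerate(tokens):
        if token == begin:
            # A field opened at the current depth gets that depth as its level.
            levels.append((location, token, depth))
            depth += 1
        elif token == end:
            depth -= 1
            if depth < 0:
                raise ValueError(f"unmatched {end} at location {location}")
            # The end meta word gets the same level as its matching begin.
            levels.append((location, token, depth))
    if depth != 0:
        raise ValueError(f"{depth} unmatched {begin} meta word(s)")
    return levels


# The example document discussed above.
DOC = ["<ParBegin>", "<ParBegin>", "<ParEnd>", "Blue", "<ParEnd>"]

for entry in assign_nesting_levels(DOC):
    print(entry)
```

Here the first ParBegin and the last ParEnd both receive level zero, and the inner pair receives level one, so a begin meta word can be matched with its end meta word by comparing levels rather than by adjacency.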
One way of overcoming the problem of nested fields is, when creating the index, to parse the document into each separate field, and to index separately all the text stored within each field. However, this leads to duplication, because fields may overlap and different fields will then contain the same text. Thus, this solution is expensive in terms of data storage requirements, as well as time-consuming for indexing and searching purposes. Another approach is to have the search engine, when searching an index for a particular field, not jump around but instead read the index sequentially from the start of a document, keeping track of the start and end points of the various fields as they appear. However, this is inefficient and requires significant computational resources to process queries. A more pragmatic solution is simply to disallow searching on fields within fields of the same name, a tack taken by the current version of ni2.
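The sequential approach can be sketched as follows in Python. For simplicity, this hypothetical example scans a list of tokens rather than an actual index, and the function name word_within_field is an assumption. It answers the query correctly, but it must read every token from the start of the document up to the match, which is the inefficiency noted above.

```python
def word_within_field(tokens, word, begin="<ParBegin>", end="<ParEnd>"):
    """Scan from the start of the document, tracking how many fields are open."""
    depth = 0
    for token in tokens:
        if token == begin:
            depth += 1
        elif token == end:
            depth -= 1
        elif token == word and depth > 0:
            # The word occurs while at least one field of this type is open.
            return True
    return False


# The nested-paragraph example document.
DOC = ["<ParBegin>", "<ParBegin>", "<ParEnd>", "Blue", "<ParEnd>"]

print(word_within_field(DOC, "Blue"))
```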
Thus, it would be desirable to provide a method for indexing a database of documents that contains entries having nesting level information associated with the meta words so that fields nested within fields could be quickly and effectively searched.
It would also be desirable to provide a method for indexing a database of documents that stores easily searchable meta words and associated nesting level information with minimal duplication, thereby minimizing the need for valuable memory resources.
Additionally, it would be desirable to provide a method for searching a database of documents that can identify, quickly and effectively, the nesting levels of the meta words closest to the text desired to be searched.
It would further be desirable to provide a method for searching a database of documents that can search all nesting levels of a particular type of field in a document in a sequential and direct manner.