1. Field of the Invention
The invention generally relates to arrangements for phrase matching in documents. More particularly, the invention relates to arrangements for phrase matching in documents that may contain nested-structure, arbitrary (document-specific) markup (including nested tags such as nested context, annotations, and the like).
2. Related Art
Extensible Markup Language (XML), and its ancestor Standard Generalized Markup Language (SGML), were originally developed by the document processing community for adding both structural and semantic markup to texts. XML's markup is considered “arbitrary” in that a document creator may define markup tags on a document-specific basis. Further, XML's markup structure may be “nested” in that context, annotations and the like may be nested within other context, annotations and the like, and within each other.
Text sources such as Shakespeare's plays have been augmented to include markup describing scenes, speeches, and speakers (J. Bosak, The plays of Shakespeare in XML at the oasis-open.org web site). Classical literature abounds in commentaries added by literary critics (e.g., the Talmud contains commentaries on Biblical text). XML permits such commentaries to be easily identified via user-defined annotations. As a more recent example, the XML documents published by the Library Of Congress (LOC) contain the large texts of legislative bills; in these texts, the names of the sponsors of a bill and the committees to which a bill is referred are identified in the body of the bill with markup. XML can also be used to represent the output of natural-language processing systems; such output labels the grammatical structure of natural language text, for example, with subjects and verbs, and noun and adjective phrases (M. Marcus et al. Treebank-2, LDC catalog no. LDC95T7 (CD-ROM) Philadelphia: Linguistic Data Consortium, 1999).
In the absence of markup, phrase matching is a common technique to search text and identify relevant documents. Conventional phrase matching typically requires that words in a phrase be contiguous or in close proximity. For example, searching for the phrase “To be, or not to be” would return very different results than searching for the same set of words as individual keywords. Most information retrieval (IR) systems support phrase matching on text and on HTML documents, as they adopt the simple but effective expedient of ignoring universally recognized HTML tags.
However, XML provides nested arbitrary (document-specific) markup, including context and annotations. Accordingly, techniques of merely ignoring universally-recognized tags cannot be successfully extended to searching XML documents. Thus, there is a need in the art to provide a searching arrangement allowing one to specify which individual tags and complete annotations (i.e., elements and their content) to ignore.
For example, consider a phrase match query “Mr. English introduced this bill” in the XML document fragment in Table I, in which the query phrase is emphasized for purposes of discussion:
TABLE I
XML Document fragment (noncontiguous query phrase emphasized)

<sponsor>Mr. English</sponsor>
<footnote>For himself and <co-sponsor>Mr. Coyne</co-sponsor></footnote>
introduced this bill, which was referred to the
<committee-name>Committee on Financial Services</committee-name>
The phrase being sought is not contiguous: the words “English” and “introduced” are separated by:
a </sponsor> end tag,
a <footnote> </footnote> complete annotation, and
a <co-sponsor> </co-sponsor> complete annotation embedded within the footnote.
To properly respond to this query on this document fragment, it is necessary to ignore the </sponsor> end tag, and the entire <footnote> </footnote> annotation.
Specifying that the <co-sponsor> tag should be ignored does not change the result, because ignoring the <footnote> </footnote> annotation implicitly causes the <co-sponsor> tag to be ignored. However, failing to specify that the </sponsor> end tag should be ignored does change the result: the phrase is then no longer matched.
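For purposes of illustration only, the behavior described for the Table I fragment may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the arrangement of the invention: the function names (visible_words, phrase_match) and parameters (ignore_tags, ignore_annotations) are hypothetical, the markup is assumed to be well-formed with no self-closing tags or comments, and a tag is ignored by name (both its start tag and its end tag) rather than individually. Any tag that is not ignored is treated as a barrier that breaks contiguity.

```python
import re

# Matches either a start/end tag (groups 1 and 2) or a run of text (group 3).
TOKEN = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^>]*>|([^<]+)")
BARRIER = None  # placeholder for a tag that was NOT ignored


def visible_words(fragment, ignore_tags=(), ignore_annotations=()):
    """Return the lowercase words left after ignoring the named markup.

    ignore_tags: tag names whose start/end tags are skipped (content kept).
    ignore_annotations: tag names whose tags AND content are skipped.
    """
    words = []
    skip_depth = 0  # greater than 0 while inside an ignored annotation
    for m in TOKEN.finditer(fragment):
        closing, name, text = m.group(1), m.group(2), m.group(3)
        if text is not None:
            if skip_depth == 0:
                words.extend(w.lower() for w in re.findall(r"\w+", text))
        elif name in ignore_annotations:
            skip_depth += -1 if closing else 1
        elif skip_depth == 0 and name not in ignore_tags:
            words.append(BARRIER)  # a non-ignored tag breaks contiguity
    return words


def phrase_match(fragment, phrase, ignore_tags=(), ignore_annotations=()):
    """Return True if the phrase occurs contiguously in the visible words."""
    words = visible_words(fragment, ignore_tags, ignore_annotations)
    target = [w.lower() for w in re.findall(r"\w+", phrase)]
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


FRAGMENT = (
    "<sponsor>Mr. English</sponsor>"
    "<footnote>For himself and <co-sponsor>Mr. Coyne</co-sponsor></footnote>"
    " introduced this bill, which was referred to the "
    "<committee-name>Committee on Financial Services</committee-name>"
)
QUERY = "Mr. English introduced this bill"

if __name__ == "__main__":
    # Ignoring the sponsor tag and the whole footnote annotation.
    print(phrase_match(FRAGMENT, QUERY, {"sponsor"}, {"footnote"}))
    # Also naming co-sponsor; it is already covered by the footnote.
    print(phrase_match(FRAGMENT, QUERY, {"sponsor", "co-sponsor"}, {"footnote"}))
    # Not ignoring the sponsor tag.
    print(phrase_match(FRAGMENT, QUERY, (), {"footnote"}))
```

In this sketch, the first two calls report a match and the third does not, mirroring the discussion above. Ignoring only the tags of the footnote (and not its content) would likewise fail to match, since the words “For himself and Mr. Coyne” would then fall between “English” and “introduced”.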
While phrase matching in general is a common information retrieval (IR) technique to search text and identify relevant documents in a document collection, the customized phrase matching required to search documents with markup such as XML's is not supported by conventional IR systems. Text may be interleaved with arbitrary and nested markup, thwarting search techniques that require strict contiguity or close proximity of keywords. XML and similar languages having nested-structure document-specific markup thus present new challenges for phrase matching. Accordingly, there is a need in the art for a technique for phrase matching in nested-structure document-specific markup languages that permits dynamic specification of both the phrase to be matched and the particular markup, especially document-specific markup, to be ignored.