Definitions and abbreviations used herein are as follows:
Action—an instruction concerning what to do with some matched text.
Annotation Configuration—a file that identifies and orders the set of annotators that should be applied to some text for a specific application.
Annotations—attributes, or values, assigned to words or word groups that provide interesting information about the word or words. Example annotations include part-of-speech, noun phrases, morphological root, named entities (such as Corporation, Person, Organization, Place, Citation), and embedded numerics (such as Time, Date, Monetary Amount).
Annotator—a software process that assigns attributes to base tokens or to constituents or that creates constituents from patterns of one or more base tokens.
Attributes—features, values, properties or links that are assigned to individual base tokens, sequences of base tokens or related but not necessarily adjacent base tokens (i.e., patterns of base tokens). Attributes may be assigned to the tokenized text through one or more processes that apply to the tokenized text or to the raw text.
Auxiliary definition—in the RuBIE pattern recognition language, a statement or shorthand notation used to name and define a sub-pattern for use elsewhere.
Base tokens—minimal meaningful units, such as alphabetic strings (words), punctuation symbols, numbers, and so on, into which a text is divided by tokenization. Base tokens are the minimum building blocks for a text processing system.
Case-corrected—text in which everything is lower case except for named entities.
Constituent—a base token or pattern of base tokens to which an attribute has been assigned. Although constituents often consist of a single base token or a pattern of base tokens, a constituent is not necessarily comprised of contiguous base tokens. An example of a non-contiguous constituent is the two-word verb looked up in the sentence He looked the address up.
Constituent attributes—those attributes that are assigned to a pattern of one or more base tokens that represent a single constituent.
Label—an alphanumeric string that uniquely identifies a pattern recognition rule or auxiliary definition.
Machine learning-based pattern recognition—pattern recognition in which a statistic-based process might be given a mix of example texts that do and do not represent the targeted extraction result, and the process will attempt to identify the valid patterns that correspond to the targeted results.
Pattern—a description of a number of base tokens that should be recognized in some way, where the recognition of the tokens is primarily driven by targeted attributes that have been assigned to the text through annotation processes. One or more annotation value tests, zero or more recognition shifts, zero or more regular expression operators, and zero or more XPath-based (tree-based) operators may all be included in a pattern.
Pattern recognition language—a language used to guide a text processing system to find defined patterns of annotations. In its most common usage, a pattern recognition rule will test each constituent in some pattern for the presence or absence of one or more desired annotations (attributes). If the right combinations of annotations are found in the right order, the statement can then copy that text, add further annotations, or both, and return it to an application (that is, extract it) for further processing. Because linguistic relationships can involve constituents that are tree-structured or otherwise not necessarily sequentially ordered, a pattern recognition rule can also follow these types of relationships and not just sequentially arranged constituents.
Pattern recognition rule—a statement used to describe what text should be located by its pattern, and what should be done when such a pattern is found.
RAF—RuBIE application file.
RuBIE—Rule-Based Information Extraction language. The language in which the pattern recognition rules of the present invention are expressed.
RuBIE application file—a flat text file that contains one or more text pattern recognition rules and possibly other components of the RuBIE pattern recognition language. Typically it will contain all of the extraction rules associated with a single fact extraction application.
Rule-based pattern recognition—pattern recognition in which the pattern recognition rules are developed by a computational linguist or other pattern recognition specialist, usually through an iterative trial-and-error develop-evaluate process.
Shift—pattern recognition functionality that changes the location within a text where a pattern recognition rule is applying. Many pattern recognition languages have rules that process a text in left-to-right order. Shift functionality allows a rule to process a text in some other order, such as repositioning pattern recognition from mid-sentence to the start of a sentence, from a verb to its corresponding subject in mid-rule, or from any point to some other defined non-contiguous point.
Scope—the portion or sub-pattern of a pattern recognition rule that corresponds to an action. An action may act upon the text matched by the sub-pattern only if the entire pattern successfully matches some text.
Sub pattern—any pattern fragment that is less than or equal to a full pattern. Sub-patterns are relevant from the perspective of auxiliary definition statements and from the perspective of scopes of actions.
Tests—tests apply to constituents to verify either the value of a constituent or whether a particular attribute has been assigned to that constituent.
Text—in the context of a document search and retrieval application such as LexisNexis®, any string of printable characters, although in general a text is usually expected to be a document or document fragment that can be searched, retrieved and presented to customers using the online system. Web pages, customer documents, and natural language queries are other examples of possible texts.
Token—a minimal meaningful unit, such as an alphabetic string (word), space, punctuation symbol, number, and so on.
Token attributes—those attributes that are assigned to individual base tokens. Examples of token attributes may include the following: (1) part of speech tags, (2) literal values, (3) morphological roots, and (4) orthographic properties (e.g., capitalized, upper case, lower case strings).
Tokenize—to divide a text into a sequence of tokens.
Prior art pattern recognition languages and tools include lex, SRA's NetOwl® technology, and Perl™. These prior art pattern recognition languages and tools primarily exploit physical or orthographic characteristics of the text, such as alphabetic versus digit, capitalized vs. lower case, or specific literal values. Some of these also allow users to annotate pieces of text with attributes based on a lexical lookup process.
In the mid-1980s, the Mead Data Central (now LexisNexis) Advanced Technology & Research Group created a tool called the leveled parser. The leveled parser was an example of a regular expression-based pattern recognition language. It used a lexical scanner to tokenize a text, that is, to break the text up into its basic components (“base tokens”), such as words, spaces, punctuation symbols, numbers, document markup, etc. It then used a combination of dictionary lookups and parser grammars to identify and annotate individual tokens and patterns of tokens of interest, based on attributes (“annotations” or “labels”) assigned to those tokens through the scanner, parser or dictionary lookup. A base token and patterns of base tokens that share some common attribute are called “constituents”.
For example, the lexical scanner might break the character string
I saw Mr. Mark D. Benson go away.
into the annotated base token pattern:
UC  LCS  CPS  PER  CPS   UC  PER  CPS     LCS  LCS   PER
I   saw  Mr   .    Mark  D   .    Benson  go   away  .
(where UC = upper case letter, LCS = lower case string, CPS = capitalized string, PER = period).
A dictionary lookup may include a rule to assign the annotation TITLE to any of the following words and phrases: Mr, Mrs, Ms, Miss, Dr, Rev, President, etc. For the above example, this would result in the following annotated token sequence:
UC  LCS  TITLE  PER  CPS   UC  PER  CPS     LCS  LCS   PER
I   saw  Mr     .    Mark  D   .    Benson  go   away  .
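The scanner and dictionary lookup steps described above can be sketched in a few lines of Python. This is only an illustration, not the leveled parser itself: the function names `ortho_tag` and `annotate`, the `OTHER` tag, and the choice to drop spaces during tokenization are assumptions made for the example. The tag abbreviations and the title words come from the text.

```python
import re

# Title words named in the text; the set itself is an illustrative assumption.
TITLES = {"Mr", "Mrs", "Ms", "Miss", "Dr", "Rev", "President"}

def ortho_tag(token):
    """Assign an orthographic annotation using the abbreviations from the text."""
    if token == ".":
        return "PER"            # period
    if token.isalpha():
        if len(token) == 1 and token.isupper():
            return "UC"         # single upper case letter
        if token.islower():
            return "LCS"        # lower case string
        if token.isupper():
            return "UCS"        # upper case string
        if token[0].isupper() and token[1:].islower():
            return "CPS"        # capitalized string
    return "OTHER"              # hypothetical catch-all, not in the text

def annotate(text):
    """Tokenize text (spaces are dropped) and return (token, tag) pairs."""
    tokens = re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text)
    pairs = [(tok, ortho_tag(tok)) for tok in tokens]
    # Dictionary lookup: a token found in TITLES is re-annotated as TITLE.
    return [(tok, "TITLE" if tok in TITLES else tag) for tok, tag in pairs]

tagged = annotate("I saw Mr. Mark D. Benson go away.")
for tok, tag in tagged:
    print(f"{tag:6}{tok}")
```

Running the sketch prints one token per line with its annotation, with Mr carrying TITLE in place of CPS.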
A parser grammar was then used to find interesting tokens and token patterns and annotate them with an indication of their function in the text. The parser grammar rules were based on regular expression notation, a widely used approach to create rules that generally work from left to right through some text or sequence of annotated tokens, testing for the specified attributes.
For example, a regular expression rule to recognize people names in annotated text might look like the following:
(TITLE (PER)?)? (CPS | UCS) (UC (PER)? | CPS | UCS)?(CPS | UCS)
This rule first looks for a TITLE attribute optionally (“?”) followed by a period (PER); the TITLE or TITLE-PERIOD group as a whole is also optional. Then it looks for either a capitalized (CPS) OR upper case (UCS) string. It then looks for an upper case letter (UC) optionally followed by a period (PER), OR it looks for a capitalized string (CPS), OR it looks for an upper case string (UCS), although like the title, this portion of the rule is optional. Finally it looks for a capitalized (CPS) OR upper case (UCS) string.
This rule will find Mr. Mark D. Benson in the above example sentence. It will also find names like the following:
Mark Benson
Mark D Benson
Mark David Benson
Mr. Mark Benson
However, it will not find names like the following:
Mark
Benson
George H. W. Bush
e. e. cummings
Bill O'Reilly
Furthermore it will also incorrectly recognize a lot of other things as person names, such as Star Wars in the following sentence:
Mark saw Star Wars yesterday.
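The behavior of the person name rule, including its misses and its false positive, can be reproduced with a small Python sketch. The approach here is an assumption chosen for brevity: the annotations are joined into a string of space-terminated tags and the rule is written as an ordinary regular expression over that string. The names `NAME_RULE` and `find_names` and the hand-annotated example sentences are hypothetical.

```python
import re

# The rule from the text, written over a string of space-terminated tags.
NAME_RULE = re.compile(
    r"\b(?:TITLE (?:PER )?)?"           # optional title, optionally with a period
    r"(?:CPS|UCS) "                     # capitalized or upper case string
    r"(?:UC (?:PER )?|(?:CPS|UCS) )?"   # optional initial or middle name
    r"(?:CPS|UCS) "                     # capitalized or upper case string
)

def find_names(tagged):
    """Return the token lists matched by NAME_RULE in (token, tag) pairs."""
    tag_string = "".join(tag + " " for _, tag in tagged)
    names = []
    for match in NAME_RULE.finditer(tag_string):
        # Each tag ends with one space, so spaces count tokens.
        first = tag_string.count(" ", 0, match.start())
        length = match.group().count(" ")
        names.append([tok for tok, _ in tagged[first:first + length]])
    return names

# Hand-annotated inputs using the tags defined in the text.
BENSON = [("I", "UC"), ("saw", "LCS"), ("Mr", "TITLE"), (".", "PER"),
          ("Mark", "CPS"), ("D", "UC"), (".", "PER"), ("Benson", "CPS"),
          ("go", "LCS"), ("away", "LCS"), (".", "PER")]
STAR_WARS = [("Mark", "CPS"), ("saw", "LCS"), ("Star", "CPS"),
             ("Wars", "CPS"), ("yesterday", "LCS"), (".", "PER")]
BUSH = [("George", "CPS"), ("H", "UC"), (".", "PER"), ("W", "UC"),
        (".", "PER"), ("Bush", "CPS")]

print(find_names(BENSON))     # the full name with title and initial
print(find_names(STAR_WARS))  # a false positive: Star Wars
print(find_names(BUSH))       # nothing: two initials are not covered
```

The second and third calls show why a single rule is rarely enough: the rule has no way to tell a film title from a name, and it allows for only one middle element.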
A grammar, whether a lexical scanner, leveled parser or any of the other conventional, expression-based pattern recognition languages and tools, may contain dozens, hundreds or even thousands of rules that are designed to work together for overall accuracy. Any one rule in the grammar may handle only a small fraction of the targeted patterns. Many rules typically are written to find what the user wants, although some rules in a grammar may primarily function to exclude some text patterns from other rules.
Regular expression-based pattern recognition works well for a number of pattern recognition problems in text. It is possible to achieve accuracy rates of 90%, 95% or higher for a number of interesting categories, such as company, people, organization and place names; addresses and address components; embedded numerics, such as times, dates, telephone numbers, weights, measures, and monetary amounts; and other tokens of interest such as case and statute citations, case names, social security numbers and other types of identification numbers, document markup, websites, e-mail addresses, and table components.
Regular expressions do have a problem recognizing some categories of tokens because there is little if any consistency in the structure of names in those categories, regardless of how many rules one might use. These include product names and names of books or other media, names that can be almost anything. There are also some language-specific issues that one runs into, for example: rules that recognize European language-based names in American English text often will stumble on names of Middle Eastern and Asian language origin; and rules developed to exploit capitalization patterns common in English language text may fail on languages with different capitalization patterns.
However, in spite of such problems, regular expression-based pattern recognition languages are widely used in a number of text processing applications across a number of languages.
What makes a text interesting is not that it contains just names, citations or other such special tokens, but that it also identifies the roles, functions, and attributes of those entities and their relationships with one another. These relationships are represented in text in any of a number of ways.
Consider the following sentences:
John kissed Mary.
Mary was kissed by John.
John only kissed Mary.
John kissed only Mary.
John, that devil, kissed Mary.
John kissed an unsuspecting Mary.
John snuck up behind Mary and kissed her.
Mary was minding her own business when John kissed her.
And yet for all of these sentences, the fundamental “who did what to whom” relationship is John (who) kissed (did what) Mary (to whom).
When trying to exploit sophisticated linguistic patterns, regular expression-based pattern recognition languages that progress from left to right through a sentence can enjoy some success even without any sophisticated linguistic annotations like agent or patient, but only for those cases where the attributes of interest are generally adjacent to one another, as in the first two example sentences above that use simple active voice or simple passive voice—and little else—to express the relationship between John and Mary.
But this approach to pattern recognition soon falls apart with the addition of any linguistic complexity to the sentence, such as adding a word like only or pronoun references like her.
A system that would attempt to find and annotate or extract who did what to whom in the above sentences would need at least two rather sophisticated linguistic processes:
(1) The ability to identify and exploit agent-action-patient relationships in sentences or clauses (the reader may think of these in terms of subject-verb-object relationships, but agent-action-patient is more descriptive and useful given the existence of both active and passive sentences).
(2) The ability to link coreferring expressions, such as her to Mary in the above sentences, and exploit those links.
This type of functionality is fundamentally beyond the scope of regular expression-based pattern recognition languages.
Orthographic attributes that are assigned to texts or text fragments are attributes whose assignment is based on attributes of the characters in the text, such as capitalization characteristics, letters versus digits, or the literal value of those characters.
Regular expression-based pattern recognition rules applied to the characters in a text are quite useful for tokenizing a text into its base tokens and assigning orthographic annotations to those tokens, such as capitalized string, upper case letter, punctuation symbol or space.
Regular expression-based pattern recognition rules applied to base tokens are quite useful for combining base tokens together into special tokens such as named entities, citations, and embedded numerics. These types of rules also assign orthographic annotations.
A dictionary lookup may be used to assign orthographic, semantic, and other annotations to a token or pattern of tokens. In an earlier example, a dictionary was used to assign the attribute TITLE to Mr. Some dictionary lookup processes at their heart rely on regular expression-based rules that apply to character strings, although there are other approaches to do this.
Semantic annotations can tell us that something is a person name or a potential title, but these types of annotations do not indicate the function of that person in a document. John may be a person name, but that does not tell us if John did the kissing or if he himself was kissed.
Linguists create parsers to help determine the natural language syntax of sentences, sentence fragments, and other texts. This syntax is interesting not only in its own right for the linguistic annotations it provides, but also because it provides a basis for addressing ever more linguistically sophisticated problems. Identifying clauses, their syntactic subjects, verbs, and objects, and the various types of pronouns provides a basis for determining agents, actions, and patients in those clauses and for addressing some types of coreference resolution problems, particularly those involving linking pronouns to names and other nouns.
One typical characteristic of parser-based text annotations is that the annotations are usually represented by a tree or some other hierarchical representation. A tree is useful for representing both simple and rather complex syntactic relationships between tokens.
One such tree representation for John kissed Mary is shown in FIG. 1.
Parse trees not only annotate a text with syntactic attributes like Noun Phrase or Verb, but through the relationships they represent, it is possible to derive additional grammatical roles as well as semantic functions. For example,
    A Noun Phrase found immediately under a Sentence node in such a tree may be annotated as the Grammatical Subject.
    Depending on its content and location relative to the verb, a Noun Phrase found immediately under a Verb Phrase may be annotated as the Grammatical Object.
    If the Verb in this Sentence is an active verb, then the Grammatical Object may be annotated with Patient as its semantic function. If the Verb is passive, then the Grammatical Subject may instead be annotated as the patient.
As sentences grow more complex, the process for annotating the text with these attributes also grows more complex—just as is seen with regular expression-based rule sets that target people names or other categories. But in general, many relationships between constituents of the tree can be defined by descriptions of their relative locations in the structure.
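The derivation of grammatical roles from relative positions in the tree can be illustrated with a minimal Python sketch. The tree encoding as (label, children) pairs, the helper names `child`, `words` and `roles`, and the caller-supplied `passive` flag are all assumptions for the example; a real parser would determine voice itself and would handle far more structures than this simple sentence.

```python
# A node is (label, children); a leaf carries its word instead of children.
TREE = ("Sentence", [
    ("NounPhrase", [("ProperNoun", "John")]),
    ("VerbPhrase", [
        ("Verb", "kissed"),
        ("NounPhrase", [("ProperNoun", "Mary")]),
    ]),
])

def words(node):
    """Collect the words under a node, left to right."""
    _, children = node
    if isinstance(children, str):
        return [children]
    return [w for c in children for w in words(c)]

def child(node, label):
    """Return the first immediate child with the given label, or None."""
    _, children = node
    if isinstance(children, str):
        return None
    return next((c for c in children if c[0] == label), None)

def roles(sentence, passive=False):
    """Apply the three positional rules from the text to one sentence."""
    subject = child(sentence, "NounPhrase")      # NP directly under Sentence
    verb_phrase = child(sentence, "VerbPhrase")
    verb = child(verb_phrase, "Verb")
    obj = child(verb_phrase, "NounPhrase")       # NP directly under VerbPhrase
    patient = subject if passive else obj        # voice decides the patient

    def text(node):
        return " ".join(words(node)) if node else None

    return {"subject": text(subject), "verb": text(verb),
            "object": text(obj), "patient": text(patient)}

print(roles(TREE))
```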
Through tokenization, dictionary lookups and parsing, it is possible for a part of the text to have many annotations assigned to it.
In the above sentence, the token Mary may be annotated with several attributes, such as the following:
Literal value “Mary”
Morphological root “Mary”
Quantity Singular
Capitalized String
Alphabetic String
Proper Noun
Person Name
Gender Female
Noun Phrase
Grammatical Object of Verb “Kiss”
Patient of Verb “Kiss”
Part of Verb Phrase “Kissed Mary”
Part of Sentence “John Kissed Mary”
The tree representation of FIG. 1 can capture all of these attributes, as shown in FIG. 2.
The hierarchical relationships represented by a tree can be represented through other means. One common way is to represent the hierarchy through the use of nested parentheses. A notation like X(Y), for example, could be used to annotate whatever Y is with the structural attribute X. Using the above example, ProperNoun (John) indicates that John is a constituent under Proper Noun in the tree. Using this notation, the whole sentence would look like the following:
Sentence( NounPhrase( ProperNoun( John ) ),VerbPhrase( Verb( kissed ), NounPhrase( ProperNoun(Mary ) ) ) )
Often with this type of representation, the hierarchy can be made more apparent through the use of new lines and indentation, as the following shows:
Sentence(
  NounPhrase(
    ProperNoun( John ) ),
  VerbPhrase(
    Verb( kissed ),
    NounPhrase(
      ProperNoun( Mary ) ) ) )
The difference is purely cosmetic; the use of labels and parentheses is identical.
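As a rough illustration of the X(Y) notation, the following Python sketch renders one tree in both the single-line and the indented form and checks that they differ only in whitespace. The (label, children) encoding and the function names `flat` and `indented` are assumptions made for the example.

```python
# A node is (label, children); a leaf carries its word instead of children.
TREE = ("Sentence", [
    ("NounPhrase", [("ProperNoun", "John")]),
    ("VerbPhrase", [
        ("Verb", "kissed"),
        ("NounPhrase", [("ProperNoun", "Mary")]),
    ]),
])

def flat(node):
    """Render the tree on one line in X( Y ) notation."""
    label, children = node
    if isinstance(children, str):
        return f"{label}( {children} )"
    return f"{label}( " + ", ".join(flat(c) for c in children) + " )"

def indented(node, depth=0):
    """Render the same notation with one nesting level per indent step."""
    label, children = node
    pad = "  " * depth
    if isinstance(children, str):
        return f"{pad}{label}( {children} )"
    inner = ",\n".join(indented(c, depth + 1) for c in children)
    return f"{pad}{label}(\n{inner} )"

def strip_space(text):
    """Remove all whitespace so only labels, words and parentheses remain."""
    return "".join(text.split())

print(flat(TREE))
print(indented(TREE))
print(strip_space(flat(TREE)) == strip_space(indented(TREE)))
```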
In computing, there are now a number of widely used approaches for annotating a text with hierarchy-based attributes. SGML, the Standard Generalized Markup Language, gained widespread usage in the early 1990s. HTML, the HyperText Markup Language, is based on SGML and is used to publish hypertext documents on the World Wide Web.
XML, the Extensible Markup Language, was introduced in 1998. Since then it has gained growing acceptance in a number of text representation problems, many of which are geared towards representing the content of some text (a document) in a way that makes it easy to format, package, and present that text in any of a number of ways. XML is increasingly being used as a basis for representing text that has been annotated for linguistic processing. It has also emerged as a widely used standard for defining specific markup languages for capturing and representing document structure, although it can be used for any structured content.
The structure of a news article may include the headline, byline, dateline, publisher, date, lead, and body, all of which fall under a document node. A tree representation of this structure might look as shown in FIG. 3.
Just as XML can be used to define a news document markup, it can be used to define the type of linguistic markup shown in the John kissed Mary example above.
The notation for XML markup uses a label to mark the beginning and end of the annotated text. Where X (Y) is used above to represent annotating the text Y with the attribute X, XML uses the following, where <X> and </X> are XML tags that annotate text Y with X:
<X>Y</X>
The John kissed Mary example would look like:
<Sentence><NounPhrase><ProperNoun>John</ProperNoun></NounPhrase><VerbPhrase><Verb>kissed</Verb><NounPhrase><ProperNoun>Mary</ProperNoun></NounPhrase></VerbPhrase></Sentence>
or cosmetically printed as:
<Sentence>
  <NounPhrase>
    <ProperNoun>John</ProperNoun>
  </NounPhrase>
  <VerbPhrase>
    <Verb>kissed</Verb>
    <NounPhrase>
      <ProperNoun>Mary</ProperNoun>
    </NounPhrase>
  </VerbPhrase>
</Sentence>
The elements of the XML representation correspond to the nodes in the tree representation here. And just as attributes such as +Object, +Patient and Literal “Mary” were added to the nodes of the tree in FIG. 2, attributes can be associated with XML elements. Attributes in XML provide additional information about the element or the contents of that element.
For example, it is possible to associate attributes with the Proper Noun element “Mary” found in the tree above in the following way in an XML element:
<ProperNoun LITERAL="Mary" NUM="SINGULAR" ORTHO="CPS" TYPE="ALPHA" PERSON="True" GENDER="Female">Mary</ProperNoun>
In computational linguistics, trees are routinely used to represent both syntactic structure and attributes assigned to nodes in the tree. XML can be used to represent this same information.
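A short Python sketch using the standard library module `xml.etree.ElementTree` shows how such an annotated tree might be built and serialized. The element and attribute names come from the example above; the variable names and the decision to attach attributes only to the Mary element are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Build the John kissed Mary tree node by node.
sentence = ET.Element("Sentence")
subject_np = ET.SubElement(sentence, "NounPhrase")
ET.SubElement(subject_np, "ProperNoun").text = "John"
verb_phrase = ET.SubElement(sentence, "VerbPhrase")
ET.SubElement(verb_phrase, "Verb").text = "kissed"
object_np = ET.SubElement(verb_phrase, "NounPhrase")

# Attributes attached to the element carry the token's annotations.
mary = ET.SubElement(object_np, "ProperNoun", {
    "LITERAL": "Mary", "NUM": "SINGULAR", "ORTHO": "CPS",
    "TYPE": "ALPHA", "PERSON": "True", "GENDER": "Female",
})
mary.text = "Mary"

# Serialize to a string, then parse it back to confirm it is well formed.
xml_text = ET.tostring(sentence, encoding="unicode")
reparsed = ET.fromstring(xml_text)
print(xml_text)
print(reparsed.find("./VerbPhrase/NounPhrase/ProperNoun").attrib)
```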
Finding related entities/nodes in trees and identifying the relationships between them primarily rely on navigating the paths between these entities and using the information associated with the entities/nodes. For example, as discussed above, this information could be used to identify grammatical subjects, objects and the relationship (in that case the verb) between them.
Linguists historically have used programming languages like Lisp to create, annotate, analyze, and navigate tree representations of text. XPath is a language created to similarly navigate XML representations of texts.
XSL is a language for expressing stylesheets. An XML style sheet is a file that describes how to display an XML document of a given type.
XSL Transformations (XSLT) is a language for transforming XML documents, such as for generating an HTML web page from XML data.
XPath is a language used to identify particular parts of XML documents. XPath lets users write expressions that refer to elements and attributes. XPath indicates nodes in the tree by their position, relative position, type, content, and other criteria. XSLT uses XPath expressions to match and select specific elements in an XML document for output purposes or for further processing.
When linguistic trees are represented using XML-based markup, XPath and XPath-based functionality can serve as a basis for processing that representation much like linguists have historically used Lisp and Lisp-based functionality.
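To give a flavor of this kind of navigation, the sketch below selects nodes by position, by attribute value, by content, and by relative position. It relies on Python's `xml.etree.ElementTree`, which implements only a limited subset of XPath, so it should be read as an approximation of what a full XPath processor offers. The variable names and the reduced attribute set on the Mary element are assumptions for the example.

```python
import xml.etree.ElementTree as ET

XML = (
    "<Sentence>"
    "<NounPhrase><ProperNoun>John</ProperNoun></NounPhrase>"
    "<VerbPhrase><Verb>kissed</Verb>"
    '<NounPhrase><ProperNoun GENDER="Female">Mary</ProperNoun></NounPhrase>'
    "</VerbPhrase>"
    "</Sentence>"
)
root = ET.fromstring(XML)

# By position: the noun phrase directly under Sentence versus under VerbPhrase.
subject = root.findtext("./NounPhrase/ProperNoun")
verb = root.findtext("./VerbPhrase/Verb")
obj = root.findtext("./VerbPhrase/NounPhrase/ProperNoun")

# By attribute value: every proper noun annotated as female.
female = [e.text for e in root.findall(".//ProperNoun[@GENDER='Female']")]

# By content: the noun phrase whose ProperNoun child has the text Mary.
mary_np = root.find(".//NounPhrase[ProperNoun='Mary']")

# By relative position: the parent of the Verb element.
verb_parent = root.find(".//Verb/..")

print(subject, verb, obj)
print(female, mary_np.tag, verb_parent.tag)
```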
Most work in information extraction research with which the inventors are familiar has focused on systems where all of the component technologies were created or adapted to work together. Base token identification feeds into named entity recognition. Named entity recognition results feed into a part of speech tagger. Part of speech tagging results feed into a parser. All of these processes can make mistakes, but because each tool feeds its results into the next one and each tool generally assumes correct input, errors are often built on errors.
In contrast, where annotation processes come from multiple sources and are not originally designed to work together, they do not necessarily build off each other's mistakes. Instead, their mistakes can be in conflict with one another.
For example, a named entity recognizer that uses capitalization might incorrectly include the capitalized first word of a sentence as part of a name, whereas a part of speech tagger that relies heavily on term dictionaries may keep that first word separate. E.g.,
Original text:   Did   Bill          go to the store?
Named entity:    [     Person     ]
Part of Speech:  [AUX] [ProperNoun]
This can be an even bigger problem if two annotators conflict in their results at both the beginning and the end of the annotated text string. For example, for the text string A B C D E, assign tag X to A B C and Y to C D E as shown in FIG. 4. In an XML representation, one possible end result is:
<X> A B <Y> C </X> D E </Y>
For example, if a sentence mentions a college and its home state, “ . . . University of Chicago, Illinois . . . . ”, then overlapping annotations for Organization and City may result:
<Organization> University of <City> Chicago,</Organization> Illinois </City>
Well-formed XML has a strict hierarchical syntax. In XML, marked sub-pieces of text are permitted to be nested within one another, but their boundaries may not cross; that is, they may not have overlapping tags. (HTML, the markup commonly used for web pages, is in practice more tolerant of overlapping tags, since web browsers generally accept them.) This typically is not a problem for most XML-based applications, because the text and its attributes are created through guidance from valid document type definitions (DTDs). Because it is possible to incorporate annotators that were not designed to some common DTD, annotators can produce conflicting attributes. For that reason the RuBIE annotation process needs a component that can combine independently-generated annotations into valid XML.
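One conceivable way to reconcile a pair of crossing annotations is to split the later-starting one at the point where the earlier one ends, so that each fragment nests cleanly. The Python sketch below illustrates that idea for the A B C D E example; it is an assumption made for illustration and is not presented as the method used by the RuBIE annotation process. The half-open (start, end, label) span encoding, the function names, and the `Text` root element are all hypothetical, and the sketch handles only a single crossing pair.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def crosses(a, b):
    """True if two half-open token spans overlap without nesting."""
    (s1, e1, _), (s2, e2, _) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

def split_pair(a, b):
    """Split the later-starting span at the end of the earlier one."""
    if not crosses(a, b):
        return [a, b]
    first, second = sorted([a, b])      # order by start offset
    cut = first[1]
    return [first, (second[0], cut, second[2]), (cut, second[1], second[2])]

def to_xml(tokens, spans, root="Text"):
    """Serialize properly nested token spans as an XML string."""
    opens, closes = {}, {}
    for start, end, label in sorted(spans, key=lambda s: (s[0], -s[1])):
        opens.setdefault(start, []).append(label)     # outermost opens first
        closes.setdefault(end, []).insert(0, label)   # innermost closes first
    parts = [f"<{root}>"]
    for i, tok in enumerate(tokens + [None]):         # None marks end of text
        parts += [f"</{label}>" for label in closes.get(i, [])]
        parts += [f"<{label}>" for label in opens.get(i, [])]
        if tok is not None:
            parts.append(escape(tok) + " ")
    parts.append(f"</{root}>")
    return "".join(parts)

tokens = ["A", "B", "C", "D", "E"]
x_span, y_span = (0, 3, "X"), (2, 5, "Y")   # X on A B C, Y on C D E
xml_text = to_xml(tokens, split_pair(x_span, y_span))
print(xml_text)
ET.fromstring(xml_text)                     # parses, so it is well formed
```

Splitting preserves both annotators' output at the cost of representing Y as two elements; other strategies, such as discarding one annotation or widening one span to contain the other, trade off differently.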
Further, our past experiences with prior pattern recognition tools showed a great deal of value in both regular expressions and tree-traversal tools, depending on the application. Tools such as SRA NetOwl® Extractor, Inxight Thingfinder™, Perl™, and Mead Data Central's Leveled Parser all provide “linear” pattern recognition, and tools such as XSLT and XPath provide hierarchical tree-traversal. However, we did not find any pattern recognition tool that combined these, particularly in a way appropriate for XML-based document representations. The typical representation to which regular expressions usually apply does not have a tree structure, and thus is not generally conducive to tree traversal-based functionality. Conversely, although tree representations are natural candidates for tree traversal functionality, their structure is not generally supportive of regular expressions.
The Penn Tools, an information extraction research prototype developed by the University of Pennsylvania, combine strong regular expression-based pattern recognition functionality with what on the surface appeared to be some tree navigation functionality. However, in that tool, only a few interesting types of tree-based relationships were retained. These were translated into a positional, linear, non-tree representation so that their regular expression-based extraction language, Mother of Perl (“MOP”), could also apply to those relationships in its rules. The Penn Tools information extraction research prototype does not have the ability to exploit all of the available, tree-based relationships in combination with full regular expression-based pattern recognition.
It is to the solution of these and other problems that the present invention is directed.