The inventions herein relate to systems and methods for locating desired information within one or more text documents. More particularly, the inventions relate to systems and methods which permit rapid, resource-efficient searches of natural language documents in order to locate pertinent documents and passages based on the role(s) of the user's search term.
In order to facilitate discussion of the prior art and the inventions with precision, the terms below are defined for the reader's convenience.
Glossary
Information Retrieval (IR): The task of searching for textual information that matches a user's query from a set of documents.
Information Extraction (IE): The task of identifying very specific elements, defined by a user, in a text. Often, this is the process of answering the questions who, what, where, when, how, and why. For example, a user might be interested in extracting the names of companies that produce software and the names of those software packages. Information Extraction is distinct from Information Retrieval because 1) IE looks for specific information within a document rather than returning an entire document, and 2) an IE system is preprogrammed for these specifications while an IR system must be general enough to respond to any user query.
Relevance: A document is relevant if it matches the user's query.
Recall: A measure of performance. Given the total number of documents relevant to a user's query, recall is the percentage of that number that the system returned as relevant. For example, if there are 500 documents that match a user's query, but the IR system only returns 50 relevant documents, then the system has demonstrated 10% recall.
Precision: A measure of performance. Of the documents the system returned, precision is the percentage that were truly relevant to the user's query. For example, if the IR system returned 50 documents, but only 25 of them matched the query, the system has demonstrated 50% precision.
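The two measures above reduce to simple ratios; a minimal sketch in Python makes the arithmetic concrete (the function names are illustrative, not part of any system described herein):

```python
def recall(num_relevant_returned, total_relevant):
    """Fraction of all truly relevant documents that the system returned."""
    return num_relevant_returned / total_relevant

def precision(num_relevant_returned, total_returned):
    """Fraction of the returned documents that are truly relevant."""
    return num_relevant_returned / total_returned

# Glossary examples: 50 of 500 relevant documents returned -> 10% recall;
# 25 relevant documents among 50 returned -> 50% precision.
print(recall(50, 500))    # 0.1
print(precision(25, 50))  # 0.5
```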
Syntactic Roles: The subject, direct object, and indirect object of a clause. Although not strictly a syntactic role, we also include the type of verb phrase (active-voice, passive-voice, middle-voice, infinitive) in this group.
Conceptual Roles: Conceptual roles are a way of identifying the particular players within an action or event without regard to the syntax of the clause in which the action or event occurs. Consider the following two sentences.
1. The boy purchased an ice cream cone.
2. An ice cream cone was purchased by the boy.
In the first sentence, the subject is the purchaser and the direct object is the item that was purchased. In the second sentence, however, the subject is now the thing that was purchased and the purchaser is the object of the prepositional phrase introduced by "by." The "purchaser" and "purchased object" represent conceptual roles because they correspond to specific participants in a purchasing event. As evidenced by these two sentences, conceptual roles can appear in different locations within a sentence's syntactic structure. The advantage of using conceptual roles for information extraction over syntactic roles is that a system can extract the participants of an event regardless of the particular syntax of the sentence.
Theta Roles: Theta roles (also called thematic roles) are similar to conceptual roles in that they correspond to the participants of events or actions. In contrast to conceptual roles, the set of theta roles as defined herein is relatively constrained to include actors (who perform actions), objects or recipients (who receive action), experiencers (actors which play a role but receive no action directly), instruments (used to perform an action), dates (when an action occurred) and locations (where an action occurred). The set of conceptual roles, however, is not constrained. Conceptual roles can be defined to be appropriate to a particular task or collection of texts. In terrorism texts, for example, we may want to define the conceptual roles of perpetrator and victim, while in corporate acquisition texts we may want to define the conceptual roles of purchaser, purchasee, and transaction amount.
Syntactic Caseframe: An extraction pattern based purely on syntactic roles, e.g. "SUBJ <active-voice:kidnap>" would extract the subject of any active-voice construction of the verb "to kidnap."
Caseframe: Synonymous with syntactic caseframe.
Theta Caseframe: A caseframe based on theta roles (often called conceptual roles) rather than syntactic roles, e.g. "AGENT <verb:purchase>" or "OBJECT <verb:purchase>."
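The distinction between syntactic and theta caseframes can be sketched over pre-parsed clauses. The clause representation and function names below are illustrative assumptions for exposition, not the data structures of any system described herein:

```python
def match_syntactic_caseframe(clause, verb, voice, role):
    """Syntactic caseframe, e.g. 'SUBJ <active-voice:kidnap>': extract the
    given syntactic role only when the verb root and voice both match."""
    if clause["verb_root"] == verb and clause["voice"] == voice:
        return clause.get(role)
    return None

def extract_agent(clause):
    """Theta caseframe 'AGENT <verb:purchase>': the purchaser, whether it
    surfaces as the subject (active voice) or the 'by'-object (passive)."""
    if clause["verb_root"] != "purchase":
        return None
    return clause["subject"] if clause["voice"] == "active" else clause.get("pp_by")

# The two glossary sentences, as hypothetical parser output:
clause1 = {"verb_root": "purchase", "voice": "active",
           "subject": "the boy", "direct_object": "an ice cream cone"}
clause2 = {"verb_root": "purchase", "voice": "passive",
           "subject": "an ice cream cone", "pp_by": "the boy"}

# The syntactic caseframe fires only on the active-voice sentence...
print(match_syntactic_caseframe(clause1, "purchase", "active", "subject"))  # the boy
print(match_syntactic_caseframe(clause2, "purchase", "active", "subject"))  # None
# ...while the theta caseframe finds the purchaser in both.
print(extract_agent(clause1))  # the boy
print(extract_agent(clause2))  # the boy
```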
Morphological Root Form: The original form of a word once suffixes and prefixes have been removed, e.g. verb conjugations reduced to the raw verb form: "reported" and "reporting" are both forms of "report."
Associative Model: The traditional approach to recognizing meaning in text. This model recognizes that certain words in association with each other generate meaning. For example, the terms "headquarters," "smoke," "alarm" and "siren" appear to generate the concept of a headquarters building on fire even though the term "fire" does not occur. Compare this approach to the Relational Model below.
Relational Model: An approach to recognizing meaning in text that takes advantage of the relationships between words. For example, the following three phrases each generate a different meaning: "headquarters on fire," "headquarters under fire" and "fire headquarters." The key to recognizing the distinction among these phrases is to recognize the relationship between "headquarters" and "fire."
Relational Text Index (RTI): The final output which may be generated when using the invention. This is an index of events and relationships, and the participants in those events or relationships, along with the document and sentence in which each occurred.
Meta-type: A way of collecting specific conceptual types into a more general type. For example, if a verb normally represents a particular action, then a meta-type can be a group of verbs that could be considered synonymous. For example, the verbs "to think," "to believe," and "to understand" could be considered to be somewhat synonymous, and as verbs of cognition, they give rise to the meta-type "cognitive-action." Meta-types do not necessarily imply a two-level classification scheme. More than one meta-type may be combined into a single, more general meta-type. The meta-type "movement-action" contains the meta-types "transportation-action" and "physical-movement-action," in which the former includes "to fly" and "to drive" while the latter includes "to walk," "to run" and "to crawl." Meta-types, therefore, represent nodes in a hierarchy of semantically related words in which each meta-type node must have at least two children. Note that common examples of non-verb-based meta-types include grouping semantically related nouns or noun phrases together to include collections of dates, times, and locations.
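The hierarchy described above can be sketched as a simple tree of nodes; the dictionary layout and function name below are illustrative assumptions, using the example meta-types from the text:

```python
# Each meta-type node has at least two children, which may be words or
# other meta-type nodes (the "movement-action" example from the glossary).
META_TYPES = {
    "movement-action": ["transportation-action", "physical-movement-action"],
    "transportation-action": ["fly", "drive"],
    "physical-movement-action": ["walk", "run", "crawl"],
}

def covers(meta_type, word):
    """True if `word` falls anywhere beneath `meta_type` in the hierarchy."""
    for child in META_TYPES.get(meta_type, []):
        if child == word or covers(child, word):
            return True
    return False

print(covers("movement-action", "crawl"))        # True (via physical-movement-action)
print(covers("transportation-action", "crawl"))  # False
```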
POWERDRILL: A particular system that implements some of the inventions herein for information retrieval.
With the terms defined in the glossary above in mind, a discussion of the typical prior art keyword-based information retrieval systems and their weaknesses will be more meaningful.
Discussion of Prior Art
Traditional methods for information retrieval are based on an associative model of recognizing meaning in text. Associative models identify concepts by measuring how often particular terms occur in a specific document compared to how often they occur in general. In practice, this typically means that such systems record the content of a document by recognizing which words appear within the document along with their frequency. Essentially, a standard information retrieval system will count how often each English word occurs in a particular document. This information is then saved in a matrix, or table, indexed by the word and document name. Such a table is depicted in FIG. 1 for the sample text "Now is the time for all good men to come to the aid of their country."
In a typical keyword-based information retrieval system, the table of FIG. 1 would contain a column for each document in the searchable database, and a row for every English word. Since the number of English words can be enormous, many information retrieval systems reduce the number of distinct words they recognize by removing common prefixes and suffixes from words. For example, the words "engine," "engineer," "reengineer" and "engineering" may be stemmed as instances of "engine" to save space. In addition, many information retrieval systems ignore commonly occurring words like "the," "an," "is" and "of." Because these words appear so often in English, they are assumed to carry little distinguishing value for the IR task, and eliminating them from the index reduces the size of that index. Such words are referred to as stop words.
When an IR user enters a query, the system looks up each query word in the table and records which documents contained the query word. Normally, each document is assigned a statistical measure of relevance, based on the frequency of the query word occurrence, which assists the system in ranking the returned documents. For example, if Document X contained a particular search term 10 times, and Document Y contained the same term 100 times, Document Y would be considered more relevant to the search query than Document X. In practice, IR systems can implement very complex statistical models that take into account more than one search term, the length of each document, the relative frequency of words in general text, and other features in order to return more precise measures of relevance to the user.
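The pipeline just described, stop-word removal, stemming, a word-by-document frequency table, and frequency-ranked retrieval, can be sketched compactly. The stop list and suffix-stripping stemmer below are deliberately simplistic illustrations, not the behavior of any particular prior-art system:

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "an", "a", "is", "of", "to", "for", "and"}

def stem(word):
    # Crude suffix stripping, standing in for a real stemming algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_documents(docs):
    """Build the word-by-document frequency table (in the style of FIG. 1)."""
    table = defaultdict(Counter)
    for name, text in docs.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                table[stem(word)][name] += 1
    return table

def search(table, query_word):
    """Return document names ranked by how often the query word occurs."""
    counts = table.get(stem(query_word.lower()), Counter())
    return [doc for doc, _ in counts.most_common()]

docs = {
    "Document X": "stock " * 10 + "report",
    "Document Y": "stock " * 100 + "report",
}
table = index_documents(docs)
# Y mentions the term more often, so it ranks as more relevant than X.
print(search(table, "stocks"))  # ['Document Y', 'Document X']
```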
Keyword-based information retrieval is often imprecise because its underlying assumption, that a document's content is represented by the frequency of word occurrences within the document, is often invalid. Two of the main problems with this assumption are that 1) words can have multiple meanings (polysemy), and 2) words in isolation often do not capture much meaning.
To illustrate polysemy, consider the word "stock." In Wall Street Journal texts, this word is most often used as a noun, meaning a share of ownership in a company. In texts about ranching, however, the word refers to a collection of cattle. In texts about retail business, the word can be a verb, referring to the act of replenishing a shelf with goods. By searching on words alone, without regard to their meaning, a keyword-based IR system returns irrelevant documents to the user. Researchers refer to this type of inaccuracy as a lack of precision.
To illustrate the issue behind working with words in isolation, consider the following two sentences.
1. The elephant ran past me.
2. The elephant ran over me.
Note that the only difference between the two sentences is the change in the preposition from "past" to "over." Clearly, however, the sentences connote two very different occurrences. Keyword-based IR systems are unable to recognize the distinction because they do not interpret the function of the prepositional phrases "past me" and "over me" (they modify the elephant's running). Additionally, prepositions are considered to be stop words by most IR systems, so sentence 1 and sentence 2 will be represented in the keyword index as if they were identical. This type of inaccuracy is another example of a lack of precision: the user will receive irrelevant documents in response to his/her query.
Another issue with keyword-based information retrieval is that a user must be sure to enter the appropriate keyword in his/her query, or the IR system may miss relevant documents. For example, a user searching for the word "airplane" may find that searching on the term "plane" or "Boeing 727" will retrieve documents that would not be found by using the term "airplane" alone. Although some IR systems now use thesauri to automatically expand a search by adding synonymous terms, it is unlikely that a thesaurus can provide all possible synonymous terms. This kind of inaccuracy is referred to as a lack of recall because the system has failed to recall (or find) all documents relevant to a query.
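Thesaurus-based query expansion, and its inherent incompleteness, can be sketched as follows. The thesaurus contents here are an illustrative assumption; real systems use far larger resources and still miss terms such as "Boeing 727":

```python
# Toy synonym table standing in for a real thesaurus.
THESAURUS = {
    "airplane": {"plane", "aircraft", "jet"},
}

def expand_query(terms):
    """Add any known synonyms to the original query terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= THESAURUS.get(term, set())
    return expanded

print(sorted(expand_query(["airplane"])))
# ['aircraft', 'airplane', 'jet', 'plane'] -- yet a document mentioning only
# "Boeing 727" would still be missed, so recall remains imperfect.
```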
Thus, there is a clear need in the art for a rapid and efficient search mechanism that permits searching of natural language documents using an approach that recognizes meaning based on the relationships that words bear to one another.