The present exemplary embodiment relates generally to document processing. It finds particular application in conjunction with a method for developing a protocol for identifying text which expresses a given concept and a system for concept matching, and will be described with particular reference thereto. However, it is to be appreciated that the present exemplary embodiment is also amenable to other like applications.
Database and Internet searching are widely used for retrieving documents that are relevant to the information needs of a user. Many document processing problems involve finding text passages which express a given concept. Examples include information retrieval (IR), information extraction (IE), and question answering (QA) problems. Some concepts are generally expressed by a small set of fixed words or expressions and thus are easy to detect. For example, they may be detected automatically, using a simple keyword search or a set of regular expressions. Other concepts are more difficult to detect because their expressions are more varied. The problem stems from the fact that language is both productive and ambiguous: the same concept can be expressed by an unbounded number of expressions and, at the same time, the words making up those expressions can have different meanings in other contexts. Keyword searching is thus generally not applicable for concepts conveyed by a wide range of linguistic expressions.
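By way of illustration only, the following minimal Python sketch shows a keyword-style detector of the kind mentioned above, together with the two failure modes attributed to productivity and ambiguity. The pattern `ACQUISITION`, the function name `mentions_acquisition`, and the sample passages are assumptions introduced for this example and do not come from any particular system.

```python
import re

# Hypothetical keyword pattern for the concept "acquisition" (an assumption
# made for this sketch): a few inflected forms of "acquire" plus the noun.
ACQUISITION = re.compile(
    r"\bacquir(?:e|es|ed|ing)\b|\bacquisitions?\b", re.IGNORECASE
)


def mentions_acquisition(passage: str) -> bool:
    """Return True if the passage contains one of the fixed keywords."""
    return ACQUISITION.search(passage) is not None


# A fixed expression of the concept is found.
hit = mentions_acquisition("Acme acquired Widget Corp.")
# Productivity: the same concept, worded differently, is missed.
missed = mentions_acquisition("Widget Corp will end up being owned by Acme.")
# Ambiguity: the keyword occurs, but not in the intended sense.
false_alarm = mentions_acquisition("She acquired a taste for jazz.")
```

In this sketch `hit` is `True`, `missed` is `False`, and `false_alarm` is `True`: the detector works for the fixed wording, overlooks the paraphrase, and fires on an unrelated use of the same word.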
Existing document processing systems typically deal with the productive nature of language by allowing substitution of similar expressions or expression schemata, where similarity is defined such that if one expression is relevant to a user's query, then similar ones can be assumed to be relevant as well. These similarities are usually defined using three levels of linguistic information: morphological equivalences, syntactic equivalences, and lexical semantic equivalences. For morphological equivalences, a morphological processing component can detect word forms that are inflected or derived from the same root, e.g.: X acquires Y; X acquired Y; and the acquisition of Y by X. For syntactic equivalences, a system using syntactic rules can detect similarity between pairs of expressions such as: X acquired Y; and Y was acquired by X. For lexical semantic equivalences, a system can be provided (by hand or using corpus statistics) with information about various semantic relationships, e.g. synonymy or hyponymy, among lexical units, and this information can be useful for detecting similarities such as: X acquired Y; and X bought Y.
By using generic linguistic resources of the above sorts, a system can allow users to specify a single search pattern that matches a range of different expressions. For example, using currently available linguistic resources, a user searching for descriptions of acquisitions could conceivably write a single pattern that matched all of the above descriptions of transactions. However, the range of expressions that convey a given concept may be even greater. For example, simple morphological, lexical, and syntactic substitutions would not be sufficient to match that same pattern with expressions such as:
Y will end up being owned by X
X became the new owner of Y
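The three kinds of equivalence discussed above, and their limits, can be sketched as follows. This is a toy illustration under stated assumptions, not a description of any existing system: the tables `LEMMAS` and `SYNONYMS`, the three regular expressions, and the function `normalize` are all hypothetical stand-ins for a real morphological analyzer, parser, and semantic lexicon.

```python
import re

# Hypothetical toy resources (assumptions for this sketch only).
LEMMAS = {  # morphological equivalences: word form -> root
    "acquires": "acquire", "acquired": "acquire", "acquisition": "acquire",
    "buys": "buy", "bought": "buy",
}
SYNONYMS = {"buy": "acquire"}  # lexical semantic equivalences

# Syntactic equivalences: three surface frames mapped to one structure.
PASSIVE = re.compile(r"^(\w+) was (\w+) by (\w+)$")
NOMINAL = re.compile(r"^the (\w+) of (\w+) by (\w+)$")
ACTIVE = re.compile(r"^(\w+) (\w+) (\w+)$")


def canonical_predicate(word):
    """Reduce a word form to its root, then to a canonical synonym."""
    lemma = LEMMAS.get(word.lower(), word.lower())
    return SYNONYMS.get(lemma, lemma)


def normalize(expression):
    """Return (predicate, agent, patient), or None if no frame applies."""
    text = expression.strip()
    m = PASSIVE.match(text)
    if m:
        patient, verb, agent = m.groups()
        return (canonical_predicate(verb), agent, patient)
    m = NOMINAL.match(text)
    if m:
        noun, patient, agent = m.groups()
        return (canonical_predicate(noun), agent, patient)
    m = ACTIVE.match(text)
    if m:
        agent, verb, patient = m.groups()
        return (canonical_predicate(verb), agent, patient)
    return None


target = ("acquire", "X", "Y")
covered = [
    normalize(e) == target
    for e in ("X acquires Y", "the acquisition of Y by X",
              "Y was acquired by X", "X bought Y")
]
uncovered = normalize("Y will end up being owned by X")
```

Under these assumptions every entry of `covered` is `True`, so a single pattern `("acquire", "X", "Y")` matches all four variants, while `uncovered` is `None`: the paraphrase falls outside what morphological, syntactic, and lexical substitution can reach.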
Some concepts tend to be expressed in relatively limited ways, and for such concepts, reasonable coverage and precision can be attained using the standard types of linguistic resources described above, if not with a single query pattern then with relatively few of them. Commercial transactions appear to be such a concept: in most corpora, the majority of commercial transactions can be matched with one of the patterns “X sell Y” or “Z buy Y” using the standard types of resources. For this reason, the problem of finding purchases of companies is often used as an example in the IE literature. But for other concepts, the number of patterns needed becomes unmanageable.
The challenge, then, is to provide a way of expressing patterns that generalize over the types of variation that cannot be accounted for using traditional morphological, syntactic, and lexical semantic resources. There has been a great deal of theoretical work on the sort of linguistic resources that would be necessary to do this in a general way. For example, the ideal semantic lexicon would contain, in a machine-processable form, the information that a purchase involves a change of ownership, and that an expression following “end up” is the resultant state of a change. A system endowed with such a lexicon could conceivably infer that “Y will end up being owned by X” is likely to indicate a purchase of Y by X. But in practice, the information that could be considered for inclusion in such a lexicon is boundless, and the process of encoding it is time-consuming and error-prone, so practical, general-purpose results do not appear to be forthcoming.
Many systems employ lexical resources, such as a “named entity recognizer,” which is a module that identifies a few types of expressions such as dates, numbers and names (primarily of people, places, and organizations), typically using a combination of fixed word lists and patterns. This allows the user to write extraction patterns in which all entities of a particular type, e.g. company names, are considered “similar” for the purposes of pattern matching. While named entity recognition can be useful for particular tasks, its range of applicability is limited.
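As a rough illustration of the combination of fixed word lists and patterns described above, the following Python sketch tags a few entity types and then applies an extraction pattern written over the tags. The word list `COMPANY_NAMES`, the date and number formats, the tag names, and the function `tag_entities` are assumptions chosen for this example; practical recognizers are considerably more elaborate.

```python
import re

# Hypothetical fixed word list and surface patterns (assumptions).
COMPANY_NAMES = {"Acme", "Globex"}
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # e.g. 2005-03-01
NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")     # e.g. 40 or 3.5


def tag_entities(text):
    """Replace each recognized entity with a placeholder for its type."""
    text = DATE.sub("<DATE>", text)      # dates first, so their digits
    text = NUMBER.sub("<NUMBER>", text)  # are not tagged as numbers
    for name in sorted(COMPANY_NAMES, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(name) + r"\b", "<COMPANY>", text)
    return text


# An extraction pattern in which all company names count as "similar".
EXTRACTION_PATTERN = re.compile(r"<COMPANY> acquired <COMPANY>")

tagged = tag_entities("Acme acquired Globex on 2005-03-01 for 40 million dollars")
known = EXTRACTION_PATTERN.search(tagged) is not None
unknown = EXTRACTION_PATTERN.search(tag_entities("Initech acquired Globex")) is not None
```

Here `tagged` is `<COMPANY> acquired <COMPANY> on <DATE> for <NUMBER> million dollars`, so `known` is `True`, whereas `unknown` is `False` because "Initech" is absent from the word list. The sketch generalizes over entities of a listed type but does nothing for variation in how the relation itself is worded.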