The following U.S. Patent documents provide descriptions of art related to the present application: U.S. Pat. No. 5,418,889, issued May 1995 to Ito (hereinafter “Ito”); U.S. Pat. No. 5,696,916, issued December 1997 to Hitachi (hereinafter “Hitachi”); U.S. Pat. No. 6,026,388, issued February 2000 to Liddy et al. (hereinafter “Liddy”); U.S. Pat. No. 6,185,592, issued February 2001 to Boguraev et al. (hereinafter “Boguraev 1”); U.S. Pat. No. 6,212,494, issued April 2001 to Boguraev (hereinafter “Boguraev 2”); U.S. Pat. No. 6,246,977, issued June 2001 to Messerly et al. (hereinafter “Messerly”); U.S. Pat. No. 6,263,335, issued July 2001 to Paik et al. (hereinafter “Paik”); and U.S. Pat. No. 7,421,645, issued September 2008 to Reynar (hereinafter “Reynar”).
Automatic text processing, which can include the tasks of information retrieval, knowledge engineering, machine translation, summarization, etc., requires a certain linguistic analysis to be performed.
This analysis, especially as its depth and complexity increase from the primary lexical level to the semantic level, is based on traditional knowledge of the language, e.g., vocabulary and morphology, and on so-called recognizing linguistic models or patterns. These patterns model, to a certain extent, the cognitive functions of a person comprehending text, and they make use of concrete lexical units of the language as well as their part-of-speech classes and elements of syntactic and semantic relationships. These two types of knowledge, together with statistical methods, provide the basis for algorithms that automatically recognize various semantic components, relationships, and their attributes in text, e.g., keywords, objects and their parameters, agents, actions, facts, and cause-effect relationships. In other words, they provide automatic semantic labeling of natural language text, for example of strings of text, in accordance with a previously specified classifier. The classifier, in turn, is defined by the final goal of the text processing task.
Some existing methods are aimed at manually compiled databases having a strict structure or at text having strictly defined fields. Such methods usually perform only a shallow linguistic analysis of text, which does not produce high accuracy. In particular, the semantic labeling of strings of text is reduced to the recognition of only a few special types of semantic components or relationships. For example, Reynar provides application program interfaces for labeling strings of text with a semantic category or list while a user is creating a document and provides user e-commerce actions based on the category or list. A list may include, for example, a type label “Person Name” or “Microsoft Employee.”
Hitachi describes a system that uses a predefined concept dictionary with high-low relationships, namely, “is-a” relations and “part-whole” relations between concepts.
Liddy uses a similar technology for user query expansion in an information search system.
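The idea of a concept dictionary with “is-a” and “part-whole” relations, and its use for query expansion, can be illustrated with a minimal sketch. The class name, method names, and sample concepts below are hypothetical and chosen only for illustration; the sketch is not the method of Hitachi or Liddy, only one simple way such a dictionary might be traversed.

```python
from collections import defaultdict


class ConceptDictionary:
    """Toy concept dictionary holding "is-a" and "part-whole" relations."""

    def __init__(self):
        # Broader concept -> set of narrower concepts ("is-a").
        self.is_a = defaultdict(set)
        # Whole -> set of its parts ("part-whole").
        self.part_of = defaultdict(set)

    def add_is_a(self, child, parent):
        self.is_a[parent].add(child)

    def add_part_of(self, part, whole):
        self.part_of[whole].add(part)

    def expand(self, term):
        """Return the term plus every concept reachable by following
        "is-a" links downward and "part-whole" links from whole to part."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            # .get() avoids creating empty entries in the defaultdicts.
            stack.extend(self.is_a.get(current, ()))
            stack.extend(self.part_of.get(current, ()))
        return seen


# Hypothetical sample data, for illustration only.
dictionary = ConceptDictionary()
dictionary.add_is_a("laser printer", "printer")
dictionary.add_is_a("inkjet printer", "printer")
dictionary.add_part_of("toner cartridge", "laser printer")
expanded_query = dictionary.expand("printer")
```

Under these assumptions, a query for “printer” would be expanded to also cover the two narrower printer types and the toner cartridge, while a query for “inkjet printer” would remain unchanged.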
Ito describes the use of a knowledge base, including a causal model base and a device model base. The device model base has sets of device knowledge describing the hierarchy of devices of the target machine. The causal model base is formed on the basis of the device model base and has sets of causal relations of fault events in the target machine. Thus, the possible cause of failure in each element of a device is guessed on the basis of information about its structural connections with other elements of the device. Usually, these are the most “connected” elements, which are determined as the cause.
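The notion that the most “connected” elements are taken as the likely cause can be sketched as a simple degree ranking over a device connection graph. This is a deliberately simplified illustration and not Ito's actual causal model; the function name and the sample device elements are assumptions made for the example.

```python
from collections import defaultdict


def rank_candidate_causes(connections):
    """Rank device elements by how many structural connections they have.

    `connections` is an iterable of (element_a, element_b) pairs.
    Returns (element, connection_count) pairs, most connected first;
    ties are broken alphabetically so the order is deterministic.
    """
    degree = defaultdict(int)
    for a, b in connections:
        degree[a] += 1
        degree[b] += 1
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical device structure, for illustration only.
device_connections = [
    ("power supply", "controller"),
    ("controller", "motor"),
    ("controller", "sensor"),
    ("motor", "belt"),
]
ranking = rank_candidate_causes(device_connections)
most_likely_cause = ranking[0][0]
```

In this sample structure the controller has the most connections, so it would be the first element proposed as the cause of a fault.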
Boguraev 1 describes the performance of a deep text analysis where, for text segments, the most significant noun groups are marked on the basis of their usage frequency in weighted semantic roles.
Boguraev 2 describes the use of computer-mediated linguistic analysis to create a catalog of key terms in technical fields and also determine doers (solvers) of technical functions (verb-object).
Paik describes an information extraction system that is domain-independent and automatically builds its own subject knowledge base. The basis of this knowledge base is composed of concept-relation-concept triples (CRCs), where the first concept is usually a proper name. This is an example of a quite deep semantic labeling of text that relies on recognition of dyadic relations that link pairs of concepts and monadic relations that are associated with a single concept. The system extracts semantic relationships from the previously part-of-speech tagged and syntactically parsed text by looking for specialized types of concepts and linguistic clues, including some prepositions, punctuation, or specialized phrases.
Of course, the procedure of semantic labeling is restricted in this case by the framework of CRC relations. For example, recognition of cause-effect relationships can be performed only for objects occurring together with certain types of verbs. Such recognition, however, often requires a wider context and, in the general case, should be based on a set of automatically recognized semantic components in texts, the so-called facts. For example, one of the components of such facts is the semantic notion of an “action,” in contrast to merely a “verb.” Given the restriction inherent in the framework of CRC relations, semantic labeling in this case requires the development of a large number of patterns, which is very labor-intensive. Finally, such semantic labeling actually deals only with the topical content of the text and does not take into account its logical content.
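The restriction described above can be illustrated with a minimal sketch of pattern-based extraction of concept-relation-concept triples. The verb table, relation labels, and function name are hypothetical; an actual system such as Paik's relies on specialized concept types and linguistic clues rather than a bare verb lookup. The sketch assumes the text has already been parsed into subject, verb lemma, and object.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """A concept-relation-concept (CRC) triple."""
    concept1: str
    relation: str
    concept2: str


# Hypothetical pattern table: only these verb lemmas trigger a relation.
VERB_PATTERNS = {
    "cause": "CAUSE",
    "produce": "CAUSE",
    "acquire": "OWNS",
}


def extract_triple(subject: str, verb_lemma: str, obj: str) -> Optional[Triple]:
    """Return a CRC triple if the verb matches a known pattern, else None."""
    relation = VERB_PATTERNS.get(verb_lemma)
    if relation is None:
        return None
    return Triple(subject, relation, obj)


# Recognized: the verb "cause" is in the pattern table.
recognized = extract_triple("friction", "cause", "heat")

# Missed: the same cause-effect link is expressed with a verb that has
# no pattern, so a new pattern would have to be written by hand.
missed = extract_triple("heat", "follow", "friction")
```

The second call returns nothing even though the sentence expresses the same cause-effect link, which shows why coverage in such a framework grows only by adding more patterns.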
Messerly performs semantic labeling of text in the logical form “deep subject-verb-deep object.” However, this logical form is a purely grammatical notion; “deep subject” and “deep object” are each only a “noun,” and a “verb” is only a “principal verb.”