1. Field
This application relates to systems and methods for automatic knowledge recognition in text documents and to a natural language interface for accessing a user Knowledge Base, aimed at cross-language extraction of knowledge and documents relevant to a user request.
2. Description of Related Art
The following U.S. Patent and U.S. Patent Publication documents provide descriptions of art related to the present application: U.S. Pat. No. 5,404,295, issued April 1995 to Katz et al. (hereinafter Katz et al.); U.S. Pat. No. 5,694,592, issued December 1997 to Driscoll (hereinafter Driscoll); U.S. Pat. No. 5,724,571, issued March 1998 to Woods (hereinafter Woods); U.S. Pat. No. 5,794,050, issued August 1998 to Dahlgren et al. (hereinafter Dahlgren et al.); U.S. Pat. No. 5,933,822, issued August 1999 to Braden-Harder et al. (hereinafter Braden-Harder et al.); U.S. Pat. No. 5,966,686, issued October 1999 to Heidorn et al. (hereinafter Heidorn et al.); U.S. Pat. No. 6,381,598, issued April 2002 to Williamowski et al. (hereinafter Williamowski et al.); and U.S. Publication No. 20040261021, published December 2004 by Mittal et al. (hereinafter Mittal et al.).
The following non-patent documents also provide descriptions of art related to the present application:
Radev D. R. et al. “Ranking Suspected Answers to Natural Language Question Using Predictive Annotation”, Proceedings of the 6th Applied Natural Language Processing Conference, pp. 150-157, Apr. 29-May 4, 2000 (hereinafter Radev et al.);
Srihari R. et al. “A Question Answering System Supported by Information Extraction”, Proceedings of the 6th Applied Natural Language Processing Conference, pp. 166-172, Apr. 29-May 4, 2000 (hereinafter Srihari et al.);
Cardie C. et al. “Examining the Role of Statistical and Linguistic Knowledge Sources in a General-Knowledge Question-Answering System”, Proceedings of the 6th Applied Natural Language Processing Conference, pp. 180-187, Apr. 29-May 4, 2000 (hereinafter Cardie et al.); and
Abney S. et al. “Answer Extraction”, Proceedings of the 6th Applied Natural Language Processing Conference, pp. 296-301, Apr. 29-May 4, 2000 (hereinafter Abney et al.).
In information-providing systems, information or knowledge may be retrieved or extracted in accordance with user requests or queries. It is preferable that user requests be formulated in natural language (NL). Given such queries, the system attempts to represent them formally by means of special analysis. Systems that make such attempts are referred to as NL understanding systems. The first forms of presentation were sequences of keywords, Boolean expressions composed of keywords, particular lexical units, etc.
Further investigation in the art was plainly required, and new computer-based technologies have since been developed. Such techniques have, for example, dealt with preprocessing available information and analyzing a user request with linguistic means.
For preprocessing, corpus texts may be subjected to stages of tagging, parsing and semantic analysis. The tagging stage, or morphological analysis, comprises extracting words and punctuation symbols from the text and attaching dictionary information to each word, namely all possible forms, senses and grammatical roles the word can have in the sentence. During the parsing stage, the syntactic structure of the sentence is represented in the form of a syntax parse tree, where each leaf node represents one word or punctuation mark of the sentence. Intermediate-level nodes stand for different syntactic formations (e.g., a noun phrase, a verb phrase, a prepositional phrase, etc.), which in turn consist of other syntactic formations or of ordinary words and punctuation marks; this composition is reflected by linking each node from below to one or more existing nodes. A single root node of a complete syntax parse tree represents the entire sentence. The semantic analysis stage assumes a deeper level of understanding of the text, a level analogous to that achieved by a human reader. This last stage derives the various semantic roles that words and syntactic formations play in the text, such as deep subject, deep object, clause, hypernym, means, etc.
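The tagging and parsing stages described above can be sketched in simplified form. The following is an illustrative toy example, not the claimed system: the lexicon, tag names, and tree structure are hypothetical, standing in for a real morphological dictionary and parser.

```python
# Toy sketch of the tagging and parsing stages (hypothetical lexicon).

# Tagging: attach dictionary information (possible parts of speech) to each word.
TOY_LEXICON = {
    "fire":  ["noun", "verb"],
    "heats": ["verb"],
    "water": ["noun", "verb"],
}

def tag(sentence):
    """Return (token, possible_tags) pairs for each word in the sentence."""
    return [(w, TOY_LEXICON.get(w, ["unknown"])) for w in sentence.split()]

# Parsing: a syntax parse tree whose leaves are words, whose intermediate
# nodes are syntactic formations (noun phrase, verb phrase, ...), and whose
# single root node represents the entire sentence.
class Node:
    def __init__(self, label, children=None, word=None):
        self.label = label            # e.g. "S", "NP", "VP", or a POS tag
        self.children = children or []
        self.word = word              # set only on leaf nodes

    def leaves(self):
        """Collect the sentence's words back from the leaf nodes."""
        if self.word is not None:
            return [self.word]
        return [w for c in self.children for w in c.leaves()]

# Hand-built parse of "fire heats water": S -> NP(fire) VP(heats NP(water))
tree = Node("S", [
    Node("NP", [Node("noun", word="fire")]),
    Node("VP", [Node("verb", word="heats"),
                Node("NP", [Node("noun", word="water")])]),
])
```

The semantic analysis stage would operate on such a tree to assign deep roles; that stage is omitted here, as it requires the full linguistic machinery the text describes.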
User requests may be subjected to a similar three-stage analysis. Systems exist that are developed specifically to work with input strings in the form of full-sentence questions; these systems tag, parse, and analyze the semantic structure of a user question.
A machine's understanding of the semantic structure of both the corpus texts and a user request helps in furnishing an adequate response to an input question. That is, this understanding allows the provision of the knowledge embodied in the corpus texts that best fulfills the user request.
The use of part-of-speech (POS) tagging, parsing, and semantic analysis allows the construction of a more correct formal representation of a user query, although some systems also use a dialog with the user. Systems that use tagging, parsing and semantic analysis are known in the art. For example, Katz et al. translate user requests (but not all of them) into a structured form. Dahlgren et al. use an NL understanding module (including a naïve semantic lexicon and noun and verb phrase recognition) that receives an NL input and generates a first-order logic (FOL) output. Both Braden-Harder et al. and Heidorn et al. translate a user request into a logical form graph (LFG), that is, a set of logical form triples. The Braden-Harder and Heidorn method significantly improves a statistics-based search engine, but it is designed only for queries in the form of a single sentence or a sentence fragment. The LFG determines semantic relations between important words in a phrase (deep subject, deep object, etc.), but these in fact amount to the grammatical subject, object, etc. Moreover, separating the query into triples destroys its integral semantic representation, and the LFG element to which the question is directed is not registered. As a result, the system searches for relevant documents, but not for exact answers to the user question.
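The decomposition into logical form triples mentioned above can be illustrated with a minimal sketch. This is not the Braden-Harder or Heidorn implementation; the relation names and the clause decomposition are hypothetical stand-ins for their logical form representation.

```python
# Illustrative sketch: a simple clause reduced to logical form triples,
# in the spirit of the LFG approach described above (relation names are
# hypothetical).

def to_triples(subject, verb, obj):
    """Represent a simple clause as deep-subject / deep-object triples."""
    return {(verb, "deep-subject", subject),
            (verb, "deep-object", obj)}

# "fire heats water" as a set of triples:
lf = to_triples("fire", "heats", "water")
```

As the text notes, once the clause is split into such triples, the integral semantic representation of the whole query is lost: the set records pairwise relations but not which element the question is directed at.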
A drawback of these natural language processing (NLP) systems is that it usually becomes increasingly difficult to add new semantic rules to the system. Adding a new rule generally involves new procedural logic that may conflict with logic already programmed in the semantic subsystem. The size and complexity of an LFG or FOL representation makes its use quite difficult, and even inefficient, for solving many tasks.
Another approach to the development of an NL interface consists not in performing a thorough linguistic analysis of the user query, but in implementing an algorithm that searches a document for the separate words forming the query and then calculates a relevance level. For example, Driscoll and Woods describe the use of a technique called “relaxation ranking” to find specific passages where the highest number of query elements are found together, preferably in the same form and order. Radev et al. and Srihari et al. developed a similar approach by combining Question Answering (QA) and NLP techniques. Radev et al. and Srihari et al. do not use full-scale NLP, but some elements of questions and text documents are indexed by means of semantic categories, for example, the Q/A Tokens described in Radev et al. Cardie et al. combine methods of standard ad-hoc information retrieval (IR), query-dependent text summarization and shallow semantic sentence analysis. However, the Cardie system focuses on the extraction of noun phrases and uses a dialog with the user. Abney et al. make use of both IR and NLP technologies; this makes the Abney system more robust than a pure NLP method, while affording greater precision than a pure IR system would have. But the Abney authors themselves admit that the comparatively low quality of the system requires improvement of the NLP component, development of a larger question corpus, etc.
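The word-matching style of ranking described above can be sketched as follows. This is a hypothetical toy scoring scheme in the spirit of relaxation ranking, not the Driscoll or Woods implementation: passages score one point per query word found, with a bonus when the found words appear in the query's order.

```python
# Toy sketch of passage ranking by query-word co-occurrence and order
# (scoring weights are hypothetical).

def rank_passages(query, passages):
    """Rank passages by how many query words they contain, with an
    order bonus, highest score first."""
    q_words = query.lower().split()
    scored = []
    for p in passages:
        p_words = p.lower().split()
        present = [w for w in q_words if w in p_words]
        score = len(present)
        # Order bonus: the found query words occur in the same order.
        positions = [p_words.index(w) for w in present]
        if len(present) > 1 and positions == sorted(positions):
            score += 1
        scored.append((score, p))
    return [p for score, p in sorted(scored, key=lambda t: -t[0])]
```

Such a scheme calculates a relevance level without any linguistic analysis of the query, which is precisely the limitation the surrounding text discusses.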
Thus, despite the many different approaches to building systems for the analysis and understanding of text, none of them provides an ideal NL user interface. Moreover, failure to perform NL analysis of the user query, or only shallow analysis, may bring inadequate results. Woods states that “linguistic knowledge can improve information retrieval,” and this thesis should be considered relevant to solving the problem. In asking questions, a user wants to receive relevant information, i.e., knowledge. The main elements of this knowledge are: objects/concepts (for example: invention, cool water); facts (fire heats water); and cause-effect relations between the facts, formulated in the form of rules that reflect the regularities of the outer world/subject domain (for example: if F1 (fire heats water to 100 deg.) then F2 (water boils)).
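The three knowledge elements just listed, objects/concepts, facts, and cause-effect rules, can be given a minimal data-structure sketch. The tuple representation and the one-pass rule application below are hypothetical illustrations, not the claimed knowledge base.

```python
# Illustrative sketch of knowledge elements (representation is hypothetical).

# Objects/concepts of the subject domain.
concepts = {"fire", "water"}

# Facts as (subject, relation, object) tuples, e.g. "fire heats water".
base_facts = {("fire", "heats", "water")}

# Cause-effect rule: if F1 (fire heats water to 100 deg.) then F2 (water boils).
rules = [
    {"if":   ("fire", "heats-to-100", "water"),
     "then": ("water", "boils", None)},
]

def apply_rules(facts, rules):
    """Forward-chain once: add each rule's conclusion when its premise
    is among the known facts."""
    derived = set(facts)
    for r in rules:
        if r["if"] in derived:
            derived.add(r["then"])
    return derived
```

A real system would, as the text argues, need to recognize such objects, facts, and rules in the text itself rather than receive them hand-coded.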
Based on the recognition of this linguistic knowledge in text documents, US Patent Appl. Pub. No. 20020010574, titled “Natural Language Processing and Query Driven Information Retrieval”; US Patent Appl. Pub. No. 20020116176, titled “Semantic Answering System and Method”; and US Patent Appl. Pub. No. 20030130837, titled “Computer-based Summarization of Natural Language Documents” describe another approach to the analysis of NL user requests and text documents, based on complete and correct POS-tagging, parsing and semantic analysis of NL. The approach provides analysis of any user NL request and/or text document, and a search for knowledge concerning the objects, facts and regularities of the outer world/subject domain, as well as any of the elements (properties, relations) of this knowledge.
New possibilities for efficient solutions to search problems and knowledge engineering have caused further growth in the usage of text resources. However, the knowledge a user needs may be contained in documents in different languages, while the user prefers to communicate with the system in his or her native language. This results in the problem of cross-language knowledge search and extraction. Existing systems, including those mentioned above, are aimed at information search, not knowledge search. Therefore, those that address the “cross-language problem” typically solve it by simply translating keywords from a user query using bilingual dictionaries. For example, Williamowski et al. use as a user query an expression formed by keywords (elementary words) and Boolean operators. These words are then translated using domain-specific dictionaries and stemmed, resulting in a set of combinations of stemmed and translated elementary words. Using this set of search expressions, the Williamowski et al. system performs a conventional keyword search in documents in the corresponding natural languages, verifying the correct linguistic structure of the search keywords in the retrieved documents. Mittal et al. translate terms obtained from a user query written in a first format into a second format using a probabilistic dictionary, search a database for information relevant to the translated query, and return to the user search results written in the second format. Unlike the Williamowski et al. method, Mittal et al. suggest a method for building the probabilistic dictionary using Google™ anchor-based corpora. Such corpora typically have poor semantic structure at the sentence level and cannot be used for precise semantic comparison, resulting in what is essentially a keyword search.
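The keyword-translation approach criticized above can be sketched in a few lines. The bilingual dictionary and document samples below are hypothetical toy data, illustrating the general scheme rather than the Williamowski et al. or Mittal et al. implementations.

```python
# Toy sketch of cross-language search via bilingual keyword translation
# (dictionary entries are hypothetical).

BILINGUAL = {               # English -> French, toy domain dictionary
    "water": ["eau"],
    "boils": ["bout"],
}

def translate_keywords(query):
    """Expand each query keyword into its dictionary translations,
    keeping untranslatable words as-is."""
    out = []
    for w in query.lower().split():
        out.extend(BILINGUAL.get(w, [w]))
    return out

def keyword_search(keywords, documents):
    """Return target-language documents containing at least one
    translated keyword."""
    return [d for d in documents
            if any(k in d.lower().split() for k in keywords)]
```

As the text observes, this scheme matches surface words only; it carries no semantic representation of the query, which is why it amounts to an information search rather than a knowledge search.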
Hence, given the necessity of deep linguistic (including semantic) analysis of the user query and text documents, embodiments of the present invention address the “cross-language problem” by considering the results of such analysis, even at the dictionary-building stage.