1. Field of the Invention
The invention relates generally to information retrieval systems, and more particularly, the invention relates to a novel query/answer system and method for open domains implementing a deferred type evaluation of candidate answers.
2. Description of the Related Art
An introduction to the current issues and approaches of Question Answering (QA) can be found in the web-based reference http://en.wikipedia.org/wiki/Question_answering. Generally, question answering is a type of information retrieval. Given a collection of documents (such as the World Wide Web or a local collection), the system should be able to retrieve answers to questions posed in natural language. QA is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval such as document retrieval, and it is sometimes regarded as the next step beyond search engines.
QA research attempts to deal with a wide range of question types including: fact, list, definition, How, Why, hypothetical, semantically-constrained, and cross-lingual questions. Search collections vary from small local document collections, to internal organization documents, to compiled newswire reports, to the World Wide Web.
Closed-domain question answering deals with questions under a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. Open-domain question answering deals with questions about nearly everything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.
Alternatively, closed-domain might refer to a situation where only limited types of questions are accepted, such as questions asking for descriptive rather than procedural information.
Access to information is currently dominated by two paradigms: a database query that answers questions about what is in a collection of structured records; and, a search that delivers a collection of document links in response to a query against a collection of unstructured data (text, html etc.).
One major unsolved problem in such information query paradigms is the lack of a computer program capable of answering factual questions based on information included in a collection of documents (of all kinds, structured and unstructured). Such questions can range from broad such as “what are the risks of vitamin K deficiency” to narrow such as “when and where was Hillary Clinton's father born”.
The challenge is to understand the query, to find appropriate documents that might contain the answer, and to extract the correct answer to be delivered to the user. Currently, understanding the query is an open problem because computers do not have the human ability to understand natural language, nor do they have the common sense to choose from the many possible interpretations that current (very elementary) natural language understanding systems can produce.
In the patent literature, U.S. Patent Publication Nos. 2007/0203863 A1 and 2007/0196804 A1, U.S. Pat. No. 7,236,968, and EP Patent No. 1797509 A2 describe generally the state of the art in QA technology.
U.S. Patent Pub. No. 2007/0203863 A1 entitled “Meta learning for question classification” describes a system and a method for automatic question classification and answering. A multipart artificial neural network (ANN) comprising a main ANN and an auxiliary ANN classifies a received question according to one of a plurality of defined categories. Unlabeled data is received from a source, such as a plurality of human volunteers. The unlabeled data comprises additional questions that might be asked of an autonomous machine such as a humanoid robot, and is used to train the auxiliary ANN in an unsupervised mode. The unsupervised training can comprise multiple auxiliary tasks that generate labeled data from the unlabeled data, thereby learning an underlying structure. Once the auxiliary ANN has been trained, the weights are frozen and transferred to the main ANN. The main ANN can then be trained using labeled questions. The original question to be answered is applied to the trained main ANN, which assigns one of the defined categories. The assigned category is used to map the original question to a database that most likely contains the appropriate answer. An object and/or a property within the original question can be identified and used to formulate a query, using, for example, structured query language (SQL), to search for the answer within the chosen database. The invention described therein makes efficient use of available information, and improves training time and error rate relative to use of single part ANNs.
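By way of illustration only, the final steps summarized above (mapping an assigned category to a database and formulating an SQL query from an object identified in the question) might be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the cited publication: the `classify` function is a trivial keyword rule standing in for the trained main ANN, and the names `CATEGORY_TABLE`, `make_demo_db`, `answer`, and the `capitals` table are hypothetical.

```python
import sqlite3
from typing import Optional

# Hypothetical mapping from question category to the table assumed most
# likely to hold the answer. All names here are illustrative only.
CATEGORY_TABLE = {"geography": "capitals"}


def classify(question: str) -> str:
    """Stand-in for the trained main ANN: a trivial keyword rule."""
    return "geography" if "capital" in question.lower() else "unknown"


def make_demo_db() -> sqlite3.Connection:
    """Build a small in-memory database used only for this sketch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE capitals (country TEXT PRIMARY KEY, capital TEXT)")
    conn.executemany(
        "INSERT INTO capitals VALUES (?, ?)",
        [("France", "Paris"), ("Japan", "Tokyo")],
    )
    return conn


def answer(conn: sqlite3.Connection, question: str, obj: str) -> Optional[str]:
    """Map the question's category to a table, then query it for `obj`."""
    table = CATEGORY_TABLE.get(classify(question))
    if table is None:
        return None  # no database is associated with this category
    # The table name comes from the fixed mapping above; the object
    # extracted from the question is passed as a bound parameter.
    row = conn.execute(
        f"SELECT capital FROM {table} WHERE country = ?", (obj,)
    ).fetchone()
    return row[0] if row else None
```

In this sketch, the object of the question (for example, "France") is assumed to have been identified by an earlier step and is supplied by the caller.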
U.S. Patent Publication No. 2007/0196804 A1 entitled “Question-answering system, question-answering method, and question-answering program” describes a question-answering system that is formed with an information processing apparatus for processing information in accordance with a program, and that obtains an answer to an input search question sentence by searching a knowledge source. The system includes: a background information set; a first answer candidate extracting unit; a first background information generating unit; an accuracy determining unit; and a first background information adding unit.
U.S. Pat. No. 7,236,968 entitled “Question-answering method and question-answering apparatus” describes a method in which a question document is divided into predetermined areas, and it is judged whether each divided area is important, to thereby extract an important area. A reply example candidate likelihood value is calculated for each important area, the likelihood value indicating the degree to which each reply example candidate corresponds to a question content. By using the reply example candidate likelihood value, important areas having similar meanings are combined to extract final important parts. A reply example candidate is selected for each important part from reply example candidates prepared beforehand. A reply example candidate reliability degree representative of certainty of each reply example candidate and a reply composition degree indicating whether it is necessary to compose a new reply are calculated, and by using these values, question documents are distributed to different operator terminals.
U.S. Pat. No. 7,216,073 provides a reference to question answering using natural language in addition to a comprehensive summary of prior art.
In the patent literature, U.S. Pat. No. 7,293,015 describes a method for retrieving answers to questions from an information retrieval system. The method involves automatically learning phrase features for classifying questions into different types, automatically generating candidate query transformations from a training set of question/answer pairs, and automatically evaluating the candidate transforms on information retrieval systems. At run time, questions are transformed into a set of queries, and re-ranking is performed on the documents retrieved.
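The run-time behavior summarized above, in which a question is transformed into a set of queries and the retrieved documents are re-ranked, might be sketched as follows. This is an illustrative sketch only and makes several assumptions: the transforms in `TRANSFORMS` are hand-written here, whereas the cited patent describes learning them automatically from question/answer pairs; retrieval is reduced to substring matching over an in-memory dictionary; and re-ranking simply counts how many transformed queries each document matches. All names are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical, hand-written transforms: a question pattern paired with
# query templates. The cited patent learns such transforms automatically.
TRANSFORMS = [
    (
        re.compile(r"^what is (?:a |an |the )?(.+?)\??$", re.IGNORECASE),
        ["{0} is", "{0} refers to", "{0}"],
    ),
]


def transform(question: str) -> List[str]:
    """Rewrite a question into a list of candidate queries."""
    for pattern, templates in TRANSFORMS:
        m = pattern.match(question.strip())
        if m:
            return [t.format(m.group(1)) for t in templates]
    return [question]  # no transform applies: fall back to the raw question


def retrieve_and_rerank(question: str, docs: Dict[str, str]) -> List[str]:
    """Return ids of matching documents, best first.

    A document's score is the number of transformed queries it contains.
    """
    hits: Counter = Counter()
    for query in transform(question):
        for doc_id, text in docs.items():
            if query.lower() in text.lower():
                hits[doc_id] += 1
    return [doc_id for doc_id, _ in hits.most_common()]
```

With this scoring, a document containing the phrase "hard disk is" ranks above one that merely mentions "hard disk", because it matches more of the transformed queries.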
In the patent literature, U.S. Pat. No. 7,313,515 describes techniques for detecting entailment and contradiction. Packed knowledge representations for a premise and conclusion text are determined comprising facts about the relationships between concept and/or context denoting terms. Concept and context alignments are performed based on alignment scores. A union is determined. Terms are marked as to their origin, and conclusion text terms are replaced by corresponding terms from the premise text. Subsumption and specificity, instantiability, spatio-temporal and relationship based packed rewrite rules are applied in conjunction with the context denoting facts to remove entailed terms and to mark contradictory facts within the union. Entailment is indicated by a lack of any facts from the packed knowledge representation of the conclusion in the union. Entailment and contradiction markers are then displayed.
U.S. Pat. No. 7,299,228 describes a technique for extracting information from an information source. During extraction, strings in the information source are accessed. These strings in the information source are matched with generalized extraction patterns that include words and wildcards. The wildcards denote that at least one word in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern.
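Matching a string against a generalized extraction pattern of words and wildcards, as summarized above, might be sketched as follows. This is a minimal sketch and not the cited patent's implementation. It assumes that a wildcard, written here with the hypothetical marker `*`, skips one or more words, that all other pattern tokens must match a word exactly (ignoring case), and that the pattern must cover the whole string.

```python
from typing import List

WILDCARD = "*"  # hypothetical marker; the notation is an assumption


def match_pattern(pattern: List[str], words: List[str]) -> bool:
    """Return True if `words` matches `pattern` in full.

    Each WILDCARD skips one or more words; every other pattern token
    must equal the corresponding word, compared case-insensitively.
    """
    if not pattern:
        return not words  # both must be exhausted together
    head, rest = pattern[0], pattern[1:]
    if head == WILDCARD:
        # Try skipping 1, 2, ... words, then match the remaining pattern.
        return any(
            match_pattern(rest, words[k:]) for k in range(1, len(words) + 1)
        )
    if not words or words[0].lower() != head.lower():
        return False
    return match_pattern(rest, words[1:])
```

For example, the pattern `["born", "in", "*", "on"]` matches the words of "born in New York on" by skipping "New York", but does not match "born in on", since the wildcard must skip at least one word under the assumption stated above.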
While the use of static ontologies (e.g., lists of types and relations between them) is typical in some Question Answering systems, the use of “dynamic ontologies,” i.e., user-customizable dictionaries of terms and relations, has also been described in the different but broadly related context of knowledge organization.
U.S. Pat. No. 6,487,545 describes a knowledge catalog including a plurality of independent and parallel static ontologies to accurately represent a broad coverage of concepts that define knowledge. The actual configuration, structure and orientation of a particular static ontology is dependent upon the subject matter or field of the ontology in that each ontology contains a different point of view. The static ontologies store all senses for each word and concept. A knowledge classification system that includes the knowledge catalog is also disclosed. A knowledge catalog processor accesses the knowledge catalog to classify input terminology based on the knowledge concepts in the knowledge catalog. Furthermore, the knowledge catalog processor processes the input terminology prior to attachment in the knowledge catalog. The knowledge catalog further includes a dynamic level that includes dynamic hierarchies. The dynamic level adds details for the knowledge catalog by including additional words and terminology, arranged in a hierarchy, to permit a detailed and in-depth coverage of specific concepts contained in a particular discourse. The static and dynamic ontologies are relational such that the linking of one or more ontologies, or portions thereof, results in a very detailed organization of knowledge concepts.
Both static and dynamic ontologies are given in advance of query processing. These methods have only limited effectiveness. The performance of current QA systems even on restricted corpora is not good enough to provide significant productivity improvement over search.
Being able to answer factual queries is of great potential value to society, as it enables real-time access to accurate information. Similarly, advancing the state of the art in question answering has great business value, since it provides a real-time view of the business, its competitors, economic conditions, etc. Even in its most elementary form, it can improve the productivity of information workers by orders of magnitude.
It would be highly desirable to provide a computing infrastructure and methodology for conducting question answering.