1. Field of the Invention
The invention relates generally to information retrieval systems and, more particularly, to a novel query/answer system and method implementing a degree of parallel analysis for providing answers to questions based on generating and quickly evaluating many candidate answers.
2. Description of the Related Art
An introduction to the current issues and approaches of Question Answering (QA) can be found in the web-based reference http://en.wikipedia.org/wiki/Question_answering. Generally, question answering is a type of information retrieval. Given a collection of documents (such as the World Wide Web or a local collection), the system should be able to retrieve answers to questions posed in natural language. QA is regarded as requiring more complex natural language processing (NLP) techniques than other types of information retrieval such as document retrieval, and it is sometimes regarded as the next step beyond search engines.
QA research attempts to deal with a wide range of question types including: fact, list, definition, How, Why, hypothetical, semantically-constrained, and cross-lingual questions. Search collections vary from small local document collections, to internal organization documents, to compiled newswire reports, to the World Wide Web.
Closed-domain question answering deals with questions under a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. Open-domain question answering deals with questions about nearly everything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.
Alternatively, closed-domain might refer to a situation where only limited types of questions are accepted, such as questions asking for descriptive rather than procedural information.
Access to information is currently dominated by two paradigms: a database query that answers questions about what is in a collection of structured records; and a search that delivers a collection of document links in response to a query against a collection of unstructured data (text, HTML, etc.).
One major unsolved problem in such information query paradigms is the lack of a computer program capable of answering factual questions based on information included in a large collection of documents (of all kinds, structured and unstructured). Such questions can range from broad, such as “what are the risks of vitamin K deficiency”, to narrow, such as “when and where was Hillary Clinton's father born”.
User interaction with such a computer program could be either a single user-computer exchange or a multiple-turn dialog between the user and the computer system. Such a dialog can involve one or multiple modalities (text, voice, tactile, gesture, etc.). Examples of such interaction include a situation where a cell phone user asks a question using voice and receives an answer in a combination of voice, text and image (e.g., a map with a textual overlay and a spoken (computer generated) explanation). Another example would be a user interacting with a video game and dismissing or accepting an answer using machine recognizable gestures, or the computer generating tactile output to direct the user.
The challenge in building such a system is to understand the query, to find appropriate documents that might contain the answer, and to extract the correct answer to be delivered to the user. Currently, understanding the query is an open problem because computers do not have the human ability to understand natural language, nor do they have the common sense to choose from the many possible interpretations that current (very elementary) natural language understanding systems can produce.
In the patent literature, U.S. Patent Publication Nos. 2007/0203863A1 and 2007/0196804A1, U.S. Pat. No. 7,236,968 and EP Patent No. 1797509A2 describe generally the state of the art in QA technology.
U.S. Patent Pub. No. 2007/0203863A1 entitled “Meta learning for question classification” describes a system and a method for automatic question classification and answering. A multipart artificial neural network (ANN) comprising a main ANN and an auxiliary ANN classifies a received question according to one of a plurality of defined categories. Unlabeled data is received from a source, such as a plurality of human volunteers. The unlabeled data comprises additional questions that might be asked of an autonomous machine such as a humanoid robot, and is used to train the auxiliary ANN in an unsupervised mode. The unsupervised training can comprise multiple auxiliary tasks that generate labeled data from the unlabeled data, thereby learning an underlying structure. Once the auxiliary ANN has been trained, the weights are frozen and transferred to the main ANN. The main ANN can then be trained using labeled questions. The original question to be answered is applied to the trained main ANN, which assigns one of the defined categories. The assigned category is used to map the original question to a database that most likely contains the appropriate answer. An object and/or a property within the original question can be identified and used to formulate a query, using, for example, structured query language (SQL), to search for the answer within the chosen database. According to that publication, the described invention makes efficient use of available information, and improves training time and error rate relative to use of single part ANNs.
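The final steps described above (mapping an assigned category to a database and formulating an SQL query from an identified object) can be pictured with the following minimal Python sketch. It is not the method of the cited publication: the trained ANN is replaced by a trivial keyword rule, the object is supplied by the caller rather than identified automatically, and all table, column and function names are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical category -> (table, column) mapping; names are illustrative only.
CATEGORY_TO_TABLE = {
    "location": ("locations", "place"),
    "color": ("colors", "color"),
}


def classify(question):
    """Stand-in for the trained main ANN: a trivial keyword rule (assumption)."""
    return "location" if question.lower().startswith("where") else "color"


def answer(conn, question, obj):
    """Map the question's category to a table and query it for the object."""
    table, column = CATEGORY_TO_TABLE[classify(question)]
    # Table and column names come from the fixed mapping above, never from
    # user input; the object value is passed as a bound parameter.
    row = conn.execute(
        f"SELECT {column} FROM {table} WHERE object = ?", (obj,)
    ).fetchone()
    return row[0] if row else None


def build_demo_db():
    """Create a tiny in-memory database with made-up demonstration rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE locations (object TEXT, place TEXT)")
    conn.execute("CREATE TABLE colors (object TEXT, color TEXT)")
    conn.execute("INSERT INTO locations VALUES ('cup', 'kitchen')")
    conn.execute("INSERT INTO colors VALUES ('cup', 'white')")
    return conn
```

For example, with `conn = build_demo_db()`, the call `answer(conn, "Where is the cup?", "cup")` selects the `locations` table because the stand-in classifier assigns the `location` category.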
U.S. Patent Publication No. 2007/0196804A1 entitled “Question-answering system, question-answering method, and question-answering program” describes a question-answering system that is formed with an information processing apparatus for processing information in accordance with a program, and that obtains an answer to an input search question sentence by searching a knowledge source. The system includes: a background information set; a first answer candidate extracting unit; a first background information generating unit; an accuracy determining unit; and a first background information adding unit.
U.S. Pat. No. 7,236,968 entitled “Question-answering method and question-answering apparatus” describes a method in which a question document is divided into predetermined areas, and it is judged whether each divided area is important, to thereby extract an important area. A reply example candidate likelihood value is calculated for each important area, the likelihood value indicating the degree to which each reply example candidate corresponds to the question content. By using the reply example candidate likelihood value, important areas having similar meanings are combined to extract final important parts. A reply example candidate is selected for each important part from reply example candidates prepared beforehand. A reply example candidate reliability degree representative of the certainty of each reply example candidate and a reply composition degree indicating whether it is necessary to compose a new reply are calculated, and by using these values, question documents are distributed to different operator terminals.
U.S. Pat. No. 7,216,073 provides a reference to parallel processing in question answering using natural language in addition to a comprehensive summary of prior art.
In the patent literature, U.S. Pat. No. 7,293,015 describes a method for retrieving answers to questions from an information retrieval system. The method involves automatically learning phrase features for classifying questions into different types, automatically generating candidate query transformations from a training set of question/answer pairs, and automatically evaluating the candidate transforms on information retrieval systems. At run time, questions are transformed into a set of queries, and re-ranking is performed on the documents retrieved.
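The run-time behavior described above (transforming a question into a set of queries and re-ranking retrieved documents) can be sketched roughly as follows in Python. This is an illustrative simplification, not the patented method: the transform table is hard-coded rather than learned from question/answer pairs, the re-ranking score is a simple count of matching queries, and all names and sample strings are assumptions made for this example.

```python
from typing import Dict, List

# Hypothetical question-phrase -> candidate transforms table. In the described
# method these would be learned automatically; here they are hard-coded
# purely for illustration.
TRANSFORMS: Dict[str, List[str]] = {
    "what is a": ["{} is a", "{} refers to"],
    "who invented": ["{} was invented by", "inventor of {}"],
}


def transform_question(question: str) -> List[str]:
    """Rewrite a question into queries using the matching phrase's transforms."""
    q = question.lower().rstrip("?").strip()
    for phrase, templates in TRANSFORMS.items():
        if q.startswith(phrase + " "):
            rest = q[len(phrase):].strip()
            return [t.format(rest) for t in templates]
    return [q]  # no known phrase: fall back to the question text itself


def rerank(documents: List[str], queries: List[str]) -> List[str]:
    """Order documents by how many transformed queries they contain."""
    def score(doc: str) -> int:
        text = doc.lower()
        return sum(1 for query in queries if query in text)
    # sorted() is stable, so documents with equal scores keep their input order.
    return sorted(documents, key=score, reverse=True)
```

In this sketch, `transform_question("Who invented the telephone?")` yields two queries, and `rerank` moves documents containing more of those query strings toward the front of the list.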
U.S. Pat. No. 7,313,515 describes techniques for detecting entailment and contradiction. Packed knowledge representations for a premise text and a conclusion text are determined, comprising facts about the relationships between concept- and/or context-denoting terms. Concept and context alignments are performed based on alignment scores. A union is determined. Terms are marked as to their origin, and conclusion text terms are replaced by corresponding terms from the premise text. Subsumption and specificity, instantiability, spatio-temporal and relationship-based packed rewrite rules are applied in conjunction with the context-denoting facts to remove entailed terms and to mark contradictory facts within the union. Entailment is indicated by a lack of any facts from the packed knowledge representation of the conclusion in the union. Entailment and contradiction markers are then displayed.
U.S. Pat. No. 7,299,228 describes a technique for extracting information from an information source. During extraction, strings in the information source are accessed. These strings in the information source are matched with generalized extraction patterns that include words and wildcards. The wildcards denote that at least one word in an individual string can be skipped in order to match the individual string to an individual generalized extraction pattern.
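As a rough illustration of matching strings against generalized extraction patterns made of words and wildcards, the following Python sketch compiles such a pattern into a regular expression. The `*` wildcard token, the function names and the sample sentences are assumptions made for this example and are not taken from the cited patent; each wildcard here skips one or more words.

```python
import re

WILDCARD = "*"  # hypothetical wildcard token chosen for this sketch


def compile_pattern(pattern: str) -> re.Pattern:
    """Turn a pattern of words and wildcards into a compiled regex.

    Each wildcard matches one or more whitespace-separated words;
    every other token must match literally (case-insensitive).
    """
    parts = []
    for token in pattern.split():
        if token == WILDCARD:
            parts.append(r"\S+(?:\s+\S+)*?")  # lazily skip at least one word
        else:
            parts.append(re.escape(token))
    return re.compile(r"\s+".join(parts), re.IGNORECASE)


def matches(pattern: str, text: str) -> bool:
    """Return True if the whole string matches the generalized pattern."""
    return compile_pattern(pattern).fullmatch(text) is not None
```

With the pattern `"* was born in *"`, the string `"Wolfgang Amadeus Mozart was born in Salzburg"` matches because the first wildcard skips three words, whereas `"was born in Salzburg"` does not match because the leading wildcard must consume at least one word.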
U.S. Pat. No. 6,665,666 describes a technique for answering factoid questions based on patterns and question templates, and utilizing a search process over a repository of unstructured data (text).
Methods of automatically generating natural language expressions from a formal representation have been previously disclosed, for example, in U.S. Pat. Nos. 5,237,502 and 6,947,885.
U.S. Pat. Nos. 6,829,603 and 6,983,252 teach how an interactive dialog system using a dialog manager module maintains and directs interactive sessions between each of the users and the computer system and how to provide a mechanism for providing mixed-initiative control for such systems. A mixed initiative approach is where the user is not constrained to answer the system's direct questions but may answer in a less rigid/structured manner. U.S. Pat. No. 7,136,909 teaches how the interactive dialog systems can be extended to multimodal communication for accessing information and service, with the interaction involving multiple modalities of text, audio, video, gesture, tactile input and output, etc.
Being able to answer a factual query in one or multiple dialog turns is of potentially great value to society, as it enables real time access to accurate information. Similarly, advancing the state of the art in question answering has great business value, since it provides a real time view of the business, its competitors, economic conditions, etc. Even in its most elementary form, question answering can improve the productivity of information workers by orders of magnitude.
It would be highly desirable to provide a computing infrastructure and methodology for conducting question answering.