This invention relates to the field of querying and searching collections of text. More specifically, the invention relates to querying and searching collections of text in a networking environment.
Text documents contain a great deal of factual information. For example, an encyclopedia contains many text articles consisting almost entirely of factual information. Newspaper articles contain many facts along with descriptions of newsworthy events. The World Wide Web contains millions of text documents, which in turn contain at least a small amount of factual information.
Given this collection of factual information, we naturally desire the ability to answer questions based on this information using automatic computer programs. Previously two kinds of computer programs have been created to search factual information: database management systems and information retrieval systems. A database management system (DBMS) assumes that information is stored in a structured fashion, such that each data element has a known data type and a set of legal operations. For example, the relational database management system (RDBMS) provides the Structured Query Language, or SQL, which specifies a syntax and grammar for the formulation of queries against the database. SQL is based on a relational calculus that restricts queries to include only certain operations on certain data types, and certain combinations of those operations.
A relational database is tailored to applications where the factual information is available in a structured form. To address the factual information contained in free text documents, information retrieval (IR) systems were created. An information retrieval system indexes a collection of documents using textual features (e.g., words, noun phrases, named entities, etc.). The document collection can then be searched using either Boolean queries or natural language queries. A Boolean query consists of textual features and Boolean operators (e.g., and, or, not). To evaluate a Boolean query, an IR system returns the set of documents that satisfies the Boolean expression. A natural language query is a free form text query that describes the user's information need. Documents likely to satisfy this information need are then found using a retrieval model that matches the query with documents in the collection. Popular models include the probabilistic and vector space models, both of which use text feature occurrence statistics to match queries with documents. In all cases, an IR system only identifies entire documents that are likely to satisfy a user's information need.
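The Boolean evaluation described above can be sketched as follows. This is a minimal illustration, not any particular IR system: the function names, the whitespace tokenizer, and the three sample documents are all assumptions made for the example. Note that the result is always a set of whole documents, which is the limitation the paragraph points out.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_and(index, *terms):
    """Documents containing every term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def boolean_or(index, *terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(t, set()) for t in terms))

def boolean_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())

# Hypothetical three-document collection used only for illustration.
DOCS = {
    1: "the world population reached six billion",
    2: "the population of the city is growing",
    3: "world trade is growing",
}
INDEX = build_index(DOCS)

# "population AND NOT world" returns document ids, never a specific answer.
result = boolean_and(INDEX, "population") & boolean_not(INDEX, DOCS, "world")
```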
Ideally, the user would be able to phrase a specific question, e.g., "What is the population of the world?", and the computer program would respond with a specific answer, e.g., "6 billion". Moreover, the computer program would produce these answers by analyzing the factual information available in the vast supply of text documents, examples of which were given previously. Thus, the problem at hand is how to automatically process free text documents and provide specific answers to questions based on the factual information contained in the analyzed documents.
When users have an information need, search engines are typically used to find the desired information in a large collection of documents. The user's query is treated as a bag of words, and these are matched against the contents of the documents in the collection; the documents with the best matches, according to some scoring function, are returned at the top of a hit-list. Such an approach can be quite effective when one is looking for information about some topic. However, if one desires an answer to a question, a different approach has to be attempted for the following reasons: (1) Using a standard search engine approach, the user gets back documents, rather than answers to a question. This then requires browsing the documents to see if they do indeed contain the desired answers (which they may not), which can be a time-consuming process. (2) No attempt is made to even partially understand the question and make appropriate modifications to the processing. So, for example, if the question is "Where is XXX", the word "where" will either be left intact and submitted to the search engine, which is a problem since any text giving the location of XXX is very unlikely to include the word "where", or the word will be considered a stop-word and stripped out, leaving the search engine with no clue that a location is sought.
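The difficulty with the word "where" can be made concrete with a small sketch. The stop-word list, the overlap scoring function, and the sample question and sentence below are assumptions chosen for illustration; they do not correspond to any specific search engine.

```python
# Hypothetical stop-word list; real engines use much longer ones.
STOP_WORDS = {"where", "is", "the", "of", "a", "in"}

def bag_of_words(query, remove_stop_words=True):
    """Reduce a query to an unordered list of lowercase tokens."""
    tokens = [t.strip("?.,!").lower() for t in query.split()]
    tokens = [t for t in tokens if t]
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens

def overlap_score(query_tokens, document):
    """Count distinct query tokens that occur in the document."""
    doc_tokens = {t.strip("?.,!").lower() for t in document.split()}
    return sum(1 for t in set(query_tokens) if t in doc_tokens)

QUESTION = "Where is Belize?"
ANSWER_TEXT = "Belize lies on the eastern coast of Central America."

kept = bag_of_words(QUESTION, remove_stop_words=False)  # "where" left intact
stripped = bag_of_words(QUESTION)                       # "where" removed
```

In both cases the answering sentence matches only on "belize": keeping "where" adds a term the sentence does not contain, and stripping it discards the only hint that a location is wanted.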
The above discussion describes the most commonly found situation. There are two approaches that have been used in an attempt to provide better service for the end-user.
The first of these does not directly use search engines at all, and is currently in use by AskJeeves (www.askjeeves.com). This approach uses a combination of databases of facts, semantic networks, ontologies of terms and a way to match user's questions to this data to convert the user's question to one or more standard questions. Thus the user will ask a question, and the system will respond with a list of questions that the system can answer. These latter questions match the user's question in the sense that they share some keywords in common. A mapping exists between these standard questions and reference material, which is usually in the form of topical Web pages. This is done by generating for these pages one or more templates or annotations, which are matched against the user's questions. These templates may be either in natural-language or structured form. The four major problems with this approach are:
(1) Building and maintaining this structure is extremely labor-intensive and potentially error-prone, and is certainly subjective.
(2) When new textual material (such as news articles) comes into existence, it cannot automatically be incorporated in the "knowledge-base" of the system, but must wait until a human makes the appropriate links, if at all. This deficiency creates a time-lag at best and a permanent hole in the system's capability at worst. Furthermore, using this approach, only a pointer to some text where the answer may be found is given instead of the answer itself. For instance, asking "How old is President Clinton?" returns a set of links to documents containing information about President Clinton; however, there is no guarantee that any of these documents will contain the age of the President. Generating these templates automatically cannot be done accurately with the current state of the art in automatic text understanding.
(3) It can easily happen that there is no match between the question and pre-stored templates; in such cases these prior art systems default to standard (non-Question-Answering) methods of searching.
(4) There is no clear way to compute the degree of relevance between the question and the text that is returned, so it is not straightforward to determine how to rank-order these texts.
The second approach uses traditional search-engines with post-processing by linguistic algorithms, and is the default mechanism suggested and supported by the TREC-8 Question-Answering track. In this approach, a question is submitted to a traditional search engine and documents are returned in the standard manner. It is expected that many of these documents will be false hits, for reasons outlined earlier. Linguistic processing is then applied to these documents to detect one (or more) instances of text fragments that correspond to an answer to the question in hand. The thinking here is that it is too computationally expensive to apply sophisticated linguistic processing to a corpus that might be several gigabytes in size, but it is reasonable to apply such processing to a few dozen or even a few hundred documents that come back at the top of the hit list. The problem with this approach, though, is that, again for reasons given earlier, even the top documents in the hit-list so generated might not contain the sought-after answers. In fact, there may well be documents that do answer the questions in the corpus, but score so poorly using traditional ranking algorithms that they fail to appear in the top section of the hit-list that is subject to the linguistic processing.
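The failure mode of this second approach can be sketched in a few lines. Everything here is an illustrative assumption: the term-frequency ranking stands in for a traditional search engine, a simple regular expression stands in for the linguistic post-processing, and the three documents are invented.

```python
import re

def keyword_score(query_tokens, text):
    """Term-frequency score: total occurrences of query tokens in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(words.count(t) for t in query_tokens)

def top_k(query_tokens, docs, k):
    """Return the k highest-scoring documents (stable for ties)."""
    ranked = sorted(docs, key=lambda d: -keyword_score(query_tokens, d))
    return ranked[:k]

# Stand-in for expensive linguistic processing: find a quantity phrase.
QUANTITY = re.compile(r"\b(?:\w+ )?(?:million|billion)\b")

def extract_answers(docs):
    """Apply the answer detector to each document in turn."""
    return [m.group(0) for d in docs for m in QUANTITY.finditer(d)]

DOCS = [
    "World population growth and world trade are linked.",
    "Population studies are funded by the World Bank.",
    "About six billion people now live on the planet.",
]
QUERY = ["world", "population"]

from_hit_list = extract_answers(top_k(QUERY, DOCS, 2))  # post-process top 2 only
from_corpus = extract_answers(DOCS)                     # post-process everything
```

The third document answers the question but shares no keywords with the query, so it never reaches the post-processing stage when only the top of the hit-list is examined.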
The prior art can answer structured (SQL) questions posed against structured data but cannot deal with unstructured questions or unstructured data. EasyAsk (TM) (www.EasyAsk.com) is a system which (after some training on the underlying database) takes questions posed in plain English and translates them into an SQL query which then retrieves the data. The answers to the questions are constrained to values stored in the database.
An object of this invention is an improved system and method for determining specific answers from queries of text.
An object of this invention is an improved system and method for determining answers from queries against free form text.
An object of this invention is an improved system and method for determining answers using free form queries.
The present invention is a system, method, and program product that comprises a computer with a collection of documents to be searched. The documents contain free form (natural language) text. We define a set of labels called QA-Tokens, which function as abstractions of phrases or question-types. We define a pattern file, which consists of a number of pattern records, each of which has a question template, an associated question word pattern, and an associated set of QA-Tokens. We describe a query-analysis process which receives a query as input and matches it to one or more of the question templates, where a priority algorithm determines which match is used if there is more than one. The query-analysis process then replaces the associated question word pattern in the matching query with the associated set of QA-Tokens, and possibly some other words. The tokens in the processed query (the words and QA-Tokens) are optionally converted to lemma form, stop-words are optionally removed and tokens are optionally assigned weights. This results in a processed query having some combination of original query tokens, new tokens from the pattern file, and QA-Tokens, possibly with weights. We describe a pattern-matching process that identifies patterns of text in the document collection and augments the location with corresponding QA-Tokens. We define a text index data structure which is an inverted list of the locations of all of the words in the document collection, together with the locations of all of the augmented QA-Tokens. A search process then matches the processed query against a window of a user-selected number of sentences that is slid across the document texts. The windows are assigned a score from the number of matches of words in the window with words in the processed query, weighting if desired. A hit-list of top-scoring windows is returned to the user.
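The flow summarized above (query analysis, annotation of text with QA-Tokens, and window scoring) might be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the QA-Token names `PLACE$` and `NUMBER$`, the two pattern records, the text patterns, and the sample sentences are hypothetical; lemmatization and token weights are omitted; no inverted index is built; and a QA-Token is attached to the whole sentence rather than to the exact location of the matched phrase.

```python
import re

# Hypothetical pattern records: (question template, question word pattern, QA-Tokens).
# Records are tried in order, which serves as a trivial priority rule.
PATTERN_FILE = [
    (re.compile(r"^where\b", re.I), re.compile(r"^where\b", re.I), ["PLACE$"]),
    (re.compile(r"^how many\b", re.I), re.compile(r"^how many\b", re.I), ["NUMBER$"]),
]

# Hypothetical text patterns used to augment documents with QA-Tokens.
TEXT_PATTERNS = [
    (re.compile(r"\bin (?:[A-Z][a-z]+)(?: [A-Z][a-z]+)*"), "PLACE$"),
    (re.compile(r"\b\d+(?:\.\d+)? (?:million|billion)\b"), "NUMBER$"),
]

STOP_WORDS = {"is", "the", "a", "of", "in", "are", "there"}

def tokenize(text):
    """Lowercase word tokens; '$' is kept so QA-Tokens survive."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9$]+", text)]

def analyze_query(query):
    """Replace the question word pattern with QA-Tokens, then drop stop-words."""
    for template, word_pattern, qa_tokens in PATTERN_FILE:
        if template.search(query):
            query = word_pattern.sub(" ".join(qa_tokens), query, count=1)
            break
    return [t for t in tokenize(query) if t not in STOP_WORDS]

def annotate(sentence):
    """Tokens of the sentence plus the QA-Tokens of any text pattern it matches."""
    tokens = tokenize(sentence)
    for pattern, qa_token in TEXT_PATTERNS:
        if pattern.search(sentence):
            tokens.append(qa_token.lower())
    return tokens

def best_windows(sentences, query_tokens, window_size=1, top_n=1):
    """Slide a window of sentences over the text and score each by token matches."""
    annotated = [annotate(s) for s in sentences]
    hits = []
    for start in range(max(1, len(sentences) - window_size + 1)):
        window_tokens = set()
        for toks in annotated[start:start + window_size]:
            window_tokens.update(toks)
        score = sum(1 for t in set(query_tokens) if t in window_tokens)
        hits.append((score, start, " ".join(sentences[start:start + window_size])))
    hits.sort(key=lambda h: (-h[0], h[1]))  # best score first, earliest window on ties
    return hits[:top_n]

SENTENCES = [
    "The Eiffel Tower was completed for a world fair.",
    "The Eiffel Tower stands in Paris.",
    "Paris has about 2 million residents.",
]

where_query = analyze_query("Where is the Eiffel Tower?")
where_hits = best_windows(SENTENCES, where_query)
```

Because the question word is replaced by a QA-Token that also appears in annotated text, the sentence that actually states a location outscores a sentence that merely repeats the query keywords.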