The field of automatic retrieval of information from a natural language text corpus has in the past been focused on the retrieval of documents matching one or more key words given in a user query. As an example, most conventional search engines on the Internet use Boolean search for matches with the key words given by the user. Such key words are conventionally considered to be indicative of topics, and the task of a standard information retrieval system has been seen as matching a user topic with document topics. Due to the immense size of the text corpora searched by information retrieval systems today, such as the entire text corpus available on the Internet, this type of search has become a very blunt tool for information retrieval. A search will most likely return an unwieldy number of documents, so the user must expend considerable effort to find the most relevant documents among those retrieved. Furthermore, due to the ambiguity of words and of the way they are used in a text, many of the documents retrieved will be irrelevant, which makes it even more difficult for the user to find the most relevant documents.
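The key word matching described above can be illustrated with a minimal sketch. It assumes a toy three-document corpus, whitespace tokenization, and AND semantics between key words; the names build_index and boolean_and are hypothetical and are not taken from any particular search engine.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def boolean_and(index, keywords):
    """Return the ids of documents that contain every key word verbatim."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return set()
    return set.intersection(*postings)


# Hypothetical corpus: "bank" is ambiguous between a financial and a river sense.
docs = {
    "d1": "the bank raised its interest rate",
    "d2": "we walked along the river bank",
    "d3": "interest in river fishing is growing",
}
index = build_index(docs)

# A query for "bank" matches d1 and d2, whichever sense the user intended.
hits = boolean_and(index, ["bank"])
```

Because the match is purely verbatim, the query for "bank" returns both documents regardless of the sense in which the word is used, which is the kind of irrelevant retrieval the paragraph above describes.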
The performance of an information retrieval system is usually measured in terms of its recall and its precision. In information retrieval, the technical term recall is defined as the ratio of the number of relevant documents retrieved for a given query to the total number of relevant documents for that query. Thus, recall measures the exhaustiveness of the search results. The technical term precision is defined as the ratio of the number of relevant documents retrieved for a given query to the total number of documents retrieved. Thus, precision measures the quality of the search results. Because search methods of the above type retrieve so many documents, it has been realized within the art that there is a need to reduce the retrieved documents to the most relevant ones. In other words, as the number of documents in the text corpus increases, recall becomes less important and precision becomes more important. Suppliers of information retrieval systems have therefore enhanced Boolean search with relevance ranking metrics based on statistical methods. However, it is well known that documents ranked highly in this way still include irrelevant documents. This is because the matching is too coarse and does not take into account the context in which the matching words occur. In order to find the documents that are relevant to a user query, the information retrieval system needs in some way to understand the meaning of the natural language query and of the natural language text corpus from which the information is to be extracted.
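The definitions of recall and precision given above can be written out as a short sketch. The function names and the document identifiers are illustrative assumptions, and returning 0.0 when a denominator is empty is a convention chosen for this example only.

```python
def recall(retrieved, relevant):
    """Relevant documents retrieved divided by all relevant documents."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0  # convention assumed for this sketch
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved, relevant):
    """Relevant documents retrieved divided by all documents retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0  # convention assumed for this sketch
    return len(retrieved & relevant) / len(retrieved)


# Hypothetical query result: four documents retrieved, three relevant overall,
# two of which appear among the retrieved documents.
retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d2", "d4", "d7"]

r = recall(retrieved_ids, relevant_ids)      # 2 of 3 relevant documents found
p = precision(retrieved_ids, relevant_ids)   # 2 of 4 retrieved documents relevant
```

In this example recall is 2/3 and precision is 1/2. Retrieving more documents can only keep recall the same or raise it, but it tends to lower precision, which is why precision dominates as the corpus grows.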
There are proposals within the art for creating an information retrieval system that can find documents in a natural language text corpus that match a natural language query with respect to the semantic meaning of the query.
Some of these proposals relate to systems that have been extended with specific world knowledge within a given domain. Such systems are based on an extensive database of world knowledge within a single area. Creating and maintaining such databases of world knowledge is a well-known knowledge engineering bottleneck. Furthermore, such databases scale poorly, and a database within one domain cannot be ported to another domain. Thus, it would not be feasible to extend such a system to a general application for finding information in unrestricted text, which could relate to any domain.
Other proposals are based on underlying linguistic levels of semantic representation. In these proposals, instead of using verbatim matching of one or more key words, a semantic analysis of the natural language text corpus and the natural language query is performed, and documents are returned that match the semantic content of the query. However, creating a deep level semantic representation of very large natural language text corpora is a complex and demanding task. This is due to the multi-level representation of the text, the use of different analysis tools for different levels, and the propagation of errors from one level to another. Because representations at different levels are interdependent, and for the reasons given above, the resulting analyses will be fragile and error-prone.