An important part of scientific research is the process of screening relevant literature. Due to the wealth of available information, however, screening literature is very time-consuming. Moreover, the volume of literature is constantly growing as new research results are published. For example, the study of protein-protein interactions, a key aspect of the field of proteomics, has gained a large amount of interest over the past decade. During a typical target validation phase of drug development, for example, scientists try to identify all of the interaction partners of a potential drug target in order to understand the effects and possible side effects of a proposed new drug. A search of Medline, the major public repository of scientific literature, for a single protein will typically return references to hundreds or thousands of documents. All of these documents must be screened in order to locate the desired information. If the query is refined in order to return fewer results, important documents can easily be overlooked.
As a further example, the future profitability of many businesses relies on decisions made today about where to invest for the future. One of the most important stages in such a decision-making process is the attempt to uncover existing intellectual property in a particular product space. Unfortunately, it is becoming more and more common for good products to fail because they infringe on existing intellectual property rights, resulting in wasted investment and derailed business strategies. Such failures do not result from a lack of available information. For example, the U.S. Patent and Trademark Office allows free access to a full-text copy of every patent issued since 1976. Rather, the failures result from the need to analyze too much information. Analyzing each patent and patent application that contains a few select keywords is extremely laborious, and relevant references can easily be missed. Missing relevant references can be extremely costly in the long run.
In another example, businesses are interested in the activities of their competition. Information about which competitor is currently developing a particular product can be invaluable in making strategic decisions. Often, this information is present in publicly available information sources, such as databases of research papers or patent applications. The key impediment to obtaining such information, however, is the difficulty of locating the relevant information within databases that necessarily contain many millions of records. Making the best strategic decisions is facilitated not necessarily by having the most information, but rather by having the relevant information.
In the information age, a general lack of information is a less common problem than the inability to locate relevant information within an oversupply of data. In response to a query consisting of a few search terms, information retrieval systems aim to produce a list of documents, usually ranked according to relevance. Such systems are typically quite unsophisticated, simply returning documents that contain the search terms, and therefore normally produce poor results. They are unable to determine whether the meaning of a search term matches the meaning of a term used in a document. This inability can severely decrease precision, the ratio of the number of relevant results returned to the total number of results returned. Moreover, even simple linguistic relationships, such as the use of synonyms, abbreviations or more general terms, are often not taken into account. Relevant documents that use such variant terms are therefore missed, yielding only modest recall, the ratio of the number of relevant results returned to the total number of relevant results available. Information retrieval systems commonly suffer from the problem that recall falls when precision is improved, and vice versa.
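The precision and recall metrics described above can be stated as two simple ratios. The sketch below, with hypothetical function names and example counts chosen purely for illustration, shows how the two metrics are computed and why improving one often degrades the other:

```python
def precision(relevant_returned, total_returned):
    # Precision: fraction of the returned results that are relevant.
    return relevant_returned / total_returned

def recall(relevant_returned, total_relevant):
    # Recall: fraction of all relevant documents that were returned.
    return relevant_returned / total_relevant

# Illustrative counts: a query returns 50 documents, 10 of which are
# relevant, out of 40 relevant documents in the whole collection.
print(precision(10, 50))  # 0.2
print(recall(10, 40))     # 0.25

# Refining the query to return only 10 documents, 5 of them relevant,
# raises precision but lowers recall, illustrating the trade-off.
print(precision(5, 10))   # 0.5
print(recall(5, 40))      # 0.125
```

The second pair of calls reflects the trade-off noted above: a narrower query returns a cleaner result list but overlooks most of the relevant documents.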
Information extraction systems can obtain more accurate results than can be achieved through simple word matching by analyzing the text of a document. Some information extraction systems rely on an analysis technique called shallow parsing, in which words in a text passage are assigned to syntactic categories such as noun, verb, adjective and so on. This categorization is then used as the basis for a variety of statistical analyses. Information extraction systems using categorization and statistical analysis usually provide better results than do word-matching systems when judging whether or not a particular document is relevant. Nevertheless, the precision of such information extraction systems remains insufficient for most non-trivial applications. For example, such statistical systems are unable to distinguish between statements that assert that a particular fact is true and statements that assert the opposite, that the fact is not true.
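Shallow parsing, as described above, assigns each word to a syntactic category without relating the words to one another. The toy tagger below, using a hypothetical hand-built lexicon rather than a real trained tagger, illustrates the idea and also the negation blind spot: an affirmative sentence and its negated form yield nearly identical category statistics.

```python
# Hypothetical toy lexicon; real shallow parsers use statistically
# trained taggers over large corpora.
LEXICON = {
    "the": "DET", "protein": "NOUN", "receptor": "NOUN",
    "binds": "VERB", "bind": "VERB", "does": "VERB", "not": "ADV",
}

def shallow_parse(sentence):
    # Assign each word a syntactic category; unknown words get "UNK".
    return [(w, LEXICON.get(w.lower(), "UNK")) for w in sentence.split()]

print(shallow_parse("the protein binds the receptor"))
# [('the', 'DET'), ('protein', 'NOUN'), ('binds', 'VERB'),
#  ('the', 'DET'), ('receptor', 'NOUN')]

# A purely statistical view of the categories cannot distinguish the
# assertion from its negation; both contain the same nouns and verbs.
print(shallow_parse("the protein does not bind the receptor"))
```

Because both sentences contain the same content words in the same categories, a system relying only on such category counts cannot tell that the second sentence asserts the opposite fact.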
Other information extraction systems rely on an analysis technique called deep parsing. Deep parsing involves a much more detailed analysis in which not only are the words in a sentence assigned to syntactic categories, but the relationships between the words are also identified. Nevertheless, information extraction using deep parsing has in the past yielded results not much better than those achievable using statistical methods.
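The word-to-word relationships that deep parsing identifies can be pictured as a set of dependency links between words. The sketch below hand-codes one such analysis for a single sentence (the relation names and the helper function are illustrative, not the output of any particular parser) to show the kind of structure deep parsing recovers beyond mere category labels:

```python
# A hand-built dependency analysis of "the protein binds the receptor".
# Each entry is (dependent, relation, head); real deep parsers derive
# such links automatically from a grammar.
dependencies = [
    ("the", "det", "protein"),
    ("protein", "subject", "binds"),
    ("the", "det", "receptor"),
    ("receptor", "object", "binds"),
]

def arguments_of(verb, deps):
    # Collect the grammatical arguments attached to a given verb.
    return {rel: dep for dep, rel, head in deps if head == verb}

print(arguments_of("binds", dependencies))
# {'subject': 'protein', 'object': 'receptor'}
```

Unlike the category counts of shallow parsing, this structure records which word acts on which, the information a system would need to distinguish "A binds B" from "B binds A" or from "A does not bind B".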
A method is sought for searching for text passages in text documents that provides increased precision and that overcomes the limitations of existing analysis techniques.