1. Field of the Invention
The present invention is directed to a system and method for searching and retrieving electronic documents in response to a query.
2. Description of Related Art
Electronic searching across large document corpora is one of the most broadly utilized applications on the Internet, and in the software industry in general. Regardless of whether the sources to be searched are a proprietary or open-standard database, a document index, or a hypertext collection, and regardless of whether the search platform is the Internet, an intranet, an extranet, a client-server environment, or a single computer, searching for a few matching texts out of countless candidate texts is a frequent need and an ongoing challenge for almost any application.
One fundamental search technique is the keyword-index search that revolves around an index of keywords from eligible target items. In this method, a user's inputted query is parsed into individual words (optionally being stripped of some inflected endings), whereupon the words are looked up in the index, which in turn, points to documents or items indexed by those words. Thus, the potentially intended search targets are retrieved. This sort of search service, in one form or another, is accessed countless times each day by many millions of computer and Internet users. It is, for example, built into database kits offered by companies such as Oracle® and IBM®, which are utilized by many of the Fortune® 1000 companies for internal data management; it is built into the standard help file utility on the Windows® operating system, which is used on most personal computers today; and it is the basis of the Internet search services provided by Lycos®, Yahoo®, and Google®, used by tens of millions of Internet users daily.
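The keyword-index lookup described above can be illustrated with a minimal Python sketch. It is not taken from any particular product; the function names, the sample documents, and the crude suffix stripper (standing in for the optional removal of inflected endings) are all assumptions made for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def strip_ending(word):
    """Hypothetical, very crude stand-in for stripping inflected endings."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(documents):
    """Map each (stripped) keyword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[strip_ending(word)].add(doc_id)
    return index

def search(index, query):
    """Return every document indexed under at least one query word."""
    hits = set()
    for word in tokenize(query):
        hits |= index.get(strip_ending(word), set())
    return hits

# Illustrative documents (assumed, not from any real collection).
documents = {
    1: "Mother-daughter matching sleeping gowns",
    2: "Adult-child coordinated night gown set",
    3: "Investment bank funds aircraft company",
}
index = build_index(documents)
print(search(index, "sleeping gown"))
print(search(index, "pajamas"))
```

In this sketch a query only reaches documents that share a literal keyword with it, so a query such as "pajamas" retrieves nothing even though documents 1 and 2 are plausibly relevant.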
Two main problems of keyword searches are (1) missing relevant documents (i.e. low recall), and (2) retrieving irrelevant ones (i.e. lack of precision). Most keyword searches do plenty of both. In particular, with respect to the first problem, the primary limitation of keyword searches is that, when viewed semantically, keyword searches can skip about 80% of the eligible documents because, in many instances, at least 80% of the relevant information will be indexed in entirely different words than words entered in the original query. Granted, for simple searches with very popular words, and where relevant information is plentiful, this is not much of a problem. But for longer queries, and searches where the relevant phrasing is hard to predict, results can be disappointing.
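For concreteness, recall and precision can be computed as in the following Python sketch. The document ids and counts are invented for illustration and are not measurements; the function name `precision_recall` is likewise an assumption.

```python
def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_hits = retrieved & relevant
    precision = len(true_hits) / len(retrieved) if retrieved else 0.0
    recall = len(true_hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical case: 10 relevant documents exist; the search returns 4
# documents, only 2 of which are relevant.
relevant = set(range(1, 11))
retrieved = {1, 2, 101, 102}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

In this made-up case the search misses 8 of the 10 relevant documents (low recall) while half of what it returns is irrelevant (low precision).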
Some of the questions that arise in this context are:
How can a search engine recognize where there are equivalent words for the query words, e.g. that “mother-daughter matching sleeping gowns” matches “adult-child coordinated night gown set”?
How can a search engine recognize that “hotel room with a view of the Golden Gate Bridge” matches “suite that provides a panorama of the entire Bay Area skyline” where the phrase “Bay Area skyline”, while not synonymous with “Golden Gate Bridge,” is nonetheless very strongly related to it?
The second main problem in keyword search is that, not only do keyword searches overlook relevant matching texts, they also erroneously match irrelevant texts, due largely to the fact that words can be used in different senses.
Examples of questions that arise in this context are:
How can a search engine recognize that “bank an aircraft in high wind” is NOT a match for “His investment bank funded an aircraft company whose high sales brought in a windfall profit,” even though the latter corresponds closely to the series of words in the query?
How can a search engine recognize that “Apple Slashes Price of Newest Macintosh” should match documents concerning personal computers and not the agriculture industry?
The common attempts at this problem revolve around various kinds of popularity ranking, e.g. with Google®, the most-linked-to content across the Web, and/or, with other search engines, the content that is most searched for or most clicked on in search results pages. However, the popularity is inferred, and there are a number of cases where popularity does not represent the intention of a particular user. Thus, this method, while it is guaranteed to work in a significant number of cases (those in which the user intends the most popular result), is also guaranteed not to work in all the other cases.
Attempts have been made to address the above-described problem of missed relevant documents, i.e. the low recall problem. Probably the most straightforward approach is to automatically add to the query terms that are considered to be synonyms of the original query terms. This is easily done by simple look-ups in a machine-readable thesaurus or “WordNet.” The most common synonyms are added automatically, and the search is conducted for the query words as well as the added synonym terms. Unfortunately, this approach encounters some very significant problems in that:
1. Words have many different senses;
2. Words have many synonyms in each sense;
3. Most synonyms themselves have other senses which are NOT equivalent to the original word.
In fact, a study discussed in G. Grefenstette: Explorations in Automatic Thesaurus Discovery, pg. 112, Boston (1994), found that thesaurus-based automatic query expansion, without any semantic discrimination such as word sense disambiguation, changed the accuracy of the search by anywhere from negative 17.7% to positive 10.4%. Thus, such simple expansion of search query terms sometimes made the accuracy of the search results worse, and never improved accuracy by more than approximately 10%.
This is understandable because, for example, the word “bank” can mean a financial institution, the edge of a river, the turning of an aircraft, the willingness to believe something (“you can bank on it!”), etc. Taking the third of these senses, the word “turn,” though it can be a valid synonym of “bank,” will also have other senses (such as in “it's your turn” or “the turn of the century”, etc.) which have nothing to do with any of the senses of “bank.” This means that automatically adding all the terms that are considered to be synonyms of every query term usually creates more irrelevant hits, not fewer. While the synonyms do give the benefit of enabling the search engine to find more relevant information, that effect is overshadowed by the creation of a mountain of additional, irrelevant search results. Thus, adding the synonyms turns out to make matters worse, not better.
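The effect of undiscriminating synonym expansion can be seen in a small Python sketch. The toy thesaurus, the documents, and the function names below are hypothetical; a real system would consult a machine-readable thesaurus rather than a hand-written dictionary.

```python
# Hypothetical toy thesaurus: synonyms are pooled across ALL senses of a word,
# with no word sense disambiguation.
THESAURUS = {
    "bank": ["turn", "shore", "lender"],
}

# Illustrative documents; only document 1 concerns banking an aircraft.
DOCUMENTS = {
    1: "the pilot began to turn the aircraft in high wind",
    2: "it is your turn to deal the cards",
    3: "the turn of the century saw new inventions",
    4: "the lender approved the loan",
}

def expand(query_words):
    """Append every listed synonym of every query word to the query."""
    expanded = list(query_words)
    for word in query_words:
        for synonym in THESAURUS.get(word, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded

def keyword_search(words):
    """Return ids of documents containing at least one of the words."""
    wanted = set(words)
    return {doc_id for doc_id, text in DOCUMENTS.items()
            if wanted & set(text.split())}

query = ["bank", "plane"]          # the user means banking an aircraft
plain_hits = keyword_search(query)
expanded_hits = keyword_search(expand(query))
print(plain_hits, expanded_hits)
```

With these assumed inputs the unexpanded query finds nothing, while the expanded query finds the one relevant document together with three irrelevant ones, which mirrors the trade-off described above.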
The irrelevant-result problem, or lack-of-precision problem, is practically the opposite, or the “converse,” of the recall problem in that, instead of missing a document that is relevant, the search engine includes results that are not actually relevant. This usually happens because, again, words can be used in variant senses, meaning that a document can satisfy the query perfectly when viewed from the perspective of a keyword-match rate, but the words in the target document may have been used in different senses from those in the query so that the document is irrelevant. Although this seems to be an “opposite” problem, it really derives from the same fundamental problem, which is the inability of keyword search engines to be cognizant of word senses.
Thus, the recall and precision problems of searches (missed candidates and irrelevant results, respectively) have important things in common: both problems are rooted in the failure to distinguish word senses, and the attempted solutions to both have, in at least some respects, left the user with a worse picture rather than a better one. Thus, there exists an unfulfilled need for a system that can address the problem of word sense disambiguation.
In order to appreciate how widespread, and how vexing, the problem of polysemy (multiple meanings) of words can be, consider the word senses for the word “space,” which include: Outer space (noun); Real estate “vacant space” (noun); Blank space on a paper such as for signature (noun); Blank space between letters in a sentence (noun); “space the fence posts farther apart, please” (verb); “space my appointments farther apart, please” (verb, temporal application); to go into a trance “he spaced out” (not in most lexicons); Industry niche “competitors in our space” (not in most lexicons). Other examples of common, highly polysemous words are: bank, break, call, dark, date, interest, love, mean, plane, play, stage, time, try, view, window, and thousands of other words.
Conventional methods of word sense disambiguation proposed in the art generally proceed along the following lines:
1. Manually sense-tag a corpus of texts (mark each word as to its canonical sense). One will use most of this data as the “training data” while saving a minority portion for the “testing data.”
2. Using the training data, for each sense of each word, extract contextual features (e.g. record which words are found frequently occurring next to, or in the same sentence as, or within n words distance of the target word).
3. Determine common patterns in the contextual features (e.g. apply any standard machine learning algorithm, whether that be neural nets, or case-based reasoning, or genetic classifiers, or other) to enable classification among several senses of a word, and validate the classifier on the testing data.
After the foregoing project is completed, then based on the determined patterns (or feature value-sets, or derived rules concerning them) of the classifier, new occurrences of words (given a surrounding context, i.e. the text before and/or after the word) can be assigned a guess, or a probability, of having certain senses, i.e. be classified according to their canonical sense. A considerable amount of research and debate has surrounded steps 2 and 3 of this process, and it is no doubt fruitful to investigate and optimize these phases. However, the conventional methods of word sense disambiguation proposed agree on Step 1. A large set of manually tagged training data is presumed in the vast majority of methods attempted in word sense disambiguation.
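The three conventional steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the tiny hand-tagged corpus is invented, the features are simply the other words in the same sentence, and a plain co-occurrence count stands in for "any standard machine learning algorithm."

```python
from collections import Counter, defaultdict

TARGET = "bank"
STOPWORDS = {"the", "a", "of", "at", "my", "its", "was", "we", "she",
             "from", "after", "along"}

# Step 1 (hypothetical data): manually sense-tagged occurrences of "bank".
TAGGED = [
    ("the bank approved my loan application", "finance"),
    ("she deposited cash at the bank", "finance"),
    ("the bank raised its interest rate", "finance"),
    ("we fished from the bank of the river", "river"),
    ("the river bank was muddy after rain", "river"),
    ("reeds grow along the bank of the stream", "river"),
    ("the bank approved a cash loan", "finance"),   # held out for testing
    ("the muddy bank of the river", "river"),       # held out for testing
]
train_data, test_data = TAGGED[:6], TAGGED[6:]

def features(sentence):
    """Step 2: context words found in the same sentence as the target."""
    return [w for w in sentence.split()
            if w != TARGET and w not in STOPWORDS]

def train(examples):
    """Count, per sense, how often each context word co-occurs with it."""
    counts = defaultdict(Counter)
    for sentence, sense in examples:
        counts[sense].update(features(sentence))
    return counts

def classify(counts, sentence):
    """Step 3: pick the sense whose training contexts overlap most."""
    scores = {sense: sum(c[w] for w in features(sentence))
              for sense, c in counts.items()}
    return max(scores, key=scores.get)

model = train(train_data)
correct = sum(classify(model, s) == sense for s, sense in test_data)
accuracy = correct / len(test_data)
print(accuracy)
```

Even in this toy form, everything depends on the manually tagged examples of Step 1; without them there is nothing for Steps 2 and 3 to learn from.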
Other related art includes Ramakrishnan, G., Prithviraj, B., Deppa, A., Bhattacharya, P., and Chakrabarti, S.: Soft Word Sense Disambiguation, Proceedings of the Second International WordNet Conference (GWN 2004), 291-298, Masaryk University Brno, Brno, Czech Republic, December (2003). Ramakrishnan has proposed a soft word sense disambiguation which does not “pin down” a single word sense for every query term. Another related art includes Agirre, E. and Martinez, D.: Exploring Automatic Word Sense Disambiguation With Decision Lists and the Web, Proceedings of the COLING 2000 Workshop on Semantic Annotation and Intelligent Content, Saarbrücken, Germany (2000). Agirre and Martinez have proposed employing a Bayesian network in automating the disambiguation task. Still another related art is Hearst, M.: Texttiling: Segmenting Text Into Multi-Paragraph Subtopic Passages, Computational Linguistics 23(1): 33-64 (1997). Yet another related art includes Beaulieu, M.: Experiments On Interfaces To Support Query Expansion, Journal of Documentation 53.1 (1997), pg. 8-9.