In recent years, the importance of the World Wide Web as a primary knowledge source has continually increased. Due to its wide availability and distributed structure, the Web allows a large population of users to express opinions on an unbounded range of topics, such as people, companies, organizations and products. The easiest method of finding the mentions of a subject of interest is to use a search engine. This approach may be feasible for subjects that occur relatively rarely, but it quickly becomes impractical for commonly used subject names. Furthermore, due to the inherent ambiguity of natural language, many names and other query terms have several meanings. Thus, the challenge in searching a large, heterogeneous corpus of data like the Web is not only to find all occurrences of the subject, but also to select only those occurrences that have the desired meaning.
For example, consider the Ford Explorer™ SUV. It is of potentially significant commercial value for Ford Motor Company to track what people are saying about its product on the Web. To do so, it is first necessary to collect a large number of Web pages that refer to the product name. Popular Web pages may refer to the Ford Explorer colloquially as “Explorer,” and pages of this sort may be of particular interest to the manufacturer. Simply searching for the term “explorer” is problematic, however, since the term is both frequent and highly ambiguous. A Google™ search for “explorer” yields over 13 million hits, which include Internet Explorer, MSN Explorer, Mars Explorer, MedExplorer, and many more. Clearly, even a highly motivated user will not be able to process these results effectively without further automated filtering.
Various methods are known in the art for refining search results and eliminating irrelevant search hits. For example, word sense disambiguation (WSD) attempts to determine the different possible senses of relevant words in a text of interest, and then to assign each occurrence of a word to the appropriate sense. Methods of WSD are surveyed by Ide and Veronis in “Word Sense Disambiguation: The State of the Art,” Computational Linguistics 24:1 (1998), pages 1-40, which is incorporated herein by reference. The specific problem of disambiguating proper names, such as Explorer, is addressed by Wacholder et al., in “Disambiguation of Proper Names in Text,” Fifth Conference on Applied Natural Language Processing (1997), pages 202-208, which is also incorporated herein by reference.
As noted by Ide and Veronis, disambiguation is typically based on two major sources of information: the context of the word to be disambiguated, and external knowledge sources, such as dictionaries. For example, U.S. Pat. No. 5,541,836, to Church et al., whose disclosure is incorporated herein by reference, describes apparatus and methods for word disambiguation, based on determining whether a word/sense pair is proper for a context. Wide contexts (100 words) are used for both training and testing, and testing is done by adding the weights of vocabulary words from the context. This patent also discloses training techniques, including training using categories from Roget's Thesaurus.
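By way of illustration only, the general idea of scoring a word/sense pair by adding the weights of vocabulary words found in a wide context may be sketched as follows. The sketch is not taken from the cited patent: the sense labels, the weight table `SENSE_WEIGHTS`, the helper names, and the choice to center the 100-word window on the ambiguous word are all assumptions made for the example, and a practical system would learn the weights from training data.

```python
import re
from typing import Dict, List

# Hypothetical per-sense word weights (assumed values for illustration;
# a real system would derive them from a training corpus).
SENSE_WEIGHTS: Dict[str, Dict[str, float]] = {
    "vehicle": {"ford": 2.0, "suv": 1.5, "engine": 1.0, "tires": 1.0},
    "browser": {"internet": 2.0, "microsoft": 1.5, "download": 1.0, "web": 0.5},
}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def context_window(tokens: List[str], index: int, width: int = 100) -> List[str]:
    """Return up to `width` tokens around position `index`, excluding the
    token at `index` itself (half before, half after; an assumption)."""
    half = width // 2
    start = max(0, index - half)
    end = min(len(tokens), index + half + 1)
    return tokens[start:index] + tokens[index + 1:end]


def score_senses(tokens: List[str], index: int, width: int = 100) -> Dict[str, float]:
    """Score each sense by summing the weights of the context words."""
    context = context_window(tokens, index, width)
    return {
        sense: sum(weights.get(tok, 0.0) for tok in context)
        for sense, weights in SENSE_WEIGHTS.items()
    }


def disambiguate(text: str, target: str) -> str:
    """Pick the highest-scoring sense for the first occurrence of `target`.
    Raises ValueError if `target` does not occur in the text."""
    tokens = tokenize(text)
    index = tokens.index(target)
    scores = score_senses(tokens, index)
    return max(scores, key=scores.get)
```

For instance, under these assumed weights, `disambiguate("The new Ford Explorer SUV has a larger engine", "explorer")` selects the `"vehicle"` sense, because only the vehicle-related weights contribute to the sum.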
Another method for enhancing search accuracy is query refinement, which adds terms to the original query provided by the user in order to give more precise search results. For example, Mitra et al. describe a method for adding query terms by blind feedback, without user input, in “Improving Automatic Query Expansion,” Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (1998), pages 206-214, which is incorporated herein by reference.
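The principle of blind feedback may be illustrated by the following minimal sketch. It is a generic simplification and not the method of Mitra et al.: documents are ranked by a simple count of query-term occurrences, the top-ranked documents are assumed to be relevant without any user input, and their most frequent new terms are appended to the query. The function names, the stopword list, and the parameters `top_docs` and `new_terms` are hypothetical.

```python
import re
from collections import Counter
from typing import List

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}


def tokenize(text: str) -> List[str]:
    """Lower-case, split into alphanumeric tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def rank(query_terms: List[str], docs: List[str]) -> List[str]:
    """Rank documents by how often they contain the query terms.
    Documents with no query term are dropped; ties keep input order."""
    scored = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        scored.append((sum(counts[t] for t in query_terms), doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored if score > 0]


def expand_query(query: str, docs: List[str],
                 top_docs: int = 2, new_terms: int = 2) -> List[str]:
    """Append the most frequent non-query terms of the top-ranked documents."""
    query_terms = tokenize(query)
    feedback = rank(query_terms, docs)[:top_docs]  # assumed relevant, unverified
    counts: Counter = Counter()
    for doc in feedback:
        counts.update(t for t in tokenize(doc) if t not in query_terms)
    # Sort by descending frequency, then alphabetically, for a stable result.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return query_terms + [term for term, _ in ranked[:new_terms]]
```

Because the top-ranked documents are taken as relevant without verification, the added terms are only as good as the initial ranking; for an ambiguous query, the expansion may reinforce an unintended meaning.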
Focused Web crawling can be used as an adjunct to keyword searching, in order to find groups of Web pages that are connected by hyperlinks and are therefore likely to be related to a common domain. This sort of “goal-directed” crawling is described, for example, by Chakrabarti et al., in “Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery,” Computer Networks 31 (1999), pages 1623-1640, which is incorporated herein by reference. The focused crawler attempts to selectively seek out pages that are relevant to a predefined set of topics, which are typically specified using exemplary documents.
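A goal-directed crawl of this kind may be sketched, under simplifying assumptions, as a best-first traversal of a link graph. The sketch below is not the system of Chakrabarti et al., which relies on a trained classifier; here relevance is approximated by the fraction of terms from the exemplary documents that appear on a page, and the Web is represented by a hypothetical in-memory dictionary `web` mapping each URL to its text and outgoing links, so that no network access is needed. The names `focused_crawl`, `relevance`, `threshold` and `max_pages` are likewise assumptions.

```python
import heapq
import re
from typing import Dict, List, Set, Tuple

Page = Tuple[str, List[str]]  # (page text, outgoing links)


def tokenize(text: str) -> Set[str]:
    """Return the set of lower-cased alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(text: str, topic_terms: Set[str]) -> float:
    """Fraction of topic terms that occur in the page text."""
    if not topic_terms:
        return 0.0
    return len(tokenize(text) & topic_terms) / len(topic_terms)


def focused_crawl(web: Dict[str, Page], seeds: List[str], examples: List[str],
                  threshold: float = 0.3, max_pages: int = 10) -> List[str]:
    """Visit pages best-first and follow links only from on-topic pages."""
    topic_terms: Set[str] = set()
    for example in examples:
        topic_terms |= tokenize(example)

    # Heap entries are (-priority, url); seeds start with top priority.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant: List[str] = []
    fetched = 0

    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue  # dangling link in the toy graph
        fetched += 1
        text, links = web[url]
        score = relevance(text, topic_terms)
        if score < threshold:
            continue  # off-topic page: do not expand its links
        relevant.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the relevance of the page that cites it.
                heapq.heappush(frontier, (-score, link))
    return relevant
```

In this sketch, links found on an off-topic page are never followed, which is what keeps the crawl within a neighborhood of pages related to the exemplary documents.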
Other, related methods for document search, disambiguation and classification are described, for example, in U.S. Pat. Nos. 5,371,807; 5,873,056; and 6,038,560, whose disclosures are incorporated herein by reference.