Internet search engines have become fundamental tools for nearly all users seeking information and sites on the World Wide Web (WWW). Users can find vast amounts of data and select the data that appears to best match specific search criteria. Free-text searches are generally performed by providing a search phrase including one or more keywords, and optionally Boolean operators. The most widely used free-text search engines currently are provided by Google, Inc. and Yahoo, Inc.
Based on the search phrase provided by a user, a search engine generally returns a list of documents from which the user selects those that appear most relevant. For each document, the list typically includes a snippet containing one or more of the keywords, and the URL of the document. Typically, the search engine presents the list of documents in descending order according to general, static criteria established by the search engine provider. Numerous techniques have been developed for ranking the list in order to provide the results most likely to be relevant to a typical user. Some of these techniques take into account the order of the keywords provided by the user.
Such static ranking systems often present high-ranking results that do not match the interests or skills of the searcher, or that do not correctly reflect the intended meaning of keywords having more than one meaning. For example, a software engineer looking for Java (i.e., software) and a traveler looking for Java (i.e., the island) receive the same results for a query that includes the same keywords, even though their searches had different intended meanings.
In an attempt to increase the relevancy of search results, some search engines suggest search refinement options based on the search keywords entered by the searcher. These search engines typically analyze previous searches conducted by other users, in order to identify refinement options that are related to the keywords entered by the searcher. The searcher is able to narrow his search to better express his search intent by selecting one or more of the refinement options. For example, Google Suggest, provided by Google, Inc., displays a drop-down list of additional related search phrases as the searcher enters a search query in a search text box. The Clusty search engine, provided by Vivisimo, Inc., groups similar results together into clusters. Some search engines, such as Google, upon detecting a potential misspelling of search keywords, present a replacement search query including correctly spelled replacement keywords.
U.S. Pat. No. 6,636,848 to Aridor et al., which is incorporated herein by reference, describes a method for searching a corpus of documents, such as the World Wide Web, including defining a knowledge domain and identifying a set of reference documents in the corpus pertinent to the domain. Upon inputting a query, the corpus is searched using the set of reference documents to find one or more of the documents in the corpus that contain information in the domain relevant to the query. The set of reference documents is updated with the found documents that are most relevant to the domain. The updated set is used in searching the corpus for information in the domain relevant to subsequent queries.
U.S. Pat. No. 4,823,306 to Barbic et al., which is incorporated herein by reference, describes a method for searching for library documents that match the content of a given sequence of query words. A set of equivalent words are defined for each query word along with a corresponding word equivalence value assigned to each equivalent word. Target sequences of words in a library document which match the sequence of query words are located according to a set of matching criteria. The similarity value of each target sequence is evaluated as a function of the corresponding equivalence values of words included therein. Based upon the similarity values of its target sequences, a relevance factor is then obtained for each library document.
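The general idea of scoring word sequences through equivalence values can be illustrated with a minimal sketch. The sketch is not taken from the Barbic patent and rests on several assumptions: a target sequence is taken to be a contiguous window of the same length as the query, its similarity value is the product of the equivalence values, and the relevance factor is the sum of the similarity values. The function names and the sample equivalence table are hypothetical.

```python
def sequence_similarities(query, document_words, equivalents):
    """Return a similarity value for each window of the document that matches the query.

    Assumption: a window matches only if every word in it is listed as an
    equivalent of the query word at the same position.
    """
    n = len(query)
    similarities = []
    for start in range(len(document_words) - n + 1):
        window = document_words[start:start + n]
        similarity = 1.0
        for query_word, doc_word in zip(query, window):
            value = equivalents.get(query_word, {}).get(doc_word)
            if value is None:
                similarity = None  # window does not match the query sequence
                break
            similarity *= value  # assumption: combine equivalence values by product
        if similarity is not None:
            similarities.append(similarity)
    return similarities


def relevance_factor(query, document_words, equivalents):
    """Assumption: the relevance factor is the sum of the similarity values."""
    return sum(sequence_similarities(query, document_words, equivalents))


# Hypothetical equivalence table: each query word maps to its equivalent words
# and the equivalence value assigned to each.
equivalents = {
    "fast": {"fast": 1.0, "quick": 0.8},
    "car": {"car": 1.0, "automobile": 0.9},
}
query = ["fast", "car"]
document = "a quick automobile and a fast car".split()

similarities = sequence_similarities(query, document, equivalents)
relevance = relevance_factor(query, document, equivalents)
```

In this sample, the windows "quick automobile" and "fast car" both match, and the exact match contributes the larger similarity value.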
U.S. Pat. No. 5,987,457 to Ballard, which is incorporated herein by reference, describes a method in which a user views search results and subjectively determines if a document is desirable or undesirable. Only documents categorized by the user are analyzed for deriving a list of prospective keywords. The frequency of occurrence of each word of each document is derived. Keywords that occur only in desirable documents are good keywords. Keywords that occur only in undesirable documents are bad keywords. Keywords that occur in both types are dirty keywords. The best keywords are the good keywords with the highest frequency of occurrence. The worst keywords are the bad keywords with the highest frequency of occurrence. A new query phrase includes the highest ranked good keywords and performs filtering using the highest ranked bad keywords. Key phrases are derived to clean dirty keywords into good key phrases. A key phrase also is derived from a good keyword and replaces the good keyword to narrow a search.
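The good, bad, and dirty keyword categories described above can be sketched as follows. This is an illustrative sketch only, not the implementation of the Ballard patent: whitespace tokenization, the tie-breaking rule, and all function and variable names are assumptions, and the derivation of key phrases is omitted.

```python
from collections import Counter


def classify_keywords(desirable_docs, undesirable_docs):
    """Split words into good, bad, and dirty keywords with frequency counts."""
    # Assumption: words are obtained by lowercasing and splitting on whitespace.
    desirable_counts = Counter(w for doc in desirable_docs for w in doc.lower().split())
    undesirable_counts = Counter(w for doc in undesirable_docs for w in doc.lower().split())

    # Dirty keywords occur in both desirable and undesirable documents.
    dirty = set(desirable_counts) & set(undesirable_counts)
    good = {w: c for w, c in desirable_counts.items() if w not in dirty}
    bad = {w: c for w, c in undesirable_counts.items() if w not in dirty}
    return good, bad, dirty


def top_keywords(counts, n):
    """Return the n most frequent keywords; ties are broken alphabetically (an assumption)."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


# Hypothetical documents categorized by a user searching for Java software.
desirable = ["java compiler bytecode", "java virtual machine bytecode"]
undesirable = ["java island travel", "java island volcano"]

good, bad, dirty = classify_keywords(desirable, undesirable)
best = top_keywords(good, 1)   # candidates to include in a new query phrase
worst = top_keywords(bad, 1)   # candidates to use for filtering
```

Here "java" occurs in both categories and is therefore dirty, which is the situation in which the described method would derive a key phrase rather than rely on the keyword alone.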
US Patent Application Publication 2005/0076003 to DuBose et al., which is incorporated herein by reference, describes a process for sorting results returned in response to a search query according to learned associations between one or more prior search query search terms and selected results of said prior search queries.
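One simple way to picture sorting by learned associations is a table that counts how often a result was selected after a query containing a given term. The sketch below is an assumption-laden illustration and not the process of the DuBose publication: the counting scheme, the additive score, and the class and method names are all hypothetical.

```python
from collections import defaultdict


class ClickAssociations:
    """Hypothetical store of associations between query terms and selected results."""

    def __init__(self):
        # (term, result) -> number of times the result was selected for that term
        self.counts = defaultdict(int)

    def record_selection(self, query, selected_result):
        """Learn an association from a prior query and the result the user selected."""
        for term in query.lower().split():
            self.counts[(term, selected_result)] += 1

    def score(self, query, result):
        """Assumption: the score is the sum of the association counts over the query terms."""
        return sum(self.counts.get((term, result), 0) for term in query.lower().split())

    def rerank(self, query, results):
        """Sort results by descending score; equal scores keep their original order."""
        return sorted(results, key=lambda result: -self.score(query, result))


associations = ClickAssociations()
associations.record_selection("java tutorial", "doc_b")
associations.record_selection("java download", "doc_b")
associations.record_selection("java island", "doc_c")

reranked = associations.rerank("java", ["doc_a", "doc_b", "doc_c"])
```

Because Python's sort is stable, results with no learned association retain the order in which the underlying search returned them.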
U.S. Pat. No. 6,732,088 to Glance, which is incorporated herein by reference, describes techniques for facilitating searching a data collection, such as the WWW, that take advantage of the collective ability of all users to create queries to the data collection. First, a node-link graph of all queries submitted to a data collection within a given period of time is constructed. In the case of the WWW, the queries would be to a particular search engine. In the graph, each node is a query. There is a link made between two nodes whenever the two queries are judged to be related. A first key idea is that the determination of relatedness depends on the documents returned by the queries, not on the actual terms in the queries themselves. For example, a criterion for relatedness could be that of the top ten documents returned for each query, the two lists have at least one document in common. A second key idea is that the construction of the query graph transforms single user usage of the data collection (e.g., search) into collaborative usage. As a result, all users can tap into the knowledge base of queries submitted by others, because each of the related queries represents the knowledge of the user who submitted the query.
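The example relatedness criterion above, in which two queries are linked when their top-ten result lists share at least one document, can be sketched as a small graph construction. The function name, the parameters `top_n` and `min_shared`, and the sample data are hypothetical, and the sketch is not the implementation described in the Glance patent.

```python
from itertools import combinations


def build_query_graph(results_by_query, top_n=10, min_shared=1):
    """Build an undirected graph whose nodes are queries.

    Two queries are linked when their top_n result lists share at least
    min_shared documents; the query terms themselves are not compared.
    """
    graph = {query: set() for query in results_by_query}
    for first, second in combinations(results_by_query, 2):
        shared = set(results_by_query[first][:top_n]) & set(results_by_query[second][:top_n])
        if len(shared) >= min_shared:
            graph[first].add(second)
            graph[second].add(first)
    return graph


# Hypothetical ranked result lists for three queries.
results = {
    "java tutorial": ["doc1", "doc2", "doc3"],
    "learn java programming": ["doc3", "doc4"],
    "java island hotels": ["doc7", "doc8"],
}

graph = build_query_graph(results)
related = graph["java tutorial"]
```

In this sample, "java tutorial" and "learn java programming" are linked through the shared document, while "java island hotels" stays unlinked even though it shares the term "java" with both.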
U.S. Pat. No. 6,513,036 to Fruensgaard et al., which is incorporated herein by reference, describes techniques for searching and presenting electronic information from one or more information sources where the retrieval and presentation of information depends on context representations defined for a user performing the search, other users being similar to the user performing the search, and references to information. The context representation of each object affects/influences all the other objects with which it is in contact during the search process. This is described as ensuring a dynamic update of the relations between the objects and their properties.
US Patent Application Publication 2002/0133483 to Klenk et al., which is incorporated herein by reference, describes a system for automatically determining a characterizing strength which indicates how well a text in a database describes a search query. The system comprises a database storing a plurality of m texts, a search engine for processing the search query in order to identify those k texts from the plurality of m texts that match the search query. The system further comprises a calculation engine for calculating the characterizing strengths of each of the k texts that match the search query. The characterizing strength is calculated by creating a graph with nodes and links, whereby words of the text are represented by nodes and the relationship between words is represented by means of the links; evolving the graph according to a pre-defined set of rules; determining the neighborhood of the word, whereby the neighborhood comprises those nodes that are connected through one or a few links to the word; and calculating the characterizing strength based on the topological structure of the neighborhood.
U.S. Pat. No. 5,926,812 to Hilsenrath et al., which is incorporated herein by reference, describes a method for comparing the contents of two sets of documents, including extracting from a set of documents corresponding sets of document extract entries. The method further includes generating from the sets of document extract entries corresponding sets of word clusters. Each word cluster comprises a cluster word list having N words, an N×N total distance matrix, and an N×N number of connections matrix. The preferred embodiment includes grouping similar word clusters and combining the similar word clusters to form a single word cluster for each group. The grouping comprises evaluating a measure of cluster similarity between two word clusters, and placing them in a common group of similar word clusters if the measure of similarity exceeds a predetermined value. Evaluating the cluster similarity comprises intersecting clusters to form subclusters and calculating a function of the subclusters. In the preferred embodiment, the method is implemented in a system to automatically identify database documents which are of interest to a given user or users. In this implementation, the method comprises automatically deriving the first set of documents from a local data storage device, such as a user's hard disk. The method also comprises deriving the second set of documents from a second data storage device, such as a network machine. These techniques are described as providing fast and accurate searching to identify documents of interest to a particular user or users without any need for the user or users to specify what search criteria to use.
U.S. Pat. No. 6,772,150 to Whitman et al., which is incorporated herein by reference, describes a search engine system that uses information about historical query submissions to a search engine to suggest previously-submitted, related search phrases to users. The related search phrases are preferably suggested based on a most recent set of query submission data (e.g., the last two weeks of submissions), and thus strongly reflect the current searching patterns or interests of users.
U.S. Pat. No. 6,289,353 to Hazlehurst et al., which is incorporated herein by reference, describes an intelligent Query Engine system that automatically develops multiple information spaces in which different types of real-world objects (e.g., documents, users, products) can be represented. Machine learning techniques are used to facilitate automated emergence of information spaces in which objects are represented as vectors of real numbers. The system then delivers information to users based upon similarity measures applied to the representation of the objects in these information spaces. The system simultaneously classifies documents, users, products, and other objects. Documents are managed by collators that act as classifiers of overlapping portions of the database of documents. Collators evolve to meet the demands for information delivery expressed by user feedback. Liaisons act on the behalf of users to elicit information from the population of collators. This information is then presented to users upon logging into the system via Internet or another communication channel. Mites handle incoming documents from multiple information sources (e.g., in-house editorial staff, third-party news feeds, large databases, and WWW spiders) and feed documents to those collators which provide a good fit for the new documents.
US Patent Application Publication 2003/0123443 to Anwar, which is incorporated herein by reference, describes a search engine that utilizes both record based data and user activity data to develop, update, and refine ranking protocols, and to identify words and phrases that give rise to search ambiguity so that the engine can interact with the user to better respond to user queries and enhance data acquisition from databases, intranets, and internets.
The following patents, patent application publications, and other publications, all of which are incorporated herein by reference, may be of interest:
US Patent Application Publication 2005/0055341 to Haahr et al.
U.S. Pat. No. 5,987,457 to Ballard
U.S. Pat. No. 6,363,379 to Jacobson et al.
U.S. Pat. No. 6,347,313 to Ma et al.
U.S. Pat. No. 6,321,226 to Garber et al.
U.S. Pat. No. 6,189,002 to Roitblat
U.S. Pat. No. 6,167,397 to Jacobson et al.
U.S. Pat. No. 5,864,845 to Voorhees et al.
U.S. Pat. No. 5,825,943 to DeVito et al.
US Patent Application Publication 2005/0144158 to Capper et al.
US Patent Application Publication 2005/0114324 to Mayer
U.S. Pat. No. 5,857,179 to Vaithyanathan et al.
U.S. Pat. No. 7,139,755 to Hammond
U.S. Pat. No. 7,152,061 to Curtis et al.
U.S. Pat. No. 6,904,588 to Reddy et al.
U.S. Pat. No. 6,842,906 to Bowman-Amuha
U.S. Pat. No. 6,539,396 to Bowman-Amuha
US Patent Application Publication 2004/0249809 to Ramani et al.
US Patent Application Publication 2003/0058277 to Bowman-Amuha
U.S. Pat. No. 6,925,460 to Kummamuru et al.
U.S. Pat. No. 6,920,448 to Kincaid et al.
US Patent Application Publication 2006/0074883 to Teevan et al.
US Patent Application Publication 2006/0059134 to Palmon et al.
US Patent Application Publication 2006/0047643 to Chaman
US Patent Application Publication 2005/0216434 to Haveliwala et al.
US Patent Application Publication 2003/0061206 to Qian
US Patent Application Publication 2002/0073088 to Beckmann et al.