This specification includes as an appendix a C++ listing of a spelling comparison function used to compare two character strings. The contents of the appendix are subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of the patent document or portions thereof as it appears in the files or records of the U.S. Patent and Trademark Office, but otherwise reserves all rights whatsoever.
The present invention relates to information searching and retrieval, and more specifically, relates to methods for processing search queries.
Many World Wide Web sites and online services provide search engine programs (“search engines”) for assisting users in locating items of interest from a domain of items. For example, Web sites such as AltaVista™ and Infoseek™ provide search engines for assisting users in locating other Web sites, and online services such as Lexis™ and Westlaw™ implement search engines for assisting users in locating articles and court opinions. In addition, online merchants commonly provide search engines for assisting customers in locating items from an online catalog.
To perform a search using a search engine, a user submits a query containing one or more search terms. The query may also explicitly or implicitly identify a record field to be searched, such as the title, author or subject classification of the item. For example, a user of an online bookstore site may submit a query containing terms that the user believes appear in the title of a book. A query server program of the search engine processes the query to identify any items that match the query. The set of items identified by the query server program is referred to as the “query result,” and is commonly presented to the user as a list of the located items. In the bookstore example, the query result would typically be the set of book titles that include all of the search terms, and would commonly be presented to the user as a hypertextual listing of these items.
When the user of a search engine misspells a search term within a query, such as by mistyping or failing to remember the term, the misspelled term commonly will not match any of the database terms that are encompassed by the search. In this event, many search engines will simply return a null (empty) search result. Presenting null search results to users, however, can cause significant user frustration. To reduce this problem, some search engines effectively ignore the non-matching term(s) during the search. This strategy has the disadvantage of failing to take into account potentially important information specified by the user, and tends to produce query results that contain relatively large numbers of irrelevant items.
The present invention addresses the foregoing problems by providing a system and method for correcting misspelled terms within search queries. The system includes a database of correlation data that indicates correlations between search terms. The correlation data is preferably based on the frequencies with which specific search terms have historically appeared together within the same query, and is preferably generated from a query log file. In one embodiment, each entry within the database (implemented as a table) comprises a keyword and a “related terms” list, wherein the related terms list is composed of the terms that have appeared in combination with the keyword with the highest degree of frequency.
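By way of illustration only, the following Python sketch shows one way a keyword-to-related-terms table of the kind described above might be derived from logged queries. The function name `build_correlation_table`, the `max_related` cutoff, the whitespace tokenization, and the sample queries are all assumptions made for this example and are not taken from the specification or its appendix.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_correlation_table(queries, max_related=3):
    """Map each keyword to the terms that most often co-occurred with it.

    `queries` is an iterable of query strings; `max_related` (hypothetical)
    caps the length of each related terms list.
    """
    counts = defaultdict(Counter)
    for query in queries:
        # Treat each query as a set of lowercase terms (assumed tokenization).
        terms = set(query.lower().split())
        for keyword, other in permutations(terms, 2):
            counts[keyword][other] += 1

    table = {}
    for keyword, co_occurrences in counts.items():
        # Highest frequency first; ties broken alphabetically for determinism.
        ranked = sorted(co_occurrences.items(), key=lambda kv: (-kv[1], kv[0]))
        table[keyword] = [term for term, _ in ranked[:max_related]]
    return table


# Made-up sample log entries, for illustration only.
query_log = [
    "mark twain huckleberry",
    "mark twain",
    "twain huckleberry finn",
    "outdoor trail hiking",
]
table = build_correlation_table(query_log)
```

In this sketch, the entry for the keyword "twain" lists "huckleberry" and "mark" ahead of "finn", because the first two terms each appeared with "twain" in two of the sample queries and "finn" in only one.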
The spelling correction method is preferably invoked when a search query is submitted that includes at least one matching term and at least one non-matching term. Using the correlation database, a list of terms that are deemed to be related to the matching term or terms is initially generated. This may be accomplished, for example, by extracting the related terms list for each matching term, and if the query includes multiple matching terms, combining these lists into a single related terms list.
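The list-combining step may be sketched as follows. This is a minimal illustration, assuming the correlation data is available as a Python dictionary that maps each keyword to its related terms list; the name `related_terms_for_query` and the sample table are hypothetical, and the combination shown (a simple order-preserving union) is only one of several possible ways to merge the lists.

```python
def related_terms_for_query(matching_terms, correlation_table):
    """Combine the related terms lists of all matching terms into one list.

    Duplicates are dropped and first-seen order is preserved. Keywords
    missing from the table contribute nothing.
    """
    combined = []
    seen = set()
    for term in matching_terms:
        for related in correlation_table.get(term, []):
            if related not in seen:
                seen.add(related)
                combined.append(related)
    return combined


# Hypothetical correlation data for two matching terms.
sample_table = {
    "mark": ["twain", "huckleberry"],
    "twain": ["mark", "huckleberry", "finn"],
}
candidates = related_terms_for_query(["mark", "twain"], sample_table)
```

Here `candidates` holds each related term once, in the order first encountered.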
The related terms are then compared in spelling to the non-matching term(s) to identify any suitable replacements. The spelling comparisons are preferably performed using an anagram-type spelling comparison function which generates a score that indicates the degree of similarity between two character strings. If a related term with a sufficiently similar spelling to a non-matching term is found, the non-matching term is preferably automatically replaced with the related term. The user may alternatively be prompted to select the replacement term(s) from a list of terms. Once the non-matching term or terms have been replaced, the modified query is used to perform the search. The user is also preferably notified of the modification(s) made to the query.
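For illustration, an order-insensitive spelling comparison and the resulting replacement decision may be sketched as shown below. This is not the spelling comparison function of the appendix; it is a simplified stand-in that scores two strings by the characters they share regardless of position. The names `anagram_similarity` and `correct_term` and the threshold value of 0.8 are assumptions chosen for the example.

```python
from collections import Counter


def anagram_similarity(a, b):
    """Score in [0, 1]: characters shared (ignoring order) over the longer length."""
    if not a or not b:
        return 0.0
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / max(len(a), len(b))


def correct_term(non_matching, related_terms, threshold=0.8):
    """Return the most similar related term, or None if none scores high enough.

    `threshold` is a hypothetical cutoff for "sufficiently similar".
    """
    best_term, best_score = None, 0.0
    for candidate in related_terms:
        score = anagram_similarity(non_matching, candidate)
        if score > best_score:
            best_term, best_score = candidate, score
    return best_term if best_score >= threshold else None


# A misspelled term compared against a hypothetical related terms list.
replacement = correct_term("huckelbery", ["mark", "huckleberry", "finn"])
```

In this example the misspelled term "huckelbery" shares ten of the eleven characters of "huckleberry", so that related term is selected, while "mark" and "finn" fall well below the cutoff. When no related term reaches the cutoff, the function returns `None` and the non-matching term would be left unreplaced.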
An important benefit of the above-described spelling correction method over conventional spelling correction methods is that the selected replacement terms are considerably more likely to be the terms that were intended by the user. This benefit results from the above-described use of search term correlation data, and particularly correlation data that reflects historical query submissions. The method thereby increases the likelihood that the query result will contain items that are of interest to the user. Another benefit is that the method is well suited for correcting terms that do not appear in the dictionary, such as proper names of authors and artists and fanciful terms within titles and product names.
In accordance with another aspect of the invention, the correlation data is preferably generated such that it heavily reflects recent query submissions, and thus strongly reflects the current preferences of users. This may be accomplished, for example, by periodically generating a correlation table from a desired number (e.g., 12) of the most recent daily query logs. Using correlation data that heavily reflects recent query submissions further increases the likelihood that replacements made by the spelling correction process will be those intended by users.
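The windowing over recent daily logs may be sketched as follows, assuming each daily query log is available as a list of query strings keyed by its date. The function name `recent_queries`, the dictionary layout, and the sample dates and queries are hypothetical and serve only to illustrate the selection of the most recent logs.

```python
from datetime import date


def recent_queries(daily_logs, num_days=12):
    """Return the queries from the `num_days` most recent daily logs.

    `daily_logs` maps a date to the list of queries logged on that day.
    """
    recent_days = sorted(daily_logs, reverse=True)[:num_days]
    queries = []
    for day in sorted(recent_days):  # oldest to newest within the window
        queries.extend(daily_logs[day])
    return queries


# Made-up daily logs, for illustration only.
daily_logs = {
    date(2000, 1, 1): ["old query"],
    date(2000, 1, 2): ["mark twain"],
    date(2000, 1, 3): ["twain huckleberry finn"],
}
window = recent_queries(daily_logs, num_days=2)
```

Regenerating the correlation data periodically from only the queries in such a window causes older submissions to drop out of the data as new daily logs arrive.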
One aspect of the invention is thus a computer-implemented method of predicting a correct spelling of a non-matching term in a multiple-term search query. The method comprises identifying a plurality of additional terms that are related to at least one matching term within the search query. A spelling of the non-matching term is compared to the spellings of the additional terms to determine whether any of the additional terms is sufficiently similar in spelling to the non-matching term to be deemed a candidate correctly-spelled replacement term for the non-matching term. Both matching and non-matching terms within the multiple-term search query are thus used to predict spelling corrections.
The additional terms are preferably identified using information about how frequently specific terms have appeared together (co-occurred) within the prior search queries of users. Other measures of search term correlation or “relatedness” may alternatively be used. For instance, the correlations between terms may be determined based on how frequently specific terms occur together within database records, or within particular fields of such records.