For most users, a search of a database for documents related to a particular topic begins with the formulation of a search query for use by a search engine. The search engine then identifies documents that match the specifications that the user sets forth in the search query. These documents are then presented to the user, usually in an order that attempts to approximate the extent to which the documents match the specifications of the search query.
In its simplest form, the search query might be no more than a word or a phrase. However, such simple search queries typically result in the retrieval of far too many documents, many of which are likely to be irrelevant. To avoid this, search engines provide a mechanism for narrowing the search, typically by allowing the user to specify some Boolean combination of words and phrases. More complex search queries allow a user to specify that two Boolean combinations be found within a particular distance, usually measured in words, from each other. Known search queries can also provide wildcard characters or mechanisms for including or excluding certain word variants.
Regardless of its complexity, a search query is fundamentally no more than a user's best guess as to the distribution of alphanumeric characters that is likely to occur in a document containing the information of interest. The success of a search query thus depends on the user's skill in formulating the search query and in the predictability of the documents in the database. Hence, a search query of this type is likely to be most successful when the documents in the database are either inherently structured or under editorial control. Because of the necessity for thorough editorial review, such databases tend to be either somewhat specialized (for example, databases for patent searching or searching case law) or slow to change (for example, CD-ROM encyclopedias).
Because of its distributed nature, the internet offers a breadth of up-to-date information. However, documents posted on the internet are often posted with little editorial control. As a result, many documents are plagued with inconsistencies and errors that reduce the effectiveness of a search engine. In addition, because the internet has become an advertising medium, many sites seek to attract visitors. As a result, proprietors of those sites pepper their sites with invisible (to the reader) words, as bait for attracting the attention of search engines. The presence of such invisible words thwarts the search engine's attempt to judge the relevancy of a document solely by the distribution of words in the document.
The unreliability associated with many documents on the internet poses a difficult problem when a search engine attempts to rank the relevance of retrieved documents. Because all the search engine knows is the distribution of words, it can do no more than indicate that the distribution of words in a document does or does not match the search query more closely than the distribution of words in another document. This can result in such a prolixity of search results that it is impractical to examine them all. Moreover, because there is no absolute standard for relevance on the internet, there is no assurance that the most highly ranked document returned by a search engine is even relevant at all. It may simply be the least irrelevant document in a collection of irrelevant documents.
Attempts have been made to improve the searchability of the internet by having human editors assess the reliability and relevance of particular sites. Addresses to those sites meeting a threshold of reliability are then provided to the user. For example, major publishers of encyclopedias on CD-ROM provide pre-selected links to internet sites in order to augment the materials provided on the CD-ROM. However, these attempts are hampered by the fact that internet sites can change, both in content and in address, overnight. Thus, a reviewed site that may have existed on publication of the CD-ROM may no longer exist when a user subsequently attempts to activate that link.
It is apparent that the dynamic and free-form nature of the internet results in a highly diversified and current storehouse of reference materials. However, the uncontrolled nature of documents on the internet results in an environment that is not readily searchable in an efficient manner by a conventional search engine.
In accord with the method and apparatus of this invention, the relevance of documents retrieved by a search engine operating in an uncontrolled public database is considerably improved by also searching a controlled database, and by using the search results from the controlled database to assess the relevance of the documents retrieved from the public database.
The method of the invention includes the identification and ranking of a plurality of candidate documents on the basis of the similarity of each of the candidate documents to a user-query.
This method includes the step of parsing the user-query to generate both a list of one or more query-words and a distribution, within the user-query, of the query-words in that list. The user-query can be provided by the user or it can be an excerpt of text selected from a document referred to by the user.
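The parsing step can be illustrated with a minimal sketch. The function name `parse_query`, the stop-word list, and the tokenization rule are all assumptions made for illustration; the text above specifies only that a list of query-words and their distribution within the user-query are produced.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "and", "or", "to"}

def parse_query(user_query):
    """Return (query_words, distribution) for a free-text user-query.

    query_words  -- distinct query-words in order of first occurrence
    distribution -- how many times each query-word occurs in the query
    """
    tokens = re.findall(r"[a-z0-9]+", user_query.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    distribution = Counter(kept)
    # dict.fromkeys removes duplicates while preserving order.
    query_words = list(dict.fromkeys(kept))
    return query_words, distribution

words, dist = parse_query("The history of the steam engine and the steam turbine")
print(words)          # distinct query-words
print(dist["steam"])  # occurrences of "steam" within the query
```

The same function could be applied to an excerpt of text selected from a document, since it makes no assumption about where the query string came from.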
The importance of each query-word in the user-query is then assessed on the basis of the frequency with which the query-word occurs in a database of candidate documents. In an optional feature of the invention, the step of parsing the query includes the step of providing additional query-words, referred to as derivative query-words, which are associated with the original query-words provided by the user. These derivative query-words are accorded lesser importance in the identification of candidate documents than are original query-words.
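One way to picture this weighting is sketched below. The text does not give a formula, so the smoothed inverse-document-frequency expression, the `discount` factor of 0.5, and the mapping from each derivative query-word to its original are assumptions chosen only to show rarer words receiving more weight and derivative words receiving less than the original words they derive from.

```python
import math

def word_importance(query_words, documents, derivatives=None, discount=0.5):
    """Assign an importance weight to each query-word.

    documents   -- candidate documents as plain strings
    derivatives -- hypothetical mapping {derivative_word: original_word}
    discount    -- assumed factor (< 1) applied to derivative query-words
    """
    derivatives = derivatives or {}
    n_docs = len(documents)
    doc_word_sets = [set(d.lower().split()) for d in documents]

    weights = {}
    for word in query_words:
        # Number of candidate documents containing the word.
        doc_freq = sum(1 for words in doc_word_sets if word in words)
        # Assumed form: smoothed inverse document frequency.
        weights[word] = math.log((n_docs + 1) / (doc_freq + 1))

    # Derivative query-words inherit a reduced share of the original's weight.
    for derived, original in derivatives.items():
        weights[derived] = discount * weights[original]
    return weights

docs = ["steam engine design", "steam turbine blades", "engine oil", "garden plants"]
weights = word_importance(["steam", "turbine"], docs,
                          derivatives={"turbines": "turbine"})
print(weights)
```

In this toy collection "turbine" occurs in fewer documents than "steam" and so receives the larger weight, while the derivative "turbines" receives half the weight of "turbine".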
A candidate document that has clusters of query-words is intuitively of more relevance to a user-query than is a candidate document with isolated occurrences of query-words. The former is likely to contain a coherent discussion of the subject matter of the user-query whereas the latter may refer to the subject matter of the user-query only tangentially. In some cases, an isolated occurrence of a query-word may be no more than a typographical error.
The method of the invention exploits the importance of query-word clustering to the identification of candidate documents similar, or relevant, to a user-query by evaluating the similarity of a candidate document to the user-query on the basis of the distribution, or clustering, of query-words within the particular candidate document. In a preferred embodiment, the step of evaluating this measure of document similarity, referred to as a “document conductance,” includes the step of determining the concentration, or distribution, of query-words in the candidate document. A document in which there exist regions of high concentration, or clustering, of query-words is indicative of a document that is similar to the query. Such a candidate document is therefore assigned a document conductance indicative of greater similarity to the user-query than a candidate document having fewer such query-word clusters.
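The effect of clustering on a document-level score can be sketched as follows. This is not the formula for document conductance, which the passage above does not state; the reciprocal-gap scoring and the name `clustering_score` are assumptions chosen only to show closely spaced query-words scoring higher than isolated ones.

```python
import re

def clustering_score(text, query_words):
    """Score a document by how tightly its query-word occurrences cluster.

    Assumed scoring: each pair of consecutive query-word occurrences
    contributes the reciprocal of the number of word positions between
    them, so short gaps (clusters) contribute much more than long ones.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    wanted = set(query_words)
    hits = [i for i, token in enumerate(tokens) if token in wanted]
    return sum(1.0 / (b - a) for a, b in zip(hits, hits[1:]))

query = ["steam", "engine", "turbine"]
clustered = "the steam engine drives a steam turbine in the plant"
scattered = ("steam rose from the kettle while we talked "
             "for hours about the old engine")

print(clustering_score(clustered, query))
print(clustering_score(scattered, query))
```

Under this assumed scoring, a document with a single isolated query-word occurrence scores zero, which matches the intuition that such an occurrence may be tangential or even a typographical error.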
Having evaluated the similarity of a large number of candidate documents to the user-query, the method of the invention now proceeds with an evaluation of the distribution, or clustering, of the query-words in the individual sentences that make up the candidate document. The similarity of a particular sentence to the user-query depends upon the concentration of query-words in a particular sentence.
In one preferred embodiment, the similarity of a particular sentence is measured by a quantity that is responsive to, or depends upon, the ratio of the overall concentration of the query-word in the plurality of candidate documents to the concentration of the query-word in the sentence. Where there are several query-words, this quantity, which is referred to as the “position-independent sentence similarity,” is summed over all query-words occurring in the particular sentence.
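A sentence-level quantity of this kind might be sketched as below. The passage says only that the quantity depends on the ratio of overall concentration to sentence concentration; taking the negative logarithm of that ratio is an assumption made here so that a sentence dense in otherwise rare query-words scores higher. The helper names are hypothetical.

```python
import math
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def sentence_similarity(sentence, query_words, candidate_documents):
    """Sum a per-word term over the query-words occurring in the sentence.

    Per-word term (assumed form): -log(overall / local), where
      overall = concentration of the word across all candidate documents
      local   = concentration of the word within the sentence
    """
    sentence_tokens = tokenize(sentence)
    corpus_tokens = [t for doc in candidate_documents for t in tokenize(doc)]
    if not sentence_tokens or not corpus_tokens:
        return 0.0

    score = 0.0
    for word in set(query_words):
        in_sentence = sentence_tokens.count(word)
        in_corpus = corpus_tokens.count(word)
        if in_sentence == 0 or in_corpus == 0:
            continue  # only query-words present in the sentence contribute
        overall = in_corpus / len(corpus_tokens)
        local = in_sentence / len(sentence_tokens)
        score += -math.log(overall / local)
    return score

docs = [
    "The steam turbine spins fast. Plants need water and light to grow well.",
    "Engine oil should be changed regularly.",
]
query = ["steam", "turbine"]
on_topic = sentence_similarity("The steam turbine spins fast.", query, docs)
off_topic = sentence_similarity("Plants need water and light to grow well.", query, docs)
print(on_topic, off_topic)
```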
The location, within a document, of a sentence containing one or more query-words is potentially indicative of the importance or relevance of that document. In particular, if a sentence having one or more query-words is located near the beginning of the document, that sentence may be part of an introduction that sets forth, using a relatively small number of words, the general subject matter of the document. Conversely, if a similar sentence is located near the end of the document, it may be part of a concluding section that recapitulates the main points of the document.
In an optional feature of the invention, documents containing such content-rich text are identified by assigning a quantity to the sentences making up the candidate document that depends on the position of the sentence within the document. This quantity, referred to as the “position-dependent sentence similarity,” is obtained by weighting the contribution made by each sentence to the calculation of the position-independent sentence similarity by a quantity that depends on the position of the particular sentence within the document.
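The position weighting can be sketched as follows. The shape of the weighting function is not given in the text, so the U-shaped `position_weight` (largest at the first and last sentences, smallest in the middle) and the `edge_boost` parameter are assumptions; the per-sentence scores are taken as already computed.

```python
def position_weight(index, n_sentences, edge_boost=1.0):
    """Assumed U-shaped weight: 1 + edge_boost at either end, 1 in the middle."""
    if n_sentences < 2:
        return 1.0 + edge_boost
    relative = index / (n_sentences - 1)   # 0.0 = first sentence, 1.0 = last
    return 1.0 + edge_boost * abs(2.0 * relative - 1.0)

def position_dependent_similarity(sentence_scores, edge_boost=1.0):
    """Weight each sentence's position-independent score by its position."""
    n = len(sentence_scores)
    return sum(position_weight(i, n, edge_boost) * score
               for i, score in enumerate(sentence_scores))

# Same sentence score, placed in the introduction versus mid-document.
intro_hit = position_dependent_similarity([2.0, 0.0, 0.0, 0.0, 0.0])
middle_hit = position_dependent_similarity([0.0, 0.0, 2.0, 0.0, 0.0])
print(intro_hit, middle_hit)
```

With these assumed weights, a matching sentence in the introduction or conclusion contributes twice as much as the same sentence in the middle of the document.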
Candidate documents derived from a public database such as the internet are often not subject to stringent editorial review. Thus, in searching such a public database for candidate documents similar, or relevant, to a user-query, it is advantageous to provide an authoritative database to use as a standard against which the similarity of candidate documents from the public database is assessed. Such a database typically includes a multiplicity of reference materials published only after having been subjected to editorial scrutiny.
In one method according to the invention, candidate documents are identified in both an authoritative database and in a public database. In this method, the foregoing steps are applied to both candidate documents from the authoritative database and candidate documents from the public database. The resulting search results include documents from both the public database and the authoritative database.
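Applying the same steps to both databases and returning a combined result might look like the sketch below. The function `rank_combined`, the source labels, and the stand-in scoring function `overlap` are hypothetical; any of the similarity measures described above could be passed in as `score`.

```python
def rank_combined(authoritative_docs, public_docs, score):
    """Score documents from both databases identically and merge the results.

    Returns a list of (score, source, document) tuples, best match first.
    """
    scored = [(score(doc), "authoritative", doc) for doc in authoritative_docs]
    scored += [(score(doc), "public", doc) for doc in public_docs]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored

# Stand-in similarity: count of query-word occurrences (an assumption).
query = {"steam", "turbine"}

def overlap(doc):
    return sum(1 for token in doc.lower().split() if token in query)

results = rank_combined(
    authoritative_docs=["steam turbine steam basics"],
    public_docs=["buy steam cleaner", "turbine steam"],
    score=overlap,
)
for doc_score, source, doc in results:
    print(doc_score, source, doc)
```

Keeping the source label alongside each result lets a caller distinguish editorially reviewed documents from public ones in the merged list.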