The present invention generally relates to data processing. The invention relates more specifically to retrieving a document from among several electronic documents based on information not derived from the literal content of the document.
Hypertext systems now enjoy wide use. One particular hypertext system, the World Wide Web (“Web”), provides global access over public packet-switched networks to a large number of hypertext documents. The Web has grown to contain a staggering number of documents, and the number continues to increase. So many documents are available through the Web that practical use of the Web almost always requires a search engine or similar search service.
Certain search engines, however, have limited utility because the search results they produce list documents that are not genuinely related to the search query. One reason that search engines return such poor-quality results is that the search engines are easy to deceive. The search engines use “spider” programs that “crawl” to Web servers around the world, locate documents, index the documents, and follow hyperlinks to other documents. The index may comprise a list of all words encountered by the “spider” in all the documents, in which each word in the list is associated with a reference to each of the documents that contains that word. Unfortunately, the “spiders” cannot discriminate between documents that genuinely use a particular word and documents that contain the word but are really about something else.
For example, a Web document that contains sexually-oriented or pornographic material may also contain one or more words that are unrelated to the sexual material, but are intended to cause the document to be indexed by search engines under those words, thereby luring unsuspecting users to the document. A pornographic document that contains a decoy word intended to lure male viewers, such as “CORVETTE,” followed by sexual material, would be indexed by a search engine under the word “CORVETTE”. The decoy words may be embedded in invisible metatags or rendered in white characters on a white background, so as to be invisible when the document is displayed by the browser. This practice is called “spamming” a search engine or an indexing system. Likewise, if the decoy word were “Bambi,” searchers who submit a query to the search engine or indexing system seeking information about the motion picture “Bambi” would receive the pornographic page in the search results. This is undesirable and has led to criticism of the utility of search engines and indexing systems.
As a result, the search results returned by the search engine often contain references to documents that are totally unrelated, in terms of genuine content, to the scope of a search query. In the World Wide Web context, search engines that suffer from this problem include the Yahoo!® Web site, the Excite® Web site, the Infoseek® Web site, and others.
Accordingly, in this field there is a need for a system or mechanism that can eliminate extraneous references from search engine search results.
There is a particular need for a system or mechanism that can combat “spamming” of an indexing system or search engine system.
There is also a need for a mechanism that can associate words, search terms, or editorial matter, other than words appearing in the content of a document, with the document in an index.
There is a particular need for such a system that can carry out a search for a document based on words, search terms, or editorial matter other than the literal content of a group of documents.
The foregoing needs, and other needs and objects that will become apparent from the following disclosure, are fulfilled by the present invention, which comprises, in one aspect, a method of selecting electronic documents from among a plurality of electronic documents, the method comprising the steps of storing a tag word in an index in association with information identifying an electronic document, in which the tag word comprises data that is not derived from content of the electronic document; receiving a search query; modifying the search query to create a modified search query by adding to the search query a search term that references the tag word; and creating a set of search results by searching the index based on the modified search query.
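The four steps of this aspect (storing a tag word, receiving a query, modifying the query, and searching the index) can be illustrated with a minimal sketch. It assumes a toy in-memory inverted index with AND semantics; the class `TagIndex`, the function `modify_query`, the `tag:` prefix, and the example URLs are hypothetical names chosen for illustration and are not taken from the disclosure.

```python
from collections import defaultdict


class TagIndex:
    """Toy inverted index mapping each term to a set of document identifiers."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add_document(self, doc_id, text):
        # Ordinary indexing: every word found in the content points to the document.
        for word in text.lower().split():
            self.postings[word].add(doc_id)

    def add_tag(self, doc_id, tag_word):
        # The tag word is supplied from outside; it is not derived from the content.
        self.postings[tag_word].add(doc_id)

    def search(self, terms):
        # AND semantics (an assumption): a document must match every term.
        sets = [self.postings.get(t, set()) for t in terms]
        return set.intersection(*sets) if sets else set()


def modify_query(terms, tag_word):
    """Create the modified query by adding a term that references the tag word."""
    return list(terms) + [tag_word]


index = TagIndex()
index.add_document("http://cars.example/corvette.html", "corvette engine specifications")
index.add_document("http://spam.example/page.html", "corvette unrelated decoy material")
index.add_tag("http://cars.example/corvette.html", "tag:reviewed")

original = ["corvette"]
plain_results = index.search(original)
tagged_results = index.search(modify_query(original, "tag:reviewed"))
```

In this sketch the unmodified query matches both documents, while the modified query matches only the document that was given the tag word. A real system would also need to prevent document authors from placing the tag word in their own content; the sketch does not address that.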
One feature of this aspect is that the step of storing includes the steps of receiving data that indicates one or more tag words and criteria to be used to determine which of the plurality of documents should be associated with each of the one or more tag words; and storing, in the index, information associating each of the one or more tag words with the documents in the index that satisfy the criteria associated with the tag words. Another feature is that the step of storing includes the steps of receiving data that indicates one or more tag words and criteria to be used to determine which of the plurality of documents should be associated with each of the one or more tag words, and in which at least a portion of the data is expressed in a wildcard format; retrieving a location identifier of each of the documents that are indexed in the index; matching each location identifier to each of the criteria; and when one location identifier matches one of the criteria, storing, in the index, information associating such location identifier with one or more of the tag words.
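As one possible illustration of matching location identifiers against criteria expressed in a wildcard format, the sketch below uses shell-style patterns from Python's standard `fnmatch` module. The disclosure does not say which wildcard rules apply, so the pattern syntax, the function `assign_tags`, and the example URLs and tag words are assumptions.

```python
from fnmatch import fnmatchcase


def assign_tags(location_ids, specifications):
    """Associate tag words with every location identifier that matches a pattern.

    specifications maps a wildcard pattern to the tag words it confers.
    """
    assignments = {}
    for url in location_ids:
        for pattern, tag_words in specifications.items():
            # fnmatchcase treats '*' as "any run of characters", including '/'.
            if fnmatchcase(url, pattern):
                assignments.setdefault(url, set()).update(tag_words)
    return assignments


specs = {
    "http://www.example.com/kids/*": ["tag_childsafe"],
    "http://*.adult.example/*": ["tag_restricted"],
}
indexed_urls = [
    "http://www.example.com/kids/bambi.html",
    "http://www.example.com/news/today.html",
    "http://pics.adult.example/page1.html",
]
tagged = assign_tags(indexed_urls, specs)
```

Note that `?` and `[` are also special characters in `fnmatch` patterns, so a specification containing a literal query string would need escaping under these assumed rules.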
In another feature, the step of storing includes the steps of receiving specifications of one or more of the documents that are indexed in the index, in which each of the specifications is associated with one or more tag words, and in which one of the specifications is expressed in a wildcard format; retrieving a location identifier of each of the documents that are indexed in the index; matching each location identifier to each of the specifications by interpreting the one of the specifications that is in the wildcard format according to one or more wildcard format rules; and when one location identifier matches one of the specifications, storing, in the index, information associating such location identifier with one or more of the tag words. In another feature, storing includes the steps of storing a hash value representing the tag word in a record of the index; and storing an indirect reference to information identifying one or more of the documents that contain the tag word.
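The last feature above, in which an index record holds a hash value for the tag word together with an indirect reference to the document information, might be laid out as in the following sketch. The use of SHA-256, the truncation to 16 hexadecimal digits, and the names `index_records`, `posting_lists`, `store_tag`, and `lookup` are all illustrative assumptions; the disclosure does not specify a hash function or record layout.

```python
import hashlib

posting_lists = []   # each entry is a list of document identifiers
index_records = {}   # hash value -> position in posting_lists (the indirect reference)


def tag_hash(tag_word):
    """Return a fixed-length hash value that stands in for the tag word."""
    return hashlib.sha256(tag_word.encode("utf-8")).hexdigest()[:16]


def store_tag(tag_word, doc_id):
    key = tag_hash(tag_word)
    if key not in index_records:
        # First use of this tag word: allocate a posting list and record its slot.
        posting_lists.append([])
        index_records[key] = len(posting_lists) - 1
    posting_lists[index_records[key]].append(doc_id)


def lookup(tag_word):
    """Follow the indirect reference from the hash record to the documents."""
    slot = index_records.get(tag_hash(tag_word))
    return [] if slot is None else posting_lists[slot]


store_tag("tag_restricted", "http://adult.example/page1.html")
store_tag("tag_restricted", "http://adult.example/page2.html")
store_tag("tag_childsafe", "http://www.example.com/kids/bambi.html")
```

Keeping only the hash in the record gives every record the same size regardless of the length of the tag word, and the indirection lets the list of documents grow without rewriting the record.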
Another aspect of the invention provides a method of restricting access to an electronic document that is stored among a plurality of documents, the method comprising the steps of storing a tag word in an index in association with information identifying the electronic document, in which the tag word indicates that access to the electronic document is restricted; receiving a search query that requests the electronic document; modifying the search query to create a modified search query by adding a search term that excludes from the modified search query all documents that contain the tag word; and creating a set of search results by searching the index based on the modified search query. One feature of this aspect is that the step of modifying comprises the step of modifying, automatically and using a software component of a browser, the search query to create a modified search query by adding a search term that excludes from the modified search query all documents that contain the tag word.
Another feature of this aspect is that the modified search query selects only those electronic documents that satisfy the original search query that also contain the tag word. A related feature is that the modified search query selects only those electronic documents that satisfy the original search query that do not contain the tag word.
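The two query behaviors described above, selecting only documents that contain the tag word and selecting only documents that do not, can be sketched as a single search routine with required and excluded terms. The function `run_query`, the document identifiers, and the tag word `tag_restricted` are hypothetical.

```python
def run_query(postings, required, excluded=()):
    """Return documents that match every required term and no excluded term."""
    if not required:
        return set()
    # set.intersection returns a new set, so the index itself is never mutated.
    results = set.intersection(*(postings.get(t, set()) for t in required))
    for term in excluded:
        results -= postings.get(term, set())
    return results


postings = {
    "bambi": {"doc_film", "doc_adult"},
    "tag_restricted": {"doc_adult"},
}

# Restricting access: the added term excludes documents that carry the tag word.
filtered = run_query(postings, ["bambi"], excluded=["tag_restricted"])

# The complementary case: the added term requires the tag word.
only_tagged = run_query(postings, ["bambi", "tag_restricted"])
```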
In another aspect, the invention provides a method of processing queries that select an electronic document from among a plurality of documents, the method comprising the steps of storing a tag word in an index in association with information identifying the electronic document, in which the tag word indicates that access to the electronic document is restricted; receiving a search query that requests the electronic document; modifying the search query to create a modified search query by adding a search term that references the tag word; and creating a set of search results by searching the index based on the modified search query.
One feature of this aspect is that the modifying step further comprises using a software component installed in a browser to perform the steps of intercepting each search query entered using the browser; and modifying the search query that is intercepted to create the modified search query by adding the search term that references the tag word. A related feature is that the step of storing includes the steps of receiving specifications of one or more of the documents that are indexed in the index, in which each of the specifications is associated with the tag word; and storing, in the index, information associating one or more of the documents that are indexed in the index with the tag word, according to the specifications.
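A browser component that intercepts each search query and adds a term might, under simple assumptions, rewrite the query string of the outgoing search URL. The sketch below assumes that the search service takes the query in a URL parameter named `q` and accepts a leading minus sign to exclude a term; neither convention comes from the disclosure, and `intercept` and the example host are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def intercept(url, extra_term, param="q"):
    """Return the URL with extra_term appended to the search-query parameter."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    rewritten = [
        (key, f"{value} {extra_term}".strip() if key == param else value)
        for key, value in pairs
    ]
    return urlunsplit(parts._replace(query=urlencode(rewritten)))


outgoing = "http://search.example/find?q=bambi&lang=en"
modified = intercept(outgoing, "-tag_restricted")
```

Because the rewriting happens before the request leaves the browser, the user's original query is left as typed and the search service sees only the modified query.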
Still another aspect of the invention involves a method of constructing an index of a plurality of electronic documents for use in selecting electronic documents from among the plurality of electronic documents, comprising the steps of receiving data that indicates one or more tag words and criteria to be used to determine which of the plurality of documents should be associated with each of the one or more tag words, wherein the tag words are not derived from content of the electronic documents; storing a list of words that are within one document of the plurality of documents; and storing, in the index, information associating each of the one or more tag words with the one document when the one document satisfies the criteria associated with the tag words.
According to yet another aspect, there is a method of constructing an index of a plurality of electronic documents for use in selecting electronic documents from among the plurality of electronic documents, comprising the steps of receiving data that indicates one or more document property values and criteria to be used to determine which of the plurality of documents should be associated with each of the one or more document property values, wherein the document property values are not derived from content of the electronic documents; storing a list of words that are within one document of the plurality of documents; and storing, in the index, information associating each of the one or more document property values with the one document when the one document satisfies the criteria associated with the document property values.
Another aspect of the invention is a method of selecting electronic documents from among a plurality of electronic documents, the method comprising the steps of storing a document property value in an index in association with information identifying an electronic document, in which the document property value comprises data that is not derived from content of the electronic document; receiving a search query; modifying the search query to create a modified search query by adding to the search query a search term that references the document property value; and creating a set of search results by searching the index based on the modified search query.
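The document property aspects can be sketched by keeping property values in a structure separate from the content words, so that a word appearing in a document can never be mistaken for a property value. This separation is a design assumption of the sketch rather than something the disclosure requires, and the names `index_document`, `search`, the property `audience`, and the example URLs are hypothetical.

```python
from collections import defaultdict

word_postings = defaultdict(set)       # content word -> document identifiers
property_postings = defaultdict(set)   # (property name, value) -> document identifiers


def index_document(doc_id, text, rules):
    """Store the document's words, then any property values whose criteria it meets."""
    for word in text.lower().split():
        word_postings[word].add(doc_id)
    for criterion, name_and_value in rules:
        if criterion(doc_id):
            property_postings[name_and_value].add(doc_id)


def search(words, properties=()):
    """Return documents matching every word and every (name, value) property."""
    sets = [word_postings.get(w, set()) for w in words]
    sets += [property_postings.get(p, set()) for p in properties]
    return set.intersection(*sets) if sets else set()


rules = [
    (lambda url: url.startswith("http://library.example/"), ("audience", "general")),
]
index_document("http://library.example/bambi.html", "bambi film history", rules)
index_document("http://spam.example/bambi.html", "bambi audience general decoy", rules)

unfiltered = search(["bambi"])
general_only = search(["bambi"], properties=[("audience", "general")])
```

The second document contains the words "audience" and "general" in its content, yet it does not match the property term, because property values are assigned only by the externally supplied criteria.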
The invention also encompasses a computer system, a computer-readable medium, and a computer data signal embodied in a carrier wave that are configured to carry out the foregoing steps.
The foregoing summary is not intended to describe or summarize all features or aspects of the invention, which are set forth fully in the following description and claims.