The present invention generally relates to data processing. The invention relates more specifically to identifying spoof documents among a large collection of electronic documents that are associated with, for example, an indexing system or search-and-retrieval system.
The Internet, often simply called "the Net," is a worldwide system of computer networks and, in a larger sense, the people using it. The Internet is a public, self-sustaining facility that is accessible to tens of millions of people worldwide. The most widely used part of the Internet is the World Wide Web, often abbreviated "WWW" or simply referred to as "the Web". The Web is an Internet service that organizes information through the use of hypermedia. The HyperText Markup Language ("HTML") is used to specify the contents and format of a hypermedia document (e.g., a Web page).
In this context, an HTML file is a file that contains the source code for a particular Web page. A Web page is the image that is displayed to a user when a particular HTML file is rendered by a browser application program. Unless specifically stated, an electronic or Web document may refer to either the source code for a particular Web page or the Web page itself.
Each page can contain embedded references to images, audio, or other Web documents. A user, using a Web browser, browses for information by following references, known as hyperlinks, that are embedded in each of the documents. The HyperText Transfer Protocol ("HTTP") is the protocol used to access a Web document.
Through the use of the Web, individuals have access to millions of pages of information. However, a significant drawback of the Web is that, because it has so little organization, it can at times be extremely difficult for users to locate the particular pages that contain the information that is of interest to them.
To address this problem, a mechanism known as a "search engine" has been developed to index a large number of Web pages and to provide an interface that can be used to search the indexed information by entering certain words or phrases to be queried. Indexes are conceptually similar to the normal indexes that are typically found at the end of a book, in that both kinds of indexes comprise an ordered list of information accompanied by the location of the information. Values in one or more columns of a table are stored in an index, which is maintained separately from the actual database table. An "index word set" of a document is the set of words that are mapped to the document in an index. For documents that are not indexed, the index word set is empty.
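The notions of an index and of an index word set can be illustrated with a minimal sketch. It assumes a simple in-memory inverted index keyed by lowercase words; the names `build_index` and `index_word_set` and the sample documents are hypothetical and are not taken from the described system.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def index_word_set(index, doc_id):
    """Return the set of words that the index maps to the given document."""
    return {word for word, ids in index.items() if doc_id in ids}


# Hypothetical sample documents, for illustration only.
docs = {
    "page1.html": "Dogs chase cats",
    "page2.html": "Cats sleep",
}
index = build_index(docs)
```

In this sketch, a document that was never indexed has an empty index word set, because no word in the index maps to it.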
Although there are many popular Internet search engines, they are generally constructed using the same three common parts. First, each search engine has at least one "spider" that "crawls" across the Internet to locate Web documents around the world. Upon locating a document, the spider stores the document's Uniform Resource Locator (URL), and follows any hyperlinks associated with the document to locate other Web documents. Second, each search engine contains an indexing mechanism that indexes certain information about the documents that were located by the spider. In general, index information is generated based on the contents of the HTML file. The indexing mechanism stores the index information in large databases that can typically hold an enormous amount of information. Third, each search engine provides a search tool that allows users to search the databases in order to locate specific documents that contain information that is of interest to them. To provide up-to-date information, the spiders continually crawl across the Internet to identify both new and updated documents for indexing. When a new or updated page is identified, the search engine makes corresponding updates to the database so as to continually provide up-to-date information.
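The spider's behavior of storing a URL and then following its hyperlinks can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: the link graph is a hypothetical in-memory dictionary rather than live HTTP fetches, and `crawl` and `fetch_links` are illustrative names.

```python
from collections import deque


def crawl(start_url, fetch_links):
    """Visit each reachable URL once, following hyperlinks breadth-first."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)  # the spider "stores" the document's URL
        for link in fetch_links(url):
            if link not in seen:  # avoid revisiting pages in link cycles
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph standing in for real Web documents.
LINKS = {
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/a"],
    "http://example.com/c": ["http://example.com/d"],
    "http://example.com/d": [],
}

visited = crawl("http://example.com/a", lambda url: LINKS.get(url, []))
```

A production spider would also need to re-crawl pages periodically to pick up updated documents, which this sketch omits.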
Electronic documents include both visible text portions and non-visible text portions. The visible text portions of an electronic document include the textual information that is contained in the document and that is displayed to a user when the electronic document is rendered using an application such as a Web browser. The non-visible text portions include the textual information that is contained in the electronic document but that is not displayed, and therefore is not visible to a user, when the document is rendered using an application such as a Web browser. For example, FIG. 1A illustrates an HTML file 100 that contains both visible text portions and non-visible text portions. The visible text portions include text data 108, which is displayed when HTML file 100 is rendered by a browser application such as Netscape Navigator or Microsoft Internet Explorer. In contrast, the non-visible text portions include title data 104 and comment data 106. Also depicted in FIG. 1A are HTML tags 102, which represent codes that are used by browser applications to determine what information is to be made visible and how the visible information is to be structured and formatted when displayed. Title data 104 and comment data 106, also referred to as metadata, include textual information, referred to herein as "metawords", that may be included in an HTML file but that is not displayed when the document is rendered by a browser application. For example, FIG. 1B illustrates Web page 110 as seen by rendering HTML file 100 through the use of a browser application. As depicted, upon rendering HTML file 100, the visible text portion (text data 108) is displayed in Web page 110 and is therefore visible to the user. In contrast, the non-visible text portions (title data 104 and comment data 106) are not displayed in Web page 110 and therefore are not visible to the user.
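The split between visible and non-visible text portions can be sketched with Python's standard `html.parser` module. The sketch assumes that title text and HTML comments (`<!-- ... -->`) are the only non-visible portions of interest and that all other text is visible; the class name `TextSplitter` and the sample markup are hypothetical and do not reproduce FIG. 1A.

```python
from html.parser import HTMLParser


class TextSplitter(HTMLParser):
    """Separate text a browser would display from title and comment text."""

    def __init__(self):
        super().__init__()
        self.visible = []
        self.non_visible = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return  # skip whitespace between tags
        # Simplification: everything outside <title> is treated as visible.
        (self.non_visible if self._in_title else self.visible).append(text)

    def handle_comment(self, data):
        text = data.strip()
        if text:
            self.non_visible.append(text)


# Hypothetical sample markup, for illustration only.
sample_html = """<html><head><title>Dog antics</title></head>
<body><!-- dog puppy tricks --><p>My dog learns a new trick.</p></body></html>"""

splitter = TextSplitter()
splitter.feed(sample_html)
splitter.close()
```

A fuller treatment would also exclude script and style content from the visible text, which this sketch does not attempt.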
Different search engines use different techniques to extract and index information contained on the Internet. For example, some search engines use indexing mechanisms that index every single word in each document, while others index only the first "N" words in each document.
Because certain non-visible portions of documents typically provide an accurate description of the visible contents of the document, many search engines index not only the visible text portion but also sections of the non-visible text portions. For example, the metadata associated with the tag <title> typically include title information that concisely and accurately describes the subject matter or contents of the particular document. Similarly, the metadata associated with the tag <comment> may include comment information that relates to the subject matter or contents of the particular document. An illustrative example is provided by title data 104 and comment data 106 of FIG. 1A. Thus, by indexing a document based on the metadata that is associated with certain tags contained therein, the document can be indexed in a way that accurately reflects its contents.
Because the results of a query search are highly dependent on the indexes that are used to process the query, it is critical that the indexes used in a search be as accurate as possible. Therefore, it is important that the indexing mechanisms index each document based on those words or terms that most accurately describe the contents of the document. However, certain Web marketers and site designers are motivated to have as many "hits" on their Web pages as possible. Thus, to increase the number of hits on a particular Web page, certain Web page developers have employed a technique known as "spamdexing" to cause numerous non-representative index entries to be generated for their Web pages.
In this context, the term spamdexing is defined as adding additional words or terms to a document in order to affect how the document is indexed or otherwise treated. Spamdexing may be performed by adding unrelated visible text to a document, and/or by adding non-visible metadata. Documents in which spamdexing has been applied are generally referred to as "spoof" documents.
Frequently, the added words or terms do not provide an accurate description of the contents or subject matter of the particular document, but are added to cause the document to be found by searches in relatively unrelated topics. Alternatively, the added words may accurately reflect the content of the document, but be added in a way that causes the document to be given a higher "ranking" than it would otherwise deserve.
One of the most common forms of spamdexing is commonly known as "word-stuffing", in which a particular word is embedded within a page dozens or even hundreds of times to ensure that the page always appears at or near the top of the list of search engine results for searches that contain that word. For example, a personal home page that describes the antics of somebody's pet dog may have embedded therein thousands of instances of the word "dog" to ensure that the page will be highly ranked in the results of all queries that include the word "dog".
Another form of spamdexing is commonly known as "bait-and-switch", in which a page is loaded with some popular search word such as "sex" or "free" or "prize" even though that particular word has nothing to do with the contents of the site.
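Why word-stuffing can be effective may be seen in a small sketch. It assumes a deliberately naive ranker that scores a page by the raw number of occurrences of the query word; this scoring rule is an assumption made for illustration and does not describe any particular search engine. The names `term_count_score` and `rank` and the sample pages are hypothetical.

```python
import re
from collections import Counter


def term_count_score(text, query_word):
    """Score a page by how many times the query word occurs in its text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words)[query_word.lower()]


def rank(pages, query_word):
    """Return page names ordered from highest to lowest score."""
    return sorted(
        pages,
        key=lambda name: term_count_score(pages[name], query_word),
        reverse=True,
    )


# Hypothetical pages: one honest, one stuffed with 200 extra copies of "dog".
honest = "My dog Rex likes to fetch. The dog park is nearby."
stuffed = honest + " dog" * 200

pages = {"honest": honest, "stuffed": stuffed}
ranking = rank(pages, "dog")
```

Under this naive scoring, the stuffed page outranks the honest page even though both carry the same actual content.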
Based on the foregoing, it is highly desirable to provide a mechanism that can detect spoof documents. It is also highly desirable to accurately index the contents of non-spoof documents.
The foregoing needs, and other needs and objects that will become apparent from the following description, are achieved in the present invention, which comprises, in one aspect, a method for indexing electronic documents that include one or more visible text portions and one or more non-visible text portions, comprising the computer-implemented steps of identifying an electronic document; selecting from the electronic document a set of words that are associated with a particular tag type, wherein the set of words is selected from words within the one or more non-visible text portions of the electronic document; comparing each word in the selected set of words with words in the one or more visible text portions of the electronic document; and determining an index word set for the electronic document based on matches between words in the selected set of words and words in the one or more visible text portions of the electronic document.
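The comparing and determining steps recited above can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes that the metawords for one tag type and the visible text have already been extracted as strings, that words are compared case-insensitively, and that the index word set simply keeps the metawords that also occur in the visible text. The names `tokenize` and `matched_metawords` are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def matched_metawords(metaword_text, visible_text):
    """Return the metawords that also occur in the visible text.

    Duplicates are dropped and first-seen order is preserved.
    """
    visible = set(tokenize(visible_text))
    matched = []
    for word in tokenize(metaword_text):
        if word in visible and word not in matched:
            matched.append(word)
    return matched


# Hypothetical title metawords and visible text, for illustration only.
title_words = "Dog tricks free prize"
visible_text = "My dog learns new tricks every week."

index_words = matched_metawords(title_words, visible_text)
```

Here the words "free" and "prize" are excluded because they never appear in the text that a user would actually see.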
One feature involves determining a plurality of selected tag types, the plurality of selected tag types including the particular tag type; and selecting from the electronic document multiple sets of words, wherein each set of the multiple sets of words is associated with one of the selected tag types.
Another feature involves performing the step of determining an index word set for the electronic document by determining a percentage of matches based on a particular number of matches that are found between words in said selected set of words and words in said one or more visible text portions of said electronic document; determining a minimum match percentage for the set of words; and if said percentage of matches is below said minimum match percentage, then generating the index word set based only on those words in said selected set of words for which there is a corresponding match in said one or more visible text portions of said electronic document.
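The minimum match percentage test can be sketched as shown below. Several details are assumptions made only for illustration: the percentage is computed over the unique words in the selected set, and when the percentage meets the minimum, the whole selected set is used (the text above specifies only the below-minimum case). The name `index_words_for_tag` and the sample values are hypothetical.

```python
def index_words_for_tag(metawords, visible_words, min_match_pct):
    """Return (index_words, match_percentage) for one set of metawords.

    Below the minimum percentage, only matched words are kept.
    Otherwise (an assumption), all metawords in the set are kept.
    """
    unique = list(dict.fromkeys(metawords))  # de-duplicate, keep order
    if not unique:
        return [], 0.0
    visible = set(visible_words)
    matched = [word for word in unique if word in visible]
    match_pct = 100.0 * len(matched) / len(unique)
    if match_pct < min_match_pct:
        return matched, match_pct
    return unique, match_pct


# Hypothetical inputs: 2 of the 4 metawords appear in the visible text.
metawords = ["dog", "tricks", "free", "prize"]
visible_words = ["my", "dog", "learns", "new", "tricks"]

strict_result = index_words_for_tag(metawords, visible_words, 60.0)
lenient_result = index_words_for_tag(metawords, visible_words, 40.0)
```

With a 60 percent minimum, the 50 percent match rate falls short and only the matched words are indexed; with a 40 percent minimum, the same set passes and, under the stated assumption, is indexed in full.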
According to another aspect, a method for indexing an electronic document is provided, the method comprising the computer-implemented steps of identifying a first portion of the electronic document that is not displayed when the document is rendered; identifying a second portion of the electronic document that is displayed when the document is rendered; performing a comparison between words from the first portion and words from the second portion; determining a set of words from said first portion to use to index said electronic document based on said comparison; and indexing said electronic document based on said set of words.
The invention also encompasses a computer-readable medium and an apparatus configured to carry out the foregoing steps.