1. Field of the Invention
The present invention relates to techniques for classifying documents, such as web pages. More specifically, the present invention relates to a method and an apparatus for classifying documents based on user inputs to facilitate subsequent queries involving the documents.
2. Related Art
Electronic commerce is a big business. The total volume of sales through the Internet in the United States is estimated to have reached nearly 70 billion dollars in 2004. A large portion of these sales resulted from search engine referrals. A search engine referral typically arises when a user enters keywords of interest into a search engine, and the search engine uses these keywords to search for and return “relevant” web pages to the user.
A large fraction of all searches are related to only a few commonly occurring topics, in particular, entertainment-related topics, such as computer games, movies and music. Unfortunately, “spam pages” are a significant problem for searches related to these commonly occurring topics. A large percentage (sometimes as much as 90%) of the web pages returned by search engines for these topics are spam pages, which exist only to misdirect traffic from search engines. These spam pages are purposely designed to mislead search engines by achieving high rankings during searches related to common topics. However, they are typically unrelated to the topics of interest, and they try to get the user to purchase various items, such as pornography, software, or financial services.
Spam pages are bad for search engine users because they make it hard for the users to retrieve the information that they need, resulting in a frustrating search experience. Furthermore, spam pages are bad for search engines because they consume valuable web-crawling time and distort web page rankings in search engine results.
Unfortunately, it is very hard to determine which pages are spam pages because spam pages are purposely designed to achieve high rankings. They are also typically designed to circumvent automatic techniques for detecting spam pages. Consequently, existing techniques for automatically detecting spam web pages are generally ineffective.
Hence, what is needed is a method and an apparatus for effectively determining whether a web page is a spam page.