The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Users of network-connected computers and mobile devices, such as personal digital assistants, may request information by formulating a search query and submitting the search query, for example, to an Internet search engine. Internet search engines are often used to search (i.e., query) the Internet for specific content that is of interest to the user. Queries are commonly made by entering keywords that relate to the specific interest of the user into a search field provided by the Internet search engine. In response to the submission of a query (i.e., performing a “search”), Internet search engines provide a list of search results, also referred to as “hits”, that typically contain hyperlinks (or simply links) referencing the desired web pages. Upon clicking on such links in a web browser, a user can navigate to the “landing” or destination pages pertaining to the issued query.
With the growth of the Internet and the World Wide Web, the corpus of searchable resources indexed (i.e., extracted) by Internet search engines has increased dramatically. As such, a given query may return an overwhelmingly long list of search results. In such instances, the Internet search engine may further order the search results based on a determined relevancy of each of the search results. For example, one common approach positions, or displays, the most relevant search results near the top of the search results page (SRP) to facilitate selection by a user. Accordingly, users tend to focus on these top results, often to the exclusion of results presented further down the page, thereby resulting in increased traffic to the top results.
Commercial web sites typically receive revenue from advertisers based on page views. Therefore, there exists an incentive to increase web traffic to the commercial web sites in order to increase advertising revenue for the web site operators. As a result, search engine optimizers (SEOs), services which are paid to increase the prominence or position of a subscriber's web page within search results for the purpose of increasing traffic and, thereby, revenue, often attempt to manipulate the ranking of the subscribers' web pages by artificial techniques which take advantage of certain known features used by search engines. One technique utilized by SEOs to promote target web pages involves creating a surplus of inlinks (incoming links) that “point” to (i.e., reference) the destination page, based on the assumption that web pages that are referenced more frequently are typically considered by Internet search engines as being of higher quality. In addition, the perceived relevancy of such web pages is also enhanced by inundating the “anchor texts” (i.e., words appearing within clickable links on web pages) associated with the hyperlink references with keywords that web users frequently issue in their search queries.
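By way of illustration only, the two signals described above (the number of inlinks referencing a page and the keywords appearing in the associated anchor texts) can be tallied as in the following minimal sketch. The function names, the tuple layout, and the example URLs are hypothetical and are not taken from any particular search engine; how an actual search engine weighs these signals is not specified here.

```python
from collections import Counter

def count_inlinks(links):
    """Count distinct inlinks per target page.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples
    (an assumed layout). Self-links and repeated links from the same source
    are counted only once or not at all.
    """
    distinct_pairs = {(src, dst) for src, dst, _ in links if src != dst}
    return Counter(dst for _, dst in distinct_pairs)

def anchor_keyword_counts(links, target):
    """Count the words used in anchor texts of links pointing at `target`."""
    words = Counter()
    for src, dst, anchor in links:
        if dst == target and src != dst:
            words.update(anchor.lower().split())
    return words

# Hypothetical example data.
example_links = [
    ("a.example", "target.example", "cheap flights"),
    ("b.example", "target.example", "cheap flights deals"),
    ("a.example", "target.example", "cheap flights"),
    ("target.example", "target.example", "home"),
]

inlinks = count_inlinks(example_links)
keywords = anchor_keyword_counts(example_links, "target.example")
```

A page whose inlink count and anchor-text keyword counts are inflated in this way may appear more frequently referenced and more relevant than its content warrants.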
Artificially promoted web pages are often of low quality, i.e., of low relevance with regard to user interest. Often termed “web spam”, artificially promoted web pages are intended to “trick” search engines into directing traffic to them, so that users are encouraged to navigate to these low-quality web pages, often for, but not limited to, commercial purposes. Artificially promoted web pages may include content ranging from adult content to commercial web pages of legitimate companies. Artificially promoted web pages will be referred to hereinafter as “undesirable web pages”.
Currently, procedures for detecting undesirable web pages rely heavily on human editors. The editors are employed to analyze resources indexed by a search engine in order to identify representative examples of various categories of undesirable web pages (e.g., web spam, adult content, etc.) among the indexed resources. Detection algorithms are then generated based on the results of the analysis and implemented for subsequent detection of similar undesirable resources. This practice of identifying and demoting undesirable resources is extremely complex as well as cumbersome, costly, and sometimes unreliable. Search engines strive to improve the detection performance of these algorithms (i.e., to reduce the occurrences of false positives and false negatives output by the algorithms).
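By way of illustration only, the detection performance referred to above can be quantified by comparing the output of a detection algorithm against labels assigned by human editors. The following is a minimal sketch; the function name, the boolean label encoding (True meaning “undesirable”), and the example values are assumptions made for illustration.

```python
def error_rates(predicted, actual):
    """Return false positive and false negative rates.

    `predicted` holds the detector's output and `actual` the editor-assigned
    labels, as parallel sequences of booleans (True = undesirable page).
    """
    pairs = list(zip(predicted, actual))
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    negatives = sum(1 for _, a in pairs if not a)
    positives = sum(1 for _, a in pairs if a)
    return {
        # Share of desirable pages wrongly flagged as undesirable.
        "false_positive_rate": false_pos / negatives if negatives else 0.0,
        # Share of undesirable pages the detector missed.
        "false_negative_rate": false_neg / positives if positives else 0.0,
    }

# Hypothetical example: four pages, one wrongly flagged and one missed.
rates = error_rates(
    predicted=[True, True, False, False],
    actual=[True, False, True, False],
)
```

A lower false positive rate means fewer desirable pages are wrongly demoted, while a lower false negative rate means fewer undesirable pages escape detection.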