This specification relates to the detection of boilerplate content.
Web browsers use resources such as HyperText Markup Language (HTML) or eXtensible Markup Language (XML) documents to generate and display formatted content. A single such resource may include different content elements that discuss (or are otherwise associated with) different topics. For instance, a first block-level element of an HTML document may include textual content that discusses the primary topic of the document, while a second block-level element of the same HTML document may include advertising content, navigation-related content, disclaimer-type content, or other content that is of a different topic than the primary topic.
Query recommendation engines apply various query recommendation techniques to resources in order to identify and suggest other resources that might be of interest to a user. One example query recommendation technique is illustrated in and described in reference to FIG. 4 of PCT App. No. PCT/CN2009/000875, filed Aug. 4, 2009, and titled “Query Suggestions From Documents,” the substance of which is incorporated herein by reference. In another example of a query recommendation technique, a query recommendation engine may parse the text from a resource, apply stop word filtering, and identify a likely topic associated with frequently occurring words. Furthermore, the query recommendation engine may reference a look-up table to identify resources associated with the identified topic, and may provide information that identifies the resources to a client device, for presentation to a user.
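As a rough illustration of the second technique described above (parse the text, filter stop words, infer a topic from frequent words, then consult a look-up table), a minimal Python sketch might look like the following. The stop word list, the `TOPIC_KEYWORDS` mapping, the `TOPIC_RESOURCES` look-up table, and the example.com URLs are all hypothetical placeholders chosen for this example; they are not taken from the referenced application, and a real engine would likely use far richer models.

```python
import re
from collections import Counter

# Hypothetical stop word list; a production list would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "with", "and", "of", "to", "in"}

# Hypothetical mapping from frequent words to topics.
TOPIC_KEYWORDS = {"camera": "photography", "lens": "photography",
                  "flour": "baking", "oven": "baking"}

# Hypothetical look-up table from topics to related resources.
TOPIC_RESOURCES = {
    "photography": ["https://example.com/photo-guide"],
    "baking": ["https://example.com/bread-basics"],
}

def frequent_terms(text, top_n=5):
    """Tokenize, drop stop words, and return the most frequent words."""
    words = re.findall(r"[a-z']+", text.lower())
    kept = [w for w in words if w not in STOP_WORDS]
    return [w for w, _ in Counter(kept).most_common(top_n)]

def likely_topic(terms):
    """Pick the topic named by the most frequent terms, or None."""
    votes = Counter(TOPIC_KEYWORDS[t] for t in terms if t in TOPIC_KEYWORDS)
    return votes.most_common(1)[0][0] if votes else None

def recommend(text):
    """Return resources from the look-up table for the inferred topic."""
    topic = likely_topic(frequent_terms(text))
    return TOPIC_RESOURCES.get(topic, [])

sample = "The camera lens focuses light. A camera with a fast lens is useful."
print(recommend(sample))
```

In this sketch each frequent word casts one vote for its topic; weighting votes by word frequency would be an equally plausible design.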
When the user views a web page that discusses a particular topic, the query recommendation engine may determine that many other users who were interested in that topic also visited another, different web page, and typically spent a long time viewing that different web page. Responsive to this determination, the query recommendation engine may suggest that the user visit the different web page, or that the user execute a search query using query terms that produce search results that link to the different web page. This specification refers to both of these types of suggestions as “query recommendations.”
When a resource includes multiple content elements that are each associated with a different topic, the resulting query recommendations generated by the search engine are less likely to be focused on the particular topic in which the user is most interested, and may therefore be unhelpful. For instance, a resource may include a navigation bar, and textual content that discusses a primary topic. If the navigation bar is used as the basis for a query recommendation, the user may be presented with query recommendations that relate to navigation between web pages (e.g., the terms “home,” “contact us,” “sign up,” “log in,” etc.), instead of query recommendations that relate to the primary topic. Thus, unless the search engine is able to identify and select the content elements of the resource which discuss the primary topic from among all of the content elements of the resource, and unless the search engine is able to apply the query recommendation techniques to these selected content elements in particular, the resulting query recommendation is unlikely to be considered helpful to the user.
One approach to focusing query recommendations involves filtering “boilerplate content” from a resource prior to applying a query recommendation technique. Boilerplate content is content of a resource that is repeated in, or is common to, multiple resources or all resources of a particular web site, or portions of a resource that are not relevant to the main content of the resource. Specific examples of boilerplate content may include disclaimer or license information, navigation panels, copyright information, footer data, or promotional links. When the boilerplate content of a resource is filtered before a query recommendation technique is applied to the resource, the resulting query recommendations are more likely to be focused on the particular topic in which the user is most interested.
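The effect of filtering before recommending can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: content elements are modeled as plain strings, boilerplate is assumed to have been identified already, and the helper names `top_terms` and `filter_boilerplate` are hypothetical.

```python
from collections import Counter

def top_terms(elements, n=1):
    """Return the n most frequent words across the given content elements."""
    words = [w for el in elements for w in el.lower().split()]
    return [w for w, _ in Counter(words).most_common(n)]

def filter_boilerplate(elements, boilerplate):
    """Drop every content element already flagged as boilerplate."""
    return [el for el in elements if el not in boilerplate]

# Hypothetical content elements of one resource.
elements = [
    "home home home login login",   # navigation bar
    "telescope mirror telescope",   # primary content
    "copyright notice",             # footer
]
boilerplate = {"home home home login login", "copyright notice"}

print(top_terms(elements))                                   # dominated by navigation terms
print(top_terms(filter_boilerplate(elements, boilerplate)))  # reflects the primary topic
```

Without filtering, the navigation term dominates the frequency count; after filtering, the most frequent term comes from the primary content.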
The detection of boilerplate content may occur at the server side, for example by a search engine that indexes most or all of the web pages of a particular web site. Because the boilerplate content may appear in a fixed area of all of the web pages of a particular web site, a search engine may compare the Document Object Model (DOM) trees of the indexed web pages of the web site to identify content elements that are common to many or all web pages of the web site, or that occur at the same or a similar DOM tree node. Once identified as boilerplate content, these content elements may then be removed or demoted in the index of the search engine.
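One possible way to approximate the server-side comparison described above is sketched below in Python, using only the standard library. It is a simplification under stated assumptions: a DOM node is approximated by its tag path (for example `html/body/div`), a content element is the pair of tag path and text, and an element is treated as boilerplate when it appears on at least a chosen share of the site's pages. The class name, function names, and the `min_share` threshold are hypothetical and are not prescribed by this specification.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements have no end tag, so they are never pushed on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class PathTextCollector(HTMLParser):
    """Collect (tag path, text) pairs as a rough stand-in for DOM nodes."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.elements = set()

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.elements.add(("/".join(self.stack), text))

def page_elements(html):
    """Return the set of (tag path, text) pairs found in one page."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return collector.elements

def find_boilerplate(pages, min_share=0.8):
    """Return elements that occur on at least min_share of the pages."""
    counts = Counter()
    for html in pages:
        counts.update(page_elements(html))
    needed = min_share * len(pages)
    return {element for element, n in counts.items() if n >= needed}
```

Because each page contributes a set, an element repeated several times within one page is counted once for that page. A fuller implementation might also compare sibling positions or attributes to capture elements that occur at a similar, rather than identical, node.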