The approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Web pages form the core input dataset for Internet search and advertising companies, which therefore require algorithms for the proper analysis of web pages. Understanding the structure and content of a web page is useful in a variety of contexts.
A basic problem for an Internet application that automatically processes the content of web pages is determining which portions of a web page contain content that is meaningful to the application, and which portions should be disregarded. For example, a search engine automatically determines which web pages best match a user query. The basic premise of search engines today is that a web page containing all (or most) of the terms specified in a query string is a good candidate answer to the query. However, when textual content that matches the query terms is located in certain portions of a web page, such as an advertisement or a copyright notice, the web page is not necessarily relevant to the user's search. Consider, for instance, a web page containing the lyrics of a song X, but with links at the bottom of the page to other pages containing fragments of the lyrics of other popular songs Y and Z. A search query for Y and Z will match this page, since both Y and Z are mentioned on it; clearly, however, the page does not contain the information the user is looking for. Similarly, Y and Z may appear as text in advertisements on the web page. In another instance, a search for “copyright for company X” ought to return the main legal web page of company X's website, and not every page in that website that has a small “copyright” disclaimer at the bottom.
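The false-positive scenario above can be made concrete with a minimal sketch, not taken from the source: it assumes a page has already been segmented into labeled regions (the labels "main", "ads", and "footer" are hypothetical, since inferring such regions is precisely the hard part), and contrasts naive whole-page term matching with matching restricted to the main content region.

```python
# Hypothetical illustration of the boilerplate-matching problem.
# The region labels below are assumed; a real system must infer them.

def matches_anywhere(page_regions, query_terms):
    """Naive matching: does any text on the page contain every query term?"""
    full_text = " ".join(page_regions.values()).lower()
    return all(term.lower() in full_text for term in query_terms)

def matches_main_content(page_regions, query_terms):
    """Region-aware matching: consider only the main content region."""
    main_text = page_regions.get("main", "").lower()
    return all(term.lower() in main_text for term in query_terms)

# A lyrics page for song X, with an ad and footer links mentioning Y and Z.
page = {
    "main": "Full lyrics of song X ...",
    "ads": "Hot this week: lyrics of song Y",
    "footer": "More lyrics: song Z | copyright notice",
}

query = ["song Y", "song Z"]
print(matches_anywhere(page, query))      # True  -- a false positive
print(matches_main_content(page, query))  # False -- correctly rejected
```

Under naive matching, the page answers a query about songs Y and Z even though its subject is song X; restricting the match to main content rejects it, which is why segmenting pages into meaningful and disregardable portions matters.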