The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Webpages form the core input dataset for all Internet search and advertising companies, and this necessitates the development of algorithms for the proper analysis of webpages. Understanding the structure and content of a webpage is useful in a variety of contexts.
A basic problem for an Internet search engine is that of finding good results for a search query. The basic premise of all search engines today is that a webpage that contains all (or most) of the terms specified in a query string is a good candidate as an answer to the search query. However, this premise fails in a large number of cases. Consider, for instance, a webpage containing lyrics of a song X, but with links at the bottom of the page to other pages containing fragments from lyrics of other popular songs Y and Z. A search query for Y and Z will match this page, since both Y and Z are mentioned on the page; clearly, however, the page does not contain the information the user is looking for. Similarly, Y and Z may appear only in the text of advertisements on the webpage. In another instance, a search for “copyright for company X” ought to return the main legal webpage in the website for company X, and not every page in that website that has a small “copyright” disclaimer at the bottom.
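By way of illustration only, the whole-page matching premise described above may be sketched as follows. The function names `tokenize` and `matches_all_terms`, the sample page text, and the sample queries are hypothetical and are introduced here solely for the example; the sketch assumes a simple all-terms test and is not a description of any particular search engine.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_all_terms(query, page_text):
    """Return True if every query term occurs somewhere in the page text."""
    return tokenize(query) <= tokenize(page_text)


# Hypothetical page: lyrics of song X, followed by links at the bottom of
# the page that quote fragments of the lyrics of songs Y and Z.
page_text = (
    "Lyrics of song X: walking down the river road tonight. "
    "See also: yellow moon rising (song Y), zebra crossing blues (song Z)."
)

# The query is built only from the link text, yet the whole-page test
# still reports a match for the page about song X.
print(matches_all_terms("yellow moon zebra crossing", page_text))
```

In this sketch the query terms occur only in the links at the bottom of the page, so the page is reported as a match even though its main content concerns song X.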
As another example, a New York Times webpage may have a headline bar, sports, news items, and a copyright notice. A user may search for keywords such as “New York Times legal information.” The New York Times website probably has a webpage devoted to legal information, but the keywords may also match a news page that does not contain the information the user is seeking. To characterize such a news page more meaningfully, it is useful to determine that the webpage is mainly about the news item, and that the other content on the webpage is only marginally relevant. Thus, splitting up a webpage into different sections is useful to provide more relevant search results.
The main idea illustrated by these examples is simply that query terms should be matched only to the “main content” of a webpage, and not to all the side information and “look-and-feel” aspects of the webpage. This demonstrates the need to break up a webpage into blocks, or segments, each of which is a separate semantic unit of the webpage that is unrelated to the others. A block is a provisional segment: a block may itself become a segment, multiple blocks may be merged into a single segment, and a block may be further divided into multiple blocks.
A segmentation operation could put song lyrics and links to other lyrics pages in separate segments, create segments for ads or copyright notices on the page, and so on. A webpage may be divided into different segments such as the main content, navigation bar, advertising, footer, and so on.
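By way of illustration only, matching query terms against the main content of an already segmented webpage may be sketched as follows. The `Segment` class, the label strings, the function `matches_main_content`, and the sample data are hypothetical and are introduced solely for the example; the sketch assumes that the segments and their labels have already been produced by some segmentation operation, which is not shown.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    label: str  # hypothetical labels: "main", "navigation", "advertising", "footer"
    text: str


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def matches_main_content(query, segments):
    """Return True if every query term occurs in a segment labeled "main"."""
    main_tokens = set()
    for segment in segments:
        if segment.label == "main":
            main_tokens |= tokenize(segment.text)
    return tokenize(query) <= main_tokens


# Hypothetical segmentation of a lyrics page.
segmented_page = [
    Segment("main", "Lyrics of song X: walking down the river road tonight."),
    Segment("navigation", "See also: yellow moon rising, zebra crossing blues."),
    Segment("footer", "Copyright company X. All rights reserved."),
]

print(matches_main_content("river road", segmented_page))
print(matches_main_content("yellow moon", segmented_page))
```

Under these assumptions, terms that occur only in the navigation links or in the footer no longer cause the page to be treated as a match, while terms drawn from the lyrics of song X still do.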