1. Field of the Invention
Aspects of the present invention relate generally to a method for segmenting a webpage into visually and semantically cohesive pieces.
2. Description of Related Art
As web usage increases, the creation of content is becoming more mature and its presentation more sophisticated. As is known in the art, presentation involves placing different pieces of information on a webpage, each serving a different purpose, in a manner that appears coherent to users who browse the webpage. These pieces carry carefully placed visual and other cues that cause most users to subconsciously segment the browser-rendered page into semantic regions, each with a different purpose, function, and content.
Most of the techniques currently used for webpage segmentation rely on simple rule-based heuristics. These heuristics typically utilize several features present on a webpage, including geometry-related features, and apply their rules in a greedy fashion to produce the segments. While a heuristic approach might work well on small sets of pages and for the specific tasks for which it was designed, it has several problems.
For example, it is hard to automatically adapt the heuristics to keep abreast of the inherently dynamic nature of presentation styles and content types on the web. Furthermore, combining multiple heuristics that work well on different types of pages into a single all-in-one heuristic is a manually intensive trial-and-error effort. Moreover, since heuristics are inherently greedy, the solutions they produce tend to be local minima.
Thus, it would be desirable to use a more principled, generalized approach to webpage segmentation.