Users are spending more time on the Internet and performing more and more activities there, from online shopping to banking; meanwhile, Internet sites are becoming more complex in design and content. For example, one common way of performing activities on the Internet is through Webpages, which are HyperText Markup Language (HTML) pages provided by a server.
Websites are becoming more cluttered with guides and menus that attempt to improve the user's efficiency. Instead, these guides and menus often distract from the actual content of interest, tend to be less informative, and typically include unrelated material. These “features” may include script- and Flash-driven animations, navigation menus, pop-up advertisements, obtrusive banner advertisements, unnecessary images, links to related stories scattered around the Webpage, and so on. Further, these guides and menus can complicate Web content extraction and Web printing.
Providing a user-friendly experience for Web printing can depend heavily on extracting the desired information from semi-structured HTML pages that include these guides and menus. One solution to this problem proposes a template-independent method for Web content extraction based on visual features. Another solution proposes using a global heuristic of maximum subsequence segmentation based on word-level local classifiers and applying it to the domain of Websites. However, these methods may not accurately extract the Web content and therefore are not amenable to Web printing, because they identify only the boundary of the text body. This can result in extracting unwanted content, such as lists of links to related stories and advertisements, that may exist within the identified boundary. Also, these methods may not detect paragraph separation within the text body. Furthermore, the second solution is dependent on content domains and writing languages.
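For illustration only, the boundary-only limitation noted above can be seen in a minimal sketch of maximum subsequence segmentation. The sketch assumes each token has already been assigned a score by some word-level local classifier (positive for likely content, negative for likely clutter); the function name, the tokens, and the scores are all hypothetical and are not taken from either prior solution.

```python
from typing import List, Tuple


def max_subsequence(scores: List[float]) -> Tuple[int, int, float]:
    """Return (start, end, total) of the contiguous run of scores with the
    largest sum; `end` is exclusive. An empty run (0, 0, 0.0) is returned
    when no run has a positive sum."""
    best_start, best_end, best_sum = 0, 0, 0.0
    cur_start, cur_sum = 0, 0.0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running total cannot help; restart the run here.
            cur_start, cur_sum = i, score
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_start, best_end, best_sum = cur_start, i + 1, cur_sum
    return best_start, best_end, best_sum


# Hypothetical tokens and classifier scores (assumed, for illustration).
tokens = ["Home", "Login", "Storm", "hits", "coast",
          "Related:", "ads", "Residents", "evacuated", "Copyright"]
scores = [-0.5, -0.5, 0.5, 0.5, 0.5, -0.25, -0.25, 0.5, 0.5, -0.5]

start, end, total = max_subsequence(scores)
extracted = tokens[start:end]
print(extracted)
```

In this toy input the selected span runs from "Storm" through "evacuated", so the low-scoring "Related:" and "ads" tokens inside the boundary are extracted along with the article text, and nothing in the result marks where one paragraph ends and the next begins.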
The drawings described herein are for illustration purposes only and are not intended to limit the scope of the present disclosure in any way.