The World Wide Web contains huge amounts of knowledge that can provide substantial benefits to those who are able to find desired information. Information extraction is a technology directed towards the discovery and management of such web-based knowledge.
One information extraction task is the extraction of structured Web information about Web objects, which typically correspond to real-world entities such as people, organizations, locations, publications, and products. Such Web object extraction can be used to understand the visual layout structure of a webpage, including labeling the HTML elements of a page with the attribute names of an entity, e.g., a business name for one entity on the page, a business address for another.
One labeling mechanism leverages the result of understanding the page structure for use in free text segmentation and labeling. It takes the form of a joint model that employs a Hierarchical Conditional Random Fields (HCRF) model and an extended Semi-Markov Conditional Random Fields (Semi-CRF) model. This joint model is a top-down model: the HCRF model determines the structure in one decision, and the Semi-CRF model then uses this structure decision, along with a suitable source of information (e.g., a gazetteer for location labeling), to make a final labeling decision.
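The top-down flow described above may be illustrated with a minimal sketch. It is not an implementation of an HCRF or a Semi-CRF: simple rules stand in for both models, the second stage labels individual tokens rather than segments, and the names `Block`, `GAZETTEER`, `decide_structure`, `label_segments`, and `top_down_label` are hypothetical. The sketch shows only the direction of information flow, in which the structure decision is made once and is then consumed, but never revised, by the labeling stage.

```python
from dataclasses import dataclass

# Hypothetical gazetteer: a tiny set of known location names (illustrative only).
GAZETTEER = {"seattle", "redmond"}


@dataclass
class Block:
    """Text taken from one HTML element (hypothetical representation)."""
    text: str
    structure_label: str = ""


def decide_structure(blocks):
    """Stage 1, a stand-in for the HCRF: fix one structure label per block.

    A toy rule replaces real inference: a block containing a digit is
    treated as address-like, any other block as name-like.
    """
    for block in blocks:
        has_digit = any(ch.isdigit() for ch in block.text)
        block.structure_label = "ADDRESS_BLOCK" if has_digit else "NAME_BLOCK"
    return blocks


def label_segments(block):
    """Stage 2, a stand-in for the Semi-CRF: label the tokens of one block.

    It reads the stage 1 decision and the gazetteer, but it never
    changes the structure label it was given.
    """
    labels = []
    for token in block.text.split():
        word = token.strip(",.").lower()
        if block.structure_label == "ADDRESS_BLOCK" and word in GAZETTEER:
            labels.append((token, "CITY"))
        elif block.structure_label == "ADDRESS_BLOCK":
            labels.append((token, "STREET"))
        else:
            labels.append((token, "NAME"))
    return labels


def top_down_label(texts):
    """Run stage 1 once, then stage 2 on each block, with no feedback."""
    blocks = decide_structure([Block(t) for t in texts])
    return [label_segments(b) for b in blocks]
```

For example, `top_down_label(["Contoso Cafe", "1 Main St, Seattle"])` labels both tokens of the first block as `NAME` and, in the second block, labels `Seattle` as `CITY` through the gazetteer lookup. Because information flows in one direction only, a block that stage 1 labels incorrectly cannot be corrected by stage 2.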
However, there are drawbacks to this top-down technique. For example, business names are often difficult to identify on a webpage with such a model. Any improvement to the understanding of webpage content is desirable.