Current approaches for harvesting content from web pages must contend with the fact that each site and page can have a unique layout composed of multiple components in various places, such as content sections, ads, frames, columns, content boxes, page content that is divided into sub-divisions or sub-objects, web page sub-components, and content articles that continue across several sections or pages. As a result, present tools for crawling and mining content from such pages have sometimes needed to be specifically programmed for the unique layout and structure of each site and/or page so that they know where the content of interest is located in the layout for that site. In other words, the programming tells the tool which parts of the page represent content to keep, which represent content to discard, and which sections of the page correspond to various types of content. For example, when mining a news site, it may be desired to collect and operate on the body text of the news articles in the site while ignoring the ads and other sidebar content. Existing approaches in this regard have involved programming specific mining agents, or programming a general agent with specific rules or templates. Moreover, such programming often needs to be kept up to date for each site and/or page as the content and structure of the sites and pages change over time. This can be a labor-intensive process that does not scale well to mining very large numbers of sites and pages with differing layouts and structures.

Another existing approach uses statistical or natural language processing methods, or machine learning methods, to try to determine automatically which parts of sites and pages should be kept, which should be discarded, and which correspond to various types of content.
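
As a rough illustration of the rule- or template-based approach described above, the following minimal Python sketch shows a general agent driven by per-site templates. It is not taken from any particular existing tool. The site names, CSS class names, the `SITE_TEMPLATES` table, and the `TemplateExtractor` and `extract` names are all hypothetical and chosen only for illustration. The sketch assumes that each site marks its content regions with class attributes and that the markup is reasonably well formed.

```python
from html.parser import HTMLParser

# Hypothetical per-site templates: each one names the element classes whose
# text should be kept and the classes whose text should be discarded.
SITE_TEMPLATES = {
    "news.example.com": {"keep": {"article-body"}, "discard": {"ad", "sidebar"}},
    "blog.example.org": {"keep": {"post-content"}, "discard": {"promo"}},
}

# Elements that never have a closing tag, so they are not tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TemplateExtractor(HTMLParser):
    """Collects text found inside 'keep' regions but outside 'discard' regions."""

    def __init__(self, template):
        super().__init__()
        self.template = template
        self.stack = []   # one label per open element: "keep", "discard", or None
        self.chunks = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        if classes & self.template["discard"]:
            self.stack.append("discard")
        elif classes & self.template["keep"]:
            self.stack.append("keep")
        else:
            self.stack.append(None)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and "keep" in self.stack and "discard" not in self.stack:
            self.chunks.append(text)


def extract(site, html):
    """Apply the template registered for `site`; raises KeyError if none exists."""
    parser = TemplateExtractor(SITE_TEMPLATES[site])
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Invented sample page for the hypothetical news site.
SAMPLE_PAGE = (
    '<div class="sidebar">Buy now</div>'
    '<div class="article-body"><p>Story text.</p>'
    '<div class="ad">Ad copy</div><p>More story.</p></div>'
)

if __name__ == "__main__":
    print(extract("news.example.com", SAMPLE_PAGE))
    # A redesign that renames the content class silently yields no text.
    print(repr(extract("news.example.com",
                       SAMPLE_PAGE.replace("article-body", "story"))))
```

The sketch also makes the maintenance burden concrete: a site with no template entry cannot be mined at all, and a site that renames a single class yields no text until its template is updated by hand.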