The World Wide Web (“web”) provides vast amounts of information that is accessible via web pages. Web pages can contain either static content or dynamic content. Static content refers generally to information that may stay the same across many accesses of the web pages. Dynamic content refers generally to information that is stored in a web database and is added to a web page in response to a search request. Dynamic content represents what has been referred to as the deep web or hidden web.
Many search engine services allow users to search for static content of the web. After a user submits a search request or query that includes search terms, the search engine service identifies web pages that may be related to those search terms. These web pages are the search result. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on.
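The keyword-to-page mapping described above can be illustrated with a minimal sketch. The page URLs, page text, and the function names `build_keyword_index` and `search` are hypothetical, and the tokenizer is deliberately simplistic; a real search engine service would weight headlines, metadata, and highlighted words rather than treat all words equally.

```python
import re
from collections import defaultdict

def build_keyword_index(pages):
    """Map each lowercase keyword to the set of page URLs that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs of pages that contain every term of the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical crawled pages (URL -> extracted text).
PAGES = {
    "http://example.com/a": "Digital camera reviews",
    "http://example.com/b": "Camera lens buying guide",
}
INDEX = build_keyword_index(PAGES)
```

Because the index is built once during crawling, a query such as `search(INDEX, "camera lens")` only needs set intersections at search time rather than a scan of every page.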
These search engine services, however, generally do not provide for searching of dynamic content, which is also considered non-crawlable content. Many web pages contain dynamic content generated from a structured source (e.g., a relational database). When a web page containing such dynamic content is generated, the structured data of the underlying structured source is encoded in the web page in an unstructured or semi-structured manner. One problem with searching such dynamic content is that it is difficult to identify the schema of the underlying structured source from the web pages. A schema defines the information or attributes that are stored in the underlying structured source. Because of this difficulty, the querying of web pages with such dynamic content often provides unsatisfactory results.
Attempts have been made to identify the schema of the dynamic content of web pages so that the content may be transformed into a more structured format to facilitate searching. Programs referred to as “wrappers” extract information from web pages and organize it in a structured format. Manually generating a wrapper for the web pages of a web site can be time-consuming. Thus, it is impractical to manually generate wrappers for the millions of web pages of the thousands of web sites that provide dynamic content.
Some automatic wrapper “induction” or generation systems have been developed. Wrapper induction is the process of learning the schema of the dynamic content of a web page and generating a wrapper to extract the data from the web page and store the extracted data in a structured format identified by the schema. These automatic wrapper induction systems trade off the effectiveness of a wrapper against its expressiveness. Effectiveness refers to how accurate a wrapper is at extracting content from web pages that are not used in the wrapper induction process but that share the same “template.” A wrapper induction system generates a wrapper for a template using a training set of web pages. The wrapper is then used to extract data from web pages that share the same template. Expressiveness refers to the scope of web pages that can be processed by a wrapper as identified by the wrapper's template. To make a wrapper more expressive, the wrapper induction systems generally introduce wildcards (e.g., “*”) into the wrappers so that more web pages will be within the scope of a wrapper. In general, however, as the expressiveness of a wrapper increases, its effectiveness decreases, and vice versa.
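The trade-off between expressiveness and effectiveness can be made concrete with a small sketch in which a wrapper is modeled as a regular expression. This is only an illustration under assumed conditions: the HTML snippets, the field names `title` and `price`, and the two patterns `STRICT` and `LOOSE` are hypothetical and are not taken from any particular wrapper induction system.

```python
import re

# A strict wrapper: matches only the exact markup seen in training pages.
STRICT = re.compile(
    r'<tr><td class="title">(?P<title>[^<]+)</td>'
    r'<td class="price">(?P<price>[^<]+)</td></tr>'
)

# A more expressive wrapper: wildcards replace the attribute text and
# whatever sits between the two cells, so more pages fall within its scope.
LOOSE = re.compile(
    r'<td[^>]*>(?P<title>[^<]+)</td>.*?<td[^>]*>(?P<price>[^<]+)</td>',
    re.S,
)

def extract(wrapper, html):
    """Return the extracted record as a dict, or None if the page is out of scope."""
    match = wrapper.search(html)
    return match.groupdict() if match else None

# Hypothetical pages: one from the training template, one close variant,
# and one unrelated navigation row.
SAME_TEMPLATE = '<tr><td class="title">Widget</td><td class="price">$5</td></tr>'
VARIANT = '<tr><td class="name">Gadget</td><td class="cost">$9</td></tr>'
UNRELATED = '<tr><td>Home</td><td>About us</td></tr>'
```

In this sketch the strict wrapper rejects the variant page even though it carries the same kind of data, while the loose wrapper accepts it. The loose wrapper also accepts the unrelated row and returns a "price" of `About us`, which is the loss of effectiveness that accompanies the added wildcards.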
To provide an acceptable trade-off between effectiveness and expressiveness, typical wrapper induction systems divide the training web pages into clusters according to templates representing the organization of the dynamic content on the web pages. Thus, web pages with a similar organization (i.e., having the same template) are clustered together. These wrapper induction systems can then automatically generate a wrapper for the web pages within each cluster. Since the web pages of a cluster are similar, such wrappers can use a limited number of wildcards to increase expressiveness and still attain acceptable effectiveness.
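As a rough illustration of clustering by template, the sketch below groups pages whose sequence of opening HTML tags is identical. Using the exact tag sequence as a template signature is an assumption made here for brevity, not the method of any specific system; practical systems would need a similarity measure that tolerates repeated rows and optional elements. The class `TagSequence`, the functions `template_signature` and `cluster_by_template`, and the sample pages are all hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

class TagSequence(HTMLParser):
    """Collect the names of opening tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def template_signature(html):
    """Return the page's tag sequence as a hashable tuple (a crude template proxy)."""
    parser = TagSequence()
    parser.feed(html)
    parser.close()
    return tuple(parser.tags)

def cluster_by_template(pages):
    """Group page URLs whose template signatures are identical."""
    clusters = defaultdict(list)
    for url, html in pages.items():
        clusters[template_signature(html)].append(url)
    return dict(clusters)

# Hypothetical training pages (URL -> HTML).
TRAINING_PAGES = {
    "/p/1": "<html><body><h1>Widget</h1><p>$5</p></body></html>",
    "/p/2": "<html><body><h1>Gadget</h1><p>$9</p></body></html>",
    "/about": "<html><body><ul><li>Team</li></ul></body></html>",
}
CLUSTERS = cluster_by_template(TRAINING_PAGES)
```

Here the two product pages land in one cluster and the "about" page in another, so a wrapper induced from the product cluster needs few wildcards to cover both of its pages.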
The accuracy of a wrapper generated by such typical wrapper induction systems, however, depends in large part on how accurately web pages that have the same template are clustered together. Some wrapper induction systems simply cluster web pages based on similarity between the URLs of the web pages. This simple approach to clustering is appropriate when a web site stores web pages that use the same template in the same subdirectory of the web site. In such a case, their URLs share a prefix that indicates the location of the subdirectory. Many web sites, however, use a much more complex approach when defining URLs for web pages. As a result, web pages with similar URLs may have very different templates, and web pages with very different URLs may have very similar templates. Thus, it can be very difficult to accurately cluster web pages based on similarity of their organization, resulting in wrappers with an unacceptable trade-off between effectiveness and expressiveness.
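The URL-based clustering described above, and the way it breaks down, can be sketched as follows. The function name `cluster_by_url_prefix` and the example URLs are hypothetical; the sketch simply groups pages by host and directory portion of the path, which is one plausible reading of "same prefix."

```python
import posixpath
from collections import defaultdict
from urllib.parse import urlsplit

def cluster_by_url_prefix(urls):
    """Group URLs by (host, directory of the path), ignoring any query string."""
    clusters = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        directory = posixpath.dirname(parts.path)
        clusters[(parts.netloc, directory)].append(url)
    return dict(clusters)

# Hypothetical URLs. The first three follow a subdirectory-per-template layout;
# the last two are served by one script whose template depends on the query.
URLS = [
    "http://shop.example.com/products/101.html",
    "http://shop.example.com/products/102.html",
    "http://shop.example.com/reviews/101.html",
    "http://shop.example.com/page?id=7",
    "http://shop.example.com/page?id=8",
]
URL_CLUSTERS = cluster_by_url_prefix(URLS)
```

The product and review pages are separated correctly because each template has its own subdirectory. The two `page?id=` URLs fall into a single cluster regardless of whether they share a template, which is the failure mode that arises when a site does not organize its URLs by template.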