The World Wide Web (“web”) provides vast amounts of information that is accessible via web pages. Web pages can contain either static content or dynamic content. Static content refers generally to information that may stay the same across many accesses of the web pages. Dynamic content refers generally to information that is stored in a web database and is added to a web page in response to a search request. Dynamic content represents what has been referred to as the deep web or hidden web.
Many search engine services allow users to search for static content of the web. After a user submits a search request or query that includes search terms, the search engine service identifies web pages that may be related to those search terms. These web pages are the search result. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on.
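By way of illustration only, the crawling and keyword mapping described above can be sketched as follows. This is a minimal sketch, not the implementation of any particular search engine service. The in-memory `PAGES` table, the example URLs, and the function names `crawl`, `build_keyword_index`, and `search` are all hypothetical, and whitespace tokenization stands in for the information retrieval techniques mentioned above.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
PAGES = {
    "root.example/": ("Book store home",
                      ["root.example/fiction", "root.example/about"]),
    "root.example/fiction": ("Fiction book catalog", ["root.example/"]),
    "root.example/about": ("About the store", []),
    "orphan.example/": ("Unlinked book page", []),  # not linked from any root
}


def crawl(roots, pages):
    """Breadth-first traversal: return every known URL reachable from the roots."""
    start = [url for url in roots if url in pages]
    seen, queue = set(start), deque(start)
    while queue:
        url = queue.popleft()
        _, links = pages[url]
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def build_keyword_index(urls, pages):
    """Map each keyword to the set of crawled pages that contain it."""
    index = defaultdict(set)
    for url in urls:
        text, _ = pages[url]
        # Assumption: lowercase whitespace tokens serve as the keywords.
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every search term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


reachable = crawl(["root.example/"], PAGES)
index = build_keyword_index(reachable, PAGES)
```

In this sketch, a page that is not reachable from a root web page (here, `orphan.example/`) never enters the mapping, so it cannot appear in a search result even when it contains the search terms.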
These search engine services, however, do not in general provide for searching of dynamic content, which is also considered noncrawlable content. One problem with searching dynamic content is that it is difficult or impossible to directly obtain the schemas of the corresponding web databases without the cooperation of the web site that provides the web database. A schema defines the information or attributes that are stored in the database. For example, a web database that holds a bookseller's catalog of books may have a schema that includes a title attribute and an author attribute for each book. Without knowing the schema, it would be very difficult for a search engine service to crawl the content of a web database to determine what information is available for searching. Even if the schema of a web database were known, a search engine service would still need to determine how to crawl the web database to retrieve its content. Even assuming that a search engine service could retrieve the content of web databases, it would still need to identify when attributes of different schemas are semantically equivalent. For example, bookseller web sites may have catalogs that specify whether a book is a paperback, a hardcover, or a compact disc. One bookseller's web site may name this attribute “type,” and another bookseller's web site may name the same attribute “format.” To allow effective searching of dynamic content across multiple web sites, a search engine service needs to know the meaning or semantics of the attributes of the web databases.
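The attribute-naming problem in the bookseller example can be illustrated with a short sketch. The records, the site names, the shared attribute name `binding`, and the functions `normalize` and `search_all` are hypothetical. The `ATTRIBUTE_MAP` table is written by hand purely for illustration; it stands in for the semantic knowledge that a search engine service would somehow need to obtain, and it is not a technique for discovering that knowledge.

```python
# Hypothetical records from two bookseller web databases whose schemas
# name the same attribute differently ("type" versus "format").
SITE_A = [
    {"title": "Dune", "author": "Frank Herbert", "type": "paperback"},
    {"title": "Emma", "author": "Jane Austen", "type": "hardcover"},
]
SITE_B = [
    {"title": "Dune", "author": "Frank Herbert", "format": "hardcover"},
    {"title": "Ulysses", "author": "James Joyce", "format": "paperback"},
]

# Hand-written assumption: per-site attribute name -> shared semantic name.
ATTRIBUTE_MAP = {
    "site_a": {"type": "binding"},
    "site_b": {"format": "binding"},
}


def normalize(record, site):
    """Rename a record's site-specific attributes to the shared names."""
    renames = ATTRIBUTE_MAP.get(site, {})
    return {renames.get(name, name): value for name, value in record.items()}


def search_all(sites, attribute, value):
    """Search every site for records whose shared attribute equals value."""
    hits = []
    for site, records in sites.items():
        for record in records:
            unified = normalize(record, site)
            if unified.get(attribute) == value:
                hits.append((site, unified["title"]))
    return hits


sites = {"site_a": SITE_A, "site_b": SITE_B}
paperbacks = search_all(sites, "binding", "paperback")
```

Without the mapping, a query on the attribute name used by one web site would silently miss the matching records of the other web site, because neither schema mentions the other's attribute name.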
It would be desirable to have a technique that would automatically identify schemas associated with web databases and identify attributes of different schemas that represent the same semantic content.