The World Wide Web (“web”) provides a vast amount of information that is accessible via web pages. Web pages can contain either static content or dynamic content. Static content refers generally to information that may stay the same across many accesses of a web page. Dynamic content refers generally to information that is stored in a web database and is added to a web page in response to a search request. Dynamic content represents what has been referred to as the deep web or hidden web.
Many search engine services allow users to search for static content of the web. After a user submits a search request or query that includes search terms, the search engine service identifies web pages that may be related to those search terms. These web pages are the search result. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on.
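The keyword-to-page mapping described above can be illustrated with a minimal sketch. It assumes a small in-memory stand-in for the web (the hypothetical `TOY_WEB` dictionary) and treats every word of a page as a keyword; a real search engine service would fetch pages over the network and apply the information retrieval techniques mentioned above. All names and URLs in the sketch are illustrative assumptions.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/": ("Harry Potter books", ["a.example/rowling", "a.example/about"]),
    "a.example/rowling": ("Rowling author page", ["a.example/"]),
    "a.example/about": ("About this site", []),
    "b.example/orphan": ("Unlinked page", []),
}


def crawl(roots, web):
    """Breadth-first traversal: return every URL reachable from the root pages."""
    seen = {url for url in roots if url in web}
    queue = deque(seen)
    while queue:
        url = queue.popleft()
        _, links = web[url]
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def build_keyword_index(roots, web):
    """Map each keyword (here, simply each lowercased word) to the pages containing it."""
    index = defaultdict(set)
    for url in crawl(roots, web):
        text, _ = web[url]
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the pages that contain every search term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

In this sketch the page `b.example/orphan` is never indexed when crawling starts from `a.example/`, because no crawled page links to it. Content that is reachable only by submitting a search form is unreachable to such a crawler for the same reason.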
These search engine services, however, do not in general provide for searching of dynamic content, which is also considered noncrawlable content. One problem with searching of dynamic content is that the content of the web databases cannot be effectively retrieved and indexed for several reasons. One reason is that the content of multiple web databases may be too large to retrieve and index. Another reason is that the schemas of web databases are hidden behind their search interfaces; that is, only the attributes of the search web page (and result web page) are exposed to a user. Another problem with searching of dynamic content is that the generated index would need to support both unstructured and structured queries. An unstructured query is a list of search terms of the kind generally used when searching for documents. For example, an unstructured query may be “Harry Potter Rowling.” A structured query is a list of attributes and attribute values of the kind generally used when searching a database. For example, a structured query may be “title=Harry Potter and author=Rowling.”
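The difference between the two query types can be made concrete with a short sketch. It assumes the textual form of the structured query shown above (attribute=value pairs joined by " and ") and two hypothetical book records; the function names and the matching rules are illustrative assumptions, not a prescribed query syntax.

```python
def parse_unstructured(query):
    """'Harry Potter Rowling' -> ['Harry', 'Potter', 'Rowling']"""
    return query.split()


def parse_structured(query):
    """'title=Harry Potter and author=Rowling' -> {'title': 'Harry Potter', 'author': 'Rowling'}

    Assumes clauses are joined by ' and ' and that values do not themselves
    contain ' and '.
    """
    pairs = {}
    for clause in query.split(" and "):
        attribute, _, value = clause.partition("=")
        pairs[attribute.strip()] = value.strip()
    return pairs


def matches_unstructured(record, terms):
    """True if every term appears as a word somewhere in the record, in any attribute."""
    words = set(" ".join(record.values()).lower().split())
    return all(term.lower() in words for term in terms)


def matches_structured(record, pairs):
    """True if every attribute of the record contains the requested value."""
    return all(
        value.lower() in record.get(attribute, "").lower()
        for attribute, value in pairs.items()
    )


# Hypothetical records, invented for illustration only.
RECORDS = [
    {"title": "Harry Potter", "author": "Rowling"},
    {"title": "The Rowling Story", "author": "Harry Potter"},
]
```

With these records, the unstructured query matches both entries because it ignores which attribute a term appears in, whereas the structured query matches only the first. An index that serves both query types would need to retain the attribute information as well as the terms.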
Considerable research has been conducted into developing a “metasearcher” that provides searching across multiple web databases. When the metasearcher receives a query, it selects the web databases that most likely contain relevant content, referred to as “source selection.” The metasearcher then translates the query into a suitable format for each of the selected web databases, referred to as “query translation.” To do so, the metasearcher would need to understand how to map the attributes of its own queries to the site attributes of each selected web database. For example, the metasearcher may use an attribute named “format” to refer to the medium (e.g., paperback or hardback) of a book, whereas a web database may use an attribute named “type” to refer to the same data. Query translation then needs to map the format attribute of the metasearcher to the type attribute of the web database. The metasearcher sends the translated queries to the selected web databases, referred to as “dispatching.” When the metasearcher receives the results of the searches, it integrates them into an overall result, referred to as “result integration.”
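The four metasearcher stages can be sketched end to end as follows. The sketch assumes hypothetical in-memory web databases (`SOURCES`), each with a hand-written map from metasearcher attribute names to its own site attribute names. Source selection is simplified to "supports every attribute of the query", which is only a stand-in for estimating which databases most likely contain relevant content, and dispatching is simulated by a local filter rather than a request to a real search interface.

```python
# Hypothetical web databases. "attribute_map" maps metasearcher attribute
# names to the attribute names used by that site.
SOURCES = {
    "books-a": {
        "attribute_map": {"title": "title", "format": "type"},
        "records": [
            {"title": "Harry Potter", "type": "paperback"},
            {"title": "Harry Potter", "type": "hardback"},
        ],
    },
    "books-b": {
        "attribute_map": {"title": "name", "format": "format"},
        "records": [{"name": "Harry Potter", "format": "paperback"}],
    },
    "cars": {
        "attribute_map": {"make": "make"},
        "records": [{"make": "Volvo"}],
    },
}


def select_sources(query, sources):
    """Source selection (simplified): keep databases that support every query attribute."""
    return [
        name
        for name, source in sources.items()
        if all(attribute in source["attribute_map"] for attribute in query)
    ]


def translate(query, source):
    """Query translation: rename metasearcher attributes to the site's attributes."""
    return {source["attribute_map"][attr]: value for attr, value in query.items()}


def dispatch(site_query, source):
    """Dispatching (simulated): return the site's records that match the translated query."""
    return [
        record
        for record in source["records"]
        if all(record.get(attr) == value for attr, value in site_query.items())
    ]


def integrate(results_by_source, sources):
    """Result integration: map site attributes back to metasearcher names and merge."""
    merged = []
    for name, records in results_by_source.items():
        reverse = {site: meta for meta, site in sources[name]["attribute_map"].items()}
        for record in records:
            renamed = {reverse.get(attr, attr): value for attr, value in record.items()}
            merged.append({"source": name, **renamed})
    return merged


def metasearch(query, sources):
    """Run source selection, query translation, dispatching, and result integration."""
    selected = select_sources(query, sources)
    results = {
        name: dispatch(translate(query, sources[name]), sources[name])
        for name in selected
    }
    return integrate(results, sources)
```

For the query `{"title": "Harry Potter", "format": "paperback"}`, the sketch translates `format` to `type` for the hypothetical `books-a` database, leaves it unchanged for `books-b`, and skips `cars` entirely. The hand-written attribute maps are the part that is hard to obtain in practice, since the schemas of real web databases are hidden behind their search interfaces.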
It would be desirable to have a technique for efficiently generating indexes for web databases that would allow for effective searching using both unstructured and structured queries.