The immense scale and widespread use of the World Wide Web have made it, as an information repository, not only a source where information is found but also a destination where information is published. These dual forces have enriched the Web with all kinds of data, far beyond the conventional view of the Web as a corpus of HTML pages, or “documents.” Consequently, the Web is now a collection of data-rich pages, both on the “surface Web” of static URLs (e.g., personal or company homepages) and on the “deep Web” of database-backed content (e.g., flights from aa.com).
While data are proliferating, however, current search tools that rely on traditional document retrieval techniques have limited ability to access such data effectively (e.g., finding the customer service number of Amazon). Often, it is difficult to formulate the right keywords to hit promising documents. Furthermore, one has to sift through the returned documents one by one to identify the desired data.
Conventional search engines typically use keyword searching to find relevant documents or pages. The documents may be ranked according to a calculated estimate of relevance. Such conventional techniques are optimized for finding documents relevant to the keywords, not for directly finding data as answers to specific questions. Given the documents returned by a conventional search engine, a user has to sift through them one by one to look for the specific answer. Moreover, while the highly ranked documents may specifically address topics in which the user is interested and may contain all the keywords used in the query, they may contain no information, or only irrelevant or inaccurate information, regarding the specific datum the user wants. The specific datum or answer the user is looking for may instead appear in many lower ranked documents. Depending on the relative rankings of these pages, the user entering the query may have to sift through many pages before finding the correct datum, or may never visit the low ranked documents and thus miss the sources having the information he or she is actually seeking.
Chakrabarti et al., “Ranking Objects by Exploiting Relationships: Computing Top-K over Aggregation,” SIGMOD 2006, June 2006, describes an “Object Finder” that views an “object” as the collection of documents that are related to it and therefore scores an “object” by performing aggregation over document scores. The relevance score of an object is an aggregate of the full text search scores (i.e., the keyword match scores returned by full text search) of all related documents containing the query keywords.
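By way of illustration only, the aggregation-based scoring described above may be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of the cited reference: the function names `score_objects` and `top_k`, the data layout, and the use of summation as the aggregate are all hypothetical choices made for the example.

```python
def score_objects(doc_scores, related_docs, aggregate=sum):
    """Score each object by aggregating the scores of its related documents.

    doc_scores:   dict mapping document id -> full text search score; only
                  documents containing the query keywords are present.
    related_docs: dict mapping object id -> list of related document ids.
    aggregate:    function combining a list of scores (assumed here: sum).
    """
    object_scores = {}
    for obj, docs in related_docs.items():
        # Keep only the related documents that matched the query keywords.
        matched = [doc_scores[d] for d in docs if d in doc_scores]
        if matched:
            object_scores[obj] = aggregate(matched)
    return object_scores


def top_k(object_scores, k):
    """Return the k highest-scoring (object, score) pairs, best first."""
    ranked = sorted(object_scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]


# Hypothetical data: three objects, four documents, one of which (d3)
# does not contain the query keywords and so has no score.
doc_scores = {"d1": 0.5, "d2": 0.25, "d4": 1.0}
related_docs = {"A": ["d1", "d2"], "B": ["d3"], "C": ["d4"]}

scores = score_objects(doc_scores, related_docs)
best = top_k(scores, 2)
```

In this sketch an object is ranked solely by the scores of whole documents related to it, so an object with no matching documents receives no score at all.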
A more effective mechanism is desired for searching over data or entities on the Web or in another database of documents or pages.