Many search engine services, such as Google and Overture, support searching for information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to them. After a user submits a search request (also referred to as a “query”) that includes search terms, the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, a search engine service may maintain a mapping of keywords to web pages. The search engine service may generate this mapping by “crawling” the web (i.e., the World Wide Web) to extract the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages and identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be extracted using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service may calculate a score that indicates how to rank the web pages based on the relevance of each web page to the search request, web page popularity (e.g., Google's PageRank), and so on. The search engine service then displays to the user the links to those web pages in the order indicated by the scores.
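The keyword-to-page mapping and score-based ordering described above can be illustrated with a minimal Python sketch. The page URLs, keyword lists, popularity values, and the scoring formula (fraction of query terms matched, multiplied by a popularity value) are all hypothetical assumptions chosen for illustration; they do not reproduce the ranking method of any actual search engine service.

```python
from collections import defaultdict

# Hypothetical toy corpus: URL -> (keywords already extracted by a crawler,
# popularity value between 0 and 1). All values are made up for illustration.
PAGES = {
    "a.example/review": (["camera", "review", "lens"], 0.9),
    "b.example/shop": (["camera", "price", "sale"], 0.5),
    "c.example/paper": (["journal", "article"], 0.7),
}


def build_index(pages):
    """Map each keyword to the set of URLs whose keyword list contains it."""
    index = defaultdict(set)
    for url, (keywords, _popularity) in pages.items():
        for keyword in keywords:
            index[keyword].add(url)
    return index


def search(query, pages, index):
    """Return URLs matching at least one query term, highest score first."""
    terms = query.lower().split()
    candidates = set()
    for term in terms:
        candidates |= index.get(term, set())

    def score(url):
        keywords, popularity = pages[url]
        matched = sum(1 for term in terms if term in keywords)
        relevance = matched / len(terms)
        # Assumption: combine relevance and popularity by simple multiplication.
        return relevance * popularity

    return sorted(candidates, key=score, reverse=True)


if __name__ == "__main__":
    index = build_index(PAGES)
    print(search("camera review", PAGES, index))
```

In this sketch the index is consulted first so that only pages sharing at least one term with the query are scored, which is the reason a service would maintain such a mapping rather than scan every page per request.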
A web page may contain information about various types of objects such as products, people, papers, organizations, and so on, which are referred to as “web objects.” For example, one web page may contain a product review of a certain model of camera, and another web page may contain an advertisement offering to sell that model of camera at a certain price. As another example, one web page may contain a journal article, and another web page may be the homepage of an author of the journal article. A person who is searching for information about an object may need information that is contained in different web pages. For example, a person who is interested in purchasing a certain camera may want to read reviews of the camera and to determine who is offering the camera at the lowest price.
To obtain such information, a person would typically use a search engine to find web pages that contain information about the camera. The person would enter a search query that may include the manufacturer and model number of the camera. The search engine then identifies web pages that match the search query and presents those web pages to the person in an order that is based on how relevant the content of each web page is to the search query. The person would then need to view the various web pages to find the desired information. For example, the person may first try to find web pages that contain reviews of the camera. After reading the reviews, the person may then try to locate a web page that contains an advertisement for the camera at the lowest price.
To make it easier to access information about web objects, many systems have been developed to extract information about web objects from web pages. Web pages often allocate a record for each object that is to be displayed. For example, a web page that lists several cameras for sale may include a record for each camera. Each record contains attributes of the object such as an image of the camera, its make and model, and its price. The extraction of such information can be difficult because web pages contain a wide variety of layouts of records and layouts of attributes within records.
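As a rough illustration of a record and its attributes, and of why varying layouts make extraction difficult, the following Python sketch parses one assumed text layout into a structured record. The `CameraRecord` type, the `parse_record` helper, and the layout "make model - $price" are hypothetical and are not drawn from any particular extraction system; the image attribute is omitted for brevity.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class CameraRecord:
    """Attributes of one camera listed on a web page (hypothetical schema)."""
    make: str
    model: str
    price: float


# Assumed layout of a record's text: "<make> <model> - $<price>".
_PATTERN = re.compile(
    r"^(?P<make>\w+) (?P<model>[\w-]+) - \$(?P<price>\d+(?:\.\d{2})?)$"
)


def parse_record(text: str) -> Optional[CameraRecord]:
    """Return a CameraRecord if the text follows the assumed layout, else None."""
    match = _PATTERN.match(text.strip())
    if match is None:
        return None
    return CameraRecord(match["make"], match["model"], float(match["price"]))


if __name__ == "__main__":
    # Follows the assumed layout, so the attributes are recovered.
    print(parse_record("Acme X-100 - $299.99"))
    # Same information in a different layout: this extractor finds nothing.
    print(parse_record("Price: $299.99 | X-100 by Acme"))
```

The second call carries the same make, model, and price but in a different arrangement, so a rule written for one layout fails on it. A practical extractor would have to cope with many such layouts.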
Users can submit queries to a search system to locate information about web objects of interest in a manner similar to how users submit queries to locate web pages of interest. When a user submits a query to locate web object information, traditional database-type retrieval techniques can be used to search for a web object with attributes that match the query. When applied to web objects, these traditional techniques are not particularly effective because they assume that the underlying data is reliable. The extraction of web object information can, however, be unreliable for several reasons. First, it can be difficult to precisely identify the record or the portion of a web page that corresponds to a web object. For example, it can be difficult to determine whether adjacent text represents data for the same object or two different objects. If a record is not identified correctly, then the identification of the attributes will likely not be correct. Second, even if the identification of a record is correct, the attributes of the record may still be incorrectly identified. For example, it can be difficult to determine whether a certain number in a record corresponds to the weight of the product, a dimension of the product, and so on. Third, the data source that provides the web page may itself provide unreliable data. For example, a web page advertising a product may simply have wrong information such as the wrong manufacturer for a certain model number of television. Because of this unreliability of extracted web object information, systems that perform searches based on information extracted about web objects often do not provide satisfactory results.
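The effect of unreliable extraction on database-type retrieval can be seen in a small Python sketch. The three records below are hypothetical: all describe the same television, but one comes from a source that states the wrong manufacturer and one has its attributes swapped during extraction. The `exact_match` helper is an assumed stand-in for a traditional attribute-equality lookup, not a specific database API.

```python
# Hypothetical extracted records that all refer to the same television.
RECORDS = [
    # Extracted correctly.
    {"manufacturer": "Acme", "model": "TV-55", "source": "a.example"},
    # The data source itself lists the wrong manufacturer.
    {"manufacturer": "Apex", "model": "TV-55", "source": "b.example"},
    # The extractor assigned the values to the wrong attributes.
    {"manufacturer": "TV-55", "model": "Acme", "source": "c.example"},
]


def exact_match(records, **criteria):
    """Return records whose attributes equal every given criterion exactly."""
    return [
        record
        for record in records
        if all(record.get(key) == value for key, value in criteria.items())
    ]


if __name__ == "__main__":
    hits = exact_match(RECORDS, manufacturer="Acme", model="TV-55")
    # Only the correctly extracted record is found; the other two records
    # about the same object are silently missed.
    print([record["source"] for record in hits])
```

Because the lookup trusts each attribute value as stored, two of the three records about the queried object are not returned, which is the kind of unsatisfactory result described above.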