Web pages accessible via the Internet contain a vast amount of information. A web page may contain information about various types of objects such as products, people, papers, organizations, and so on. For example, one web page may contain a product review of a certain model of camera, and another web page may contain an advertisement offering to sell that model of camera at a certain price. As another example, one web page may contain a journal article, and another web page may be the homepage of an author of the journal article. A person who is searching for information about an object may need information that is contained in different web pages. For example, a person who is interested in purchasing a certain camera may want to read reviews of the camera and to determine who is offering the camera at the lowest price.
To obtain such information, a person would typically use a search engine to find web pages that contain information about the camera. The person would enter a search query that may include the manufacturer and model number of the camera. The search engine then identifies web pages that match the search query and presents those web pages to the person in an order that is based on how relevant the content of each web page is to the search query. The person would then need to view the various web pages to find the desired information. For example, the person may first try to find web pages that contain reviews of the camera. After reading the reviews, the person may then try to locate a web page that contains an advertisement offering the camera at the lowest price.
Web search systems have not been particularly helpful to users trying to find information about a specific product because of the difficulty in accurately identifying objects and their attributes from web pages. Web pages often allocate a record for each object that is to be displayed. For example, a web page that lists several cameras for sale may include a record for each camera. Each record contains attributes of the object such as an image of the camera, its make and model, and its price. Web pages use a wide variety of layouts for records and for attributes within records. Systems that identify records and their attributes from web pages are typically either template-dependent or template-independent. Template-dependent systems may have templates for both the layout of records on web pages and the layout of attributes within a record. Such a system finds record templates that match portions of a web page and then finds attribute templates that match the attributes of each record. Template-independent systems, in contrast, typically try to identify whether a web page is a list page (i.e., a page listing multiple records) or a detail page (i.e., a page containing a single record). The template-independent system then tries to identify records from metadata of the web page (e.g., tables) based on this distinction. Such systems may then use various heuristics to identify the attributes of the records.
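The two-stage, template-dependent approach described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that record and attribute templates are expressed as regular expressions over HTML; the names RECORD_TEMPLATE, ATTRIBUTE_TEMPLATES, extract_records, and SAMPLE_PAGE, as well as the markup and product data, are hypothetical and are not taken from any particular system.

```python
import re
from typing import Dict, List

# Hypothetical record template: each record is assumed to be an
# <li class="item"> element. Real systems may use far richer templates.
RECORD_TEMPLATE = re.compile(r'<li class="item">(.*?)</li>', re.DOTALL)

# Hypothetical attribute templates, applied only inside a matched record.
ATTRIBUTE_TEMPLATES = {
    "image": re.compile(r'<img src="([^"]+)"'),
    "model": re.compile(r'<span class="model">([^<]+)</span>'),
    "price": re.compile(r'<span class="price">\$([0-9.]+)</span>'),
}


def extract_records(html: str) -> List[Dict[str, str]]:
    """Stage 1 finds records; stage 2 finds attributes within each record."""
    records = []
    for record_match in RECORD_TEMPLATE.finditer(html):
        body = record_match.group(1)
        attributes = {}
        for name, template in ATTRIBUTE_TEMPLATES.items():
            attribute_match = template.search(body)
            if attribute_match:
                attributes[name] = attribute_match.group(1).strip()
        records.append(attributes)
    return records


# Made-up list page with two camera records.
SAMPLE_PAGE = """
<ul>
  <li class="item"><img src="a.jpg"><span class="model">Acme X100</span><span class="price">$199.00</span></li>
  <li class="item"><img src="b.jpg"><span class="model">Acme X200</span><span class="price">$249.50</span></li>
</ul>
"""

if __name__ == "__main__":
    for record in extract_records(SAMPLE_PAGE):
        print(record)
```

In this sketch, a page whose records use a different layout (for example, div elements instead of list items) matches no record template, so no attributes are extracted from it at all. This is one way to see why such systems depend on having a template for each layout.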
A difficulty with these systems is that records are often incorrectly identified. An error in the identification of a record will propagate to the identification of its attributes. As a result, the overall accuracy of attribute identification is limited by the accuracy of the identification of records. Another difficulty with these systems is that they typically do not take into consideration the semantics of the content of a portion of a web page that is identified as a record. A person, in contrast, can easily identify records by factoring in the semantics of their content.
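The propagation of record errors to attributes can be made concrete with a small calculation. The sketch below rests on a simplifying assumption that is not stated in the text: attributes can be correct only when their record was identified correctly, and the two stages err independently. The function name end_to_end_accuracy and the numbers in the example are hypothetical and purely illustrative.

```python
def end_to_end_accuracy(record_accuracy: float, attribute_accuracy: float) -> float:
    """Estimate overall accuracy of a two-stage extraction pipeline.

    Assumes (for illustration only) that attribute identification succeeds
    with probability `attribute_accuracy` given a correctly identified
    record, and always fails when the record itself is wrong.
    """
    for value in (record_accuracy, attribute_accuracy):
        if not 0.0 <= value <= 1.0:
            raise ValueError("accuracies must lie between 0 and 1")
    return record_accuracy * attribute_accuracy


if __name__ == "__main__":
    # Illustrative numbers only: 80% of records right, 90% of attributes
    # right within correctly identified records.
    print(end_to_end_accuracy(0.8, 0.9))
```

Under these assumptions, the result can never exceed the record accuracy, even if attribute identification within a correct record were perfect.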