Web pages accessible via the Internet contain a vast amount of information. A web page may contain information about various types of objects such as products, people, papers, organizations, and so on. For example, one web page may contain a product review of a certain model of camera, and another web page may contain an advertisement offering to sell that model of camera at a certain price. As another example, one web page may contain a journal article, and another web page may be the homepage of an author of the journal article. A person who is searching for information about an object may need information that is contained in different web pages. For example, a person who is interested in purchasing a certain camera may want to read reviews of the camera and to determine who is offering the camera at the lowest price.
To obtain such information, a person would typically use a search engine to find web pages that contain information about the camera. The person would enter a search query that may include the manufacturer and model number of the camera. The search engine then identifies web pages that match the search query and presents those web pages to the person in an order that is based on how relevant the content of each web page is to the search query. The person would then need to view the various web pages to find the desired information. For example, the person may first try to find web pages that contain reviews of the camera. After reading the reviews, the person may then try to locate a web page that contains an advertisement offering the camera at the lowest price.
The person viewing the web pages would typically like to know whether the web pages contain information about the same object. For example, a person would like to know whether a certain product review and a certain product advertisement are for the same object. In the example of a camera, a person would like to know which reviews and advertisements are for the camera of interest. It can, however, be difficult for the person viewing the web pages to determine whether a review and an advertisement are for the same product. In many cases, a web page does not include a unique identifier for the product for which it is providing information. For example, a product review may identify the manufacturer and model of a camera, but not a sub-model number, and an advertisement may identify the manufacturer, but only include a general description of the camera. A person viewing the product review and the advertisement may not be able to ascertain whether they are for the same camera.
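The difficulty described above can be illustrated with a minimal sketch. The record layout, the field names, the function `same_object_by_identifier`, and the sample values ("Acme", "X100") are all hypothetical and chosen only for illustration; they are not taken from any particular system. The sketch shows that a comparison based solely on a complete identifier cannot decide whether a review and an advertisement refer to the same camera when either page omits part of that identifier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageRecord:
    """Hypothetical attributes extracted from one web page."""
    kind: str                          # e.g. "review" or "advertisement"
    manufacturer: Optional[str] = None
    model: Optional[str] = None
    sub_model: Optional[str] = None
    description: Optional[str] = None


def same_object_by_identifier(a: PageRecord, b: PageRecord) -> Optional[bool]:
    """Compare two records on (manufacturer, model, sub_model).

    Returns True or False only when both records carry a complete
    identifier; returns None when the comparison cannot be decided.
    """
    key_a = (a.manufacturer, a.model, a.sub_model)
    key_b = (b.manufacturer, b.model, b.sub_model)
    if None in key_a or None in key_b:
        return None  # at least one page lacks part of the identifier
    return key_a == key_b


# A review that names manufacturer and model but no sub-model number.
review = PageRecord(kind="review", manufacturer="Acme", model="X100")

# An advertisement that names the manufacturer and gives only a description.
advertisement = PageRecord(
    kind="advertisement",
    manufacturer="Acme",
    description="compact digital camera with zoom lens",
)

# The identifier-only comparison is undecided for this pair.
result = same_object_by_identifier(review, advertisement)
```

In this sketch, `result` is `None` because neither record carries a complete identifier, which mirrors the situation of a person who cannot tell from the two pages alone whether they describe the same camera.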
It would be desirable to have a technique that would automatically identify when information on different web pages relates to the same object. The knowledge that different sources of information relate to the same object can be used in many different applications. For example, a search engine may use the knowledge to determine the relevance of the web pages in its search results or to group those web pages. As another example, a shopping portal may use the knowledge to identify the web-based vendor with the lowest purchase price. As another example, a repository of scientific papers may use the knowledge to identify additional information about the authors of the papers.