1. Field
The subject matter disclosed herein relates to data processing.
2. Information
The Internet is a worldwide system of computer networks. The most widely used part of the Internet is the World Wide Web, often abbreviated “WWW” or simply referred to as “the web”. The web is an Internet service that organizes information through the use of hypermedia. HyperText Markup Language (“HTML”) or another like markup language, for example, is typically used to specify the contents and format of an electronic document (e.g., a web page).
Through the use of the web, individuals have access to millions of pages of information. However, a significant drawback of using the web is that, because there is so little organization, it can at times be extremely difficult for users to locate the particular pages that contain the information of interest to them. To address this problem, “search engines” have been developed to index a large number of web pages and to provide an interface that can be used to search the indexed information by entering certain words or phrases to be queried.
Search engines may generally be constructed using several common functions. Typically, each search engine has one or more “web crawlers” (also referred to as “crawlers”, “spiders”, or “robots”) that “crawl” across the Internet in a methodical and automated manner to locate web documents around the world. Upon locating a document, the crawler stores the document's URL and follows any hyperlinks associated with the document to locate other web documents. Also, each search engine may include information extraction and indexing mechanisms that extract and index certain information about the documents that were located by the crawler. In general, index information is generated based on the contents of the HTML file associated with the document. The indexing mechanism stores the index information in large databases that can typically hold an enormous amount of information. Further, each search engine provides a search tool that allows users, through a user interface, to search the databases in order to locate specific documents, and their locations on the web (e.g., URLs), that contain information of interest to them.
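By way of illustration only, the crawl-and-index functions described above may be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular search engine: the `PAGES` dictionary stands in for the web, the `example.test` URLs are hypothetical, and the names `LinkParser` and `crawl` are introduced here for illustration. The crawler records each document's URL, follows its hyperlinks breadth-first, and builds a simple inverted index from the words in each HTML file.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag and the words of the page text."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def crawl(fetch, seed_url):
    """Breadth-first crawl from seed_url.

    `fetch` maps a URL to its HTML, or None if the document is unavailable.
    Returns (list of visited URLs in order, inverted index word -> set of URLs).
    """
    visited, index = [], {}
    seen = {seed_url}
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # unreachable document: nothing to store or index
        visited.append(url)  # store the document's URL
        parser = LinkParser()
        parser.feed(html)
        for word in parser.words:  # index the document's contents
            index.setdefault(word, set()).add(url)
        for href in parser.links:  # follow hyperlinks to other documents
            target = urljoin(url, href)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited, index


# Hypothetical in-memory stand-in for the web (assumption for illustration).
PAGES = {
    "http://example.test/": '<html><body>Home <a href="/a">hotel deals</a> '
                            '<a href="/b">jobs</a></body></html>',
    "http://example.test/a": '<html><body>Hotel rooms <a href="/">home</a></body></html>',
    "http://example.test/b": '<html><body>Jobs board <a href="/a">hotel</a> '
                             '<a href="/missing">x</a></body></html>',
}

visited, index = crawl(PAGES.get, "http://example.test/")
```

In this sketch, a search for a word reduces to a lookup such as `index.get("hotel", set())`, which yields the URLs of the documents containing that word. A production crawler would additionally need to handle network errors, politeness rules, and persistent storage, none of which is shown here.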
With the advent of e-commerce, many web pages are dynamic in their content. Typical examples are products sold at discounted prices that change periodically, or hotel rooms whose rates may change on a seasonal basis. Therefore, it may be desirable to update crawled content on a frequent and near real-time basis.
Information Extraction (IE) systems may be used to gather and manipulate the unstructured and semi-structured information on the web and populate backend databases with structured records. In a website with a reasonable number of pages, information (e.g., products, jobs, etc.) is typically stored in a backend database and is accessed by a set of scripts for presentation of the information to the user. IE systems commonly use extraction templates to facilitate the extraction of desired information from a group of web pages. Generally, an extraction template is based on the general layout of the group of pages for which the corresponding extraction template is defined. Such systems may face difficulties due to the complexity and variability of the large numbers of web pages from which information is to be gathered. Such systems may incur considerable cost, both in terms of computing resources and time. Such systems often require grouping of structurally similar pages within a website, for example, in order to extract certain information more accurately. Also, relatively large expenses may be incurred in some situations by the need for human intervention during the information extraction process.
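By way of illustration only, an extraction template of the kind described above may be sketched as a set of per-field patterns tied to one page layout. This is a minimal sketch under stated assumptions: the HTML snippets, the field names, and the names `PRODUCT_TEMPLATE` and `extract_record` are hypothetical, and regular expressions are used here merely as one simple way to express a layout-based template.

```python
import re

# Hypothetical template for one group of structurally similar product pages.
# Each field maps to a pattern that depends on that group's page layout.
PRODUCT_TEMPLATE = {
    "name": re.compile(r'<h1 class="title">(.*?)</h1>', re.S),
    "price": re.compile(r'<span class="price">\$([0-9]+(?:\.[0-9]{2})?)</span>'),
}


def extract_record(html, template):
    """Apply a template to one page and return a structured record.

    Fields whose pattern does not match the page are set to None.
    """
    record = {}
    for field, pattern in template.items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else None
    return record


# Hypothetical pages: the first two share a layout, the third does not.
pages = [
    '<h1 class="title">Desk Lamp</h1><span class="price">$19.99</span>',
    '<h1 class="title"> Office Chair </h1><span class="price">$120</span>',
    '<div id="name">Bookshelf</div><p>Price: 45 USD</p>',
]

records = [extract_record(page, PRODUCT_TEMPLATE) for page in pages]
```

In this sketch, the first two pages yield complete records, while the third page, which uses a different layout, yields a record whose fields are all `None`. This illustrates why such systems often require structurally similar pages to be grouped before a template is applied.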
With so much information being available and often changing over time, there is a continuing need for methods and apparatuses that allow for certain information to be easily identified and monitored in an efficient manner.