The Internet is a worldwide system of computer networks and is a public, self-sustaining facility that is accessible to tens of millions of people worldwide. The most widely used part of the Internet is the World Wide Web, often abbreviated “WWW” or referred to simply as “the web”. The web is an Internet service that organizes information through the use of hypermedia. The HyperText Markup Language (“HTML”) is typically used to specify the contents and format of a hypermedia document (e.g., a web page).
Through the use of the web, individuals have access to millions of pages of information. However, a significant drawback of using the web is that, because there is so little organization, it can at times be extremely difficult for users to locate the particular pages that contain the information of interest to them. To address this problem, “search engines” have been developed to index a large number of web pages and to provide an interface that can be used to search the indexed information by entering certain words or phrases to be queried.
Search engines may generally be constructed using several common functions. Typically, each search engine has one or more “web crawlers” (also referred to as “crawlers”, “spiders”, or “robots”) that “crawl” across the Internet in a methodical and automated manner to locate web documents around the world. Upon locating a document, the crawler stores the document's URL and follows any hyperlinks associated with the document to locate other web documents. Also, each search engine may include information extraction and indexing mechanisms that extract and index certain information about the documents that were located by the crawler. In general, index information is generated based on the contents of the HTML file associated with the document. The indexing mechanism stores the index information in large databases that can typically hold an enormous amount of information. Further, each search engine provides a search tool that allows users, through a user interface, to search the databases in order to locate specific documents, and their locations on the web (e.g., URLs), that contain information of interest to them.
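By way of illustration only, the crawl, index, and search functions described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of any particular search engine: the `PAGES` dictionary stands in for the web (a real crawler would fetch documents over HTTP), the `example.test` URLs are made up, and the names `LinkAndTextParser`, `crawl_and_index`, and `search` are hypothetical.

```python
from collections import defaultdict, deque
from html.parser import HTMLParser

# Hypothetical in-memory "web": URL -> HTML source (assumption for illustration).
PAGES = {
    "http://example.test/": (
        '<html><body>Home page '
        '<a href="http://example.test/hotels">hotels</a></body></html>'
    ),
    "http://example.test/hotels": (
        '<html><body>Hotel rooms at seasonal rates '
        '<a href="http://example.test/">home</a></body></html>'
    ),
}


class LinkAndTextParser(HTMLParser):
    """Collects hyperlink targets and lowercased words from one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        self.words.extend(word.lower() for word in data.split())


def crawl_and_index(seed, pages):
    """Breadth-first crawl from `seed`; returns an inverted index word -> set of URLs."""
    index = defaultdict(set)
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # document not reachable in this toy "web"
            continue
        parser = LinkAndTextParser()
        parser.feed(html)
        for word in parser.words:  # indexing step
            index[word].add(url)
        for link in parser.links:  # follow hyperlinks to other documents
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


def search(index, query):
    """Returns the URLs whose indexed text contains every word of `query`."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


index = crawl_and_index("http://example.test/", PAGES)
hotel_hits = search(index, "hotel rooms")
```

In this sketch the index is an in-memory dictionary rather than a large database, and a query matches only documents containing all of its words; production systems typically add ranking, politeness rules, and persistent storage.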
With the advent of e-commerce, many web pages are dynamic in their content. Typical examples are products sold at discounted prices that change periodically, or hotels that change their room rates on a seasonal basis. Therefore, it may be desirable to update crawled content on a frequent, near real-time basis.
Information Extraction (IE) systems may be used to gather and manipulate the unstructured and semi-structured information on the web and to populate backend databases with structured records. In a website with a reasonable number of pages, information (e.g., products, jobs, etc.) is typically stored in a backend database and is accessed by a set of scripts that present the information to the user. IE systems commonly use extraction templates to facilitate the extraction of desired information from a group of web pages. Generally, an extraction template is based on the general layout of the group of pages for which the template is defined. Such systems may face difficulties due to the complexity and variability of the large numbers of web pages from which information is to be gathered. Such systems may also be costly in terms of both computing resources and time. In addition, the need for human intervention during the information extraction process may incur relatively large expenses in some situations.
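As a purely illustrative sketch of the template idea described above, the following Python fragment treats an extraction template as a set of per-field patterns tied to one page layout and applies it to several pages to produce structured records. The markup, the field names, and the names `PRODUCT_TEMPLATE` and `extract_record` are assumptions made for this example; regular expressions are used only for brevity and are not asserted to be how any particular IE system defines its templates.

```python
import re

# Hypothetical template for pages sharing one layout:
# field name -> pattern with a single capture group (assumed markup).
PRODUCT_TEMPLATE = {
    "name": re.compile(r'<h1 class="title">(.*?)</h1>', re.S),
    "price": re.compile(r'<span class="price">\$([\d.]+)</span>'),
}


def extract_record(html, template):
    """Applies each field pattern to `html`; unmatched fields are set to None."""
    record = {}
    for field, pattern in template.items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else None
    return record


# Made-up sample pages: two follow the layout, the third does not.
sample_pages = [
    '<h1 class="title">Desk Lamp</h1><span class="price">$19.99</span>',
    '<h1 class="title"> Office Chair </h1><span class="price">$89.00</span>',
    '<h2>Layout changed</h2>',
]

records = [extract_record(page, PRODUCT_TEMPLATE) for page in sample_pages]
```

The third sample page shows the variability problem noted above: once a page departs from the layout the template was defined for, every field comes back empty, and the template must be repaired or redefined.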
Reference is made in the following detailed description to the accompanying drawings, which form a part hereof, wherein like numerals may designate like parts throughout to indicate corresponding or analogous elements. It will be appreciated that for simplicity and/or clarity of illustration, elements illustrated in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, it is to be understood that other embodiments may be utilized and structural and/or logical changes may be made without departing from the scope of claimed subject matter. It should also be noted that directions and references, for example, up, down, top, bottom, and so on, may be used to facilitate the discussion of the drawings and are not intended to restrict the application of claimed subject matter. Therefore, the following detailed description is not to be taken in a limiting sense, and the scope of claimed subject matter is defined by the appended claims and their equivalents.