1. World Wide Web—General
The Internet is a worldwide system of computer networks and is a public, self-sustaining facility that is accessible to tens of millions of people worldwide. The most widely used part of the Internet is the World Wide Web, often abbreviated “WWW” or referred to simply as “the web”. The web is an Internet service that organizes information through the use of hypermedia. The HyperText Markup Language (“HTML”) is typically used to specify the contents and format of a hypermedia document (e.g., a web page).
In this context, an HTML file is a file that contains source code for a particular web page. Typically, an HTML document includes one or more pre-defined HTML tags and their properties, and text enclosed between the tags. A web page is the image or collection of images that is displayed to a user when a particular HTML file is rendered by a browser application program. Unless specifically stated, an electronic or web document may refer to either the source code for a particular web page or the web page itself. Each page can contain embedded references to images, audio, video or other web documents. The most common type of reference used to identify and locate resources on the Internet is the Uniform Resource Locator, or URL. In the context of the web, a user, using a web browser, browses for information by following references that are embedded in each of the documents. The HyperText Transfer Protocol (“HTTP”) is the protocol used to access a web document, and references that are based on HTTP are referred to as hyperlinks (formerly, “hypertext links”).
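The embedded references described above can be illustrated with a minimal sketch that collects the URLs referenced by a page's HTML source. The class name LinkCollector, the function extract_references, and the example.com URLs are hypothetical names introduced only for this illustration; the sketch assumes that only <a href> and <img src> references are of interest.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the targets of <a href="..."> and <img src="..."> references."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.references = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page's own URL.
                self.references.append(urljoin(self.base_url, value))


def extract_references(html_source, base_url):
    """Return the absolute URLs referenced by the given HTML source."""
    collector = LinkCollector(base_url)
    collector.feed(html_source)
    collector.close()
    return collector.references


# Hypothetical page used only for illustration.
PAGE = '<html><body><a href="/about.html">About</a><img src="logo.png"></body></html>'
print(extract_references(PAGE, "http://example.com/index.html"))
```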
2. Search Engines
Through the use of the web, individuals have access to millions of pages of information. However, a significant drawback of using the web is that, because the web has so little organization, it can at times be extremely difficult for users to locate the particular pages that contain the information that is of interest to them. To address this problem, a mechanism known as a “search engine” has been developed to index a large number of web pages and to provide an interface that can be used to search the indexed information by entering certain words or phrases to be queried. These search terms are often referred to as “keywords”.
Indexes used by search engines are conceptually similar to the normal indexes that are typically found at the end of a book, in that both kinds of indexes comprise an ordered list of information accompanied by the location of the information. An “index word set” of a document is the set of words that are mapped to the document in an index. For example, an index word set of a web page is the set of words that are mapped to the web page in an index. For documents that are not indexed, the index word set is empty.
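As a minimal sketch of the index and index word set described above, the following builds an ordered word-to-location mapping over a small in-memory collection. The function names, the tokenization rule, and the example documents are assumptions made for illustration and do not describe any particular search engine.

```python
from collections import defaultdict
import re


def tokenize(text):
    """Split text into lowercase alphanumeric words (an assumed, simple rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the sorted list of document locations that contain it."""
    postings = defaultdict(set)
    for location, text in documents.items():
        for word in tokenize(text):
            postings[word].add(location)
    # Order the words, as in the index at the end of a book.
    return {word: sorted(locations) for word, locations in sorted(postings.items())}


def index_word_set(index, location):
    """Return the set of words that the index maps to the given document."""
    return {word for word, locations in index.items() if location in locations}


# Hypothetical documents keyed by their location (URL).
DOCUMENTS = {
    "http://example.com/a": "Red shoes on sale",
    "http://example.com/b": "Blue shoes",
}
INDEX = build_index(DOCUMENTS)

print(INDEX["shoes"])
print(index_word_set(INDEX, "http://example.com/b"))
# A document that was never indexed has an empty index word set.
print(index_word_set(INDEX, "http://example.com/unindexed"))
```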
Although there are many popular Internet search engines, they are generally constructed using the same three common parts. First, each search engine has at least one, and typically more than one, “web crawler” (also referred to as a “crawler”, “spider”, or “robot”) that “crawls” across the Internet in a methodical and automated manner to locate web documents around the world. Upon locating a document, the crawler stores the document's URL, and follows any hyperlinks associated with the document to locate other web documents. Second, each search engine contains information extraction and indexing mechanisms that extract and index certain information about the documents that were located by the crawler. In general, index information is generated based on the contents of the HTML file associated with the document. The indexing mechanism stores the index information in large databases that can typically hold an enormous amount of information. Third, each search engine provides a search tool that allows users, through a user interface, to search the databases in order to locate specific documents, and their location on the web (e.g., a URL), that contain information that is of interest to them.
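The crawling step described above, storing each located URL and following its hyperlinks, can be sketched as a breadth-first traversal. This is only an illustration: fetch_links is a hypothetical callable that returns the hyperlinks of a document, and TOY_WEB is a made-up in-memory link graph that stands in for real HTTP requests.

```python
from collections import deque


def crawl(seed_url, fetch_links, max_pages=100):
    """Visit documents breadth-first, starting from seed_url.

    fetch_links(url) is assumed to return the hyperlinks found in the
    document at url. Returns the URLs in the order they were visited.
    """
    seen = {seed_url}
    queue = deque([seed_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # store the located document's URL
        for link in fetch_links(url):
            if link not in seen:  # avoid revisiting the same document
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical link graph: each URL maps to the hyperlinks on that page.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": ["http://example.com/c"],
    "http://example.com/c": [],
}


def toy_fetch_links(url):
    """Look up hyperlinks in the toy graph instead of fetching over HTTP."""
    return TOY_WEB.get(url, [])


print(crawl("http://example.com/", toy_fetch_links))
```

The seen set keeps the traversal from looping when pages link back to one another, and max_pages bounds the work done in a single run.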
The search engine provides an interface that allows users to specify their search criteria (e.g., keywords) and, after a search is performed, an interface for displaying the search results. Typically, the search engine orders the search results prior to presenting the search results interface to the user. The order usually takes the form of a “ranking”, where the document with the highest ranking is the document considered most likely to satisfy the interest reflected in the search criteria specified by the user. Once the matching documents and their display order have been determined, the search engine sends to the user that issued the search a “search results page” that presents information about the matching documents in the selected display order.
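The ordering of search results can be illustrated with a deliberately simple sketch. Scoring a document by how often the query keywords occur in it is an assumption made here for illustration; the text above does not specify how a ranking is computed, and the function names and documents are hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric words (an assumed, simple rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(documents, query):
    """Return the locations of matching documents, highest ranking first.

    The score is the total number of keyword occurrences, which is an
    illustrative stand-in for a real ranking function.
    """
    terms = tokenize(query)
    scored = []
    for location, text in documents.items():
        words = tokenize(text)
        score = sum(words.count(term) for term in terms)
        if score > 0:  # only matching documents appear in the results
            scored.append((score, location))
    # Highest score first; ties are broken by location for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [location for _, location in scored]


# Hypothetical documents keyed by URL.
PAGES = {
    "http://example.com/a": "cheap hotel rooms, hotel deals",
    "http://example.com/b": "hotel reviews",
    "http://example.com/c": "train timetable",
}
print(rank(PAGES, "hotel deals"))
```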
3. Structure of Web Pages
The Internet today has an abundance of data presented in HTML pages. However, it is still an arduous task to separate the informative content from all the other content. Many online merchants present their goods and services in a semi-structured format, using scripts to generate a uniform look-and-feel template and to present the information at strategic locations in the template. Identifying such positions on a page and extracting and indexing relevant information is key to the success of any data-centric application like search.
With the advent of e-commerce, most webpages are now dynamic in their content. Typical examples are products sold at discounted prices that keep changing on sites between Thanksgiving and Christmas every year, or hotel rooms whose rates change on a seasonal basis. With advertisement and user services critical for business success, it is imperative that crawled content be updated on a frequent, near real-time basis.
These examples show that on the Web, especially on large sites, webpages are generated dynamically through scripts that place the data elements from a database in appropriate positions using a defined template. By understanding these templates, one could separate out the more useful information on the pages from the text put in by the script as part of the template.
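One simple way to picture this separation is to compare pages generated by the same script: text that appears on every page is treated as template text, and the remainder is treated as data. The sketch below rests on that assumed heuristic, which is not a method stated above; the page contents and function names are hypothetical.

```python
from html.parser import HTMLParser


class TextSegments(HTMLParser):
    """Collects the non-empty text segments found between tags."""

    def __init__(self):
        super().__init__()
        self.segments = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.segments.append(text)


def text_segments(html_source):
    parser = TextSegments()
    parser.feed(html_source)
    parser.close()
    return parser.segments


def split_template_and_data(pages):
    """Split text shared by all pages (template) from page-specific text (data).

    Expects at least one page. Returns (template_set, data_per_page).
    """
    per_page = [text_segments(page) for page in pages]
    template = set(per_page[0]).intersection(*map(set, per_page[1:]))
    data = [[s for s in segments if s not in template] for segments in per_page]
    return template, data


# Two hypothetical product pages generated from the same template.
PAGE_1 = "<html><body><h1>Acme Store</h1><p>Item:</p><p>Kettle</p><p>Price:</p><p>$10</p></body></html>"
PAGE_2 = "<html><body><h1>Acme Store</h1><p>Item:</p><p>Toaster</p><p>Price:</p><p>$25</p></body></html>"

template, data = split_template_and_data([PAGE_1, PAGE_2])
print(sorted(template))
print(data)
```

A value that happens to be identical on every page (for example, the same price) would be misclassified as template text under this heuristic, which is one reason practical systems rely on more than exact text equality.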
4. Information Extraction Systems
Information Extraction (IE) systems are used to gather and manipulate the unstructured and semi-structured information on the web and to populate backend databases with structured records. Most IE systems are either rule-based (i.e., heuristic-based) extraction systems or automated extraction systems. In a website with a reasonable number of pages, information (e.g., products, jobs, etc.) is typically stored in a backend database and is accessed by a set of scripts for presentation of the information to the user.
IE systems commonly use extraction templates to facilitate the extraction of desired information from a group of web pages. Generally, an extraction template is based on the general layout of the group of pages for which the corresponding extraction template is defined. One technique used for generating extraction templates is referred to as “template induction”, which automatically constructs templates (i.e., customized procedures for information extraction) from labeled examples of a page's content.
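As a rough illustration of template induction from a labeled example, the sketch below turns one labeled page into a regular expression in which each labeled value becomes a named capture group, and then applies that expression to another page with the same layout. The function names and page markup are hypothetical, and the approach assumes that each labeled value first occurs at its labeled position and that pages in the group differ only in those values.

```python
import re


def induce_template(page, labeled_fields):
    """Build an extraction template (a compiled regex) from one labeled page.

    labeled_fields maps a field name (a valid Python identifier) to the exact
    text of that field as it appears in page.
    """
    spans = []
    for name, value in labeled_fields.items():
        start = page.find(value)
        if start < 0:
            raise ValueError(f"labeled value for {name!r} not found in page")
        spans.append((start, start + len(value), name))
    spans.sort()

    pattern = []
    cursor = 0
    for start, end, name in spans:
        if start < cursor:
            raise ValueError("labeled values overlap")
        # Text between labeled values is fixed template text.
        pattern.append(re.escape(page[cursor:start]))
        # Each labeled value becomes a named slot to be filled per page.
        pattern.append(f"(?P<{name}>.+?)")
        cursor = end
    pattern.append(re.escape(page[cursor:]))
    return re.compile("".join(pattern), re.DOTALL)


def extract(template, page):
    """Apply the template; return a field dictionary, or None if no match."""
    match = template.fullmatch(page)
    return match.groupdict() if match else None


# Hypothetical labeled example and a second page with the same layout.
EXAMPLE_PAGE = '<h1>Kettle</h1><span class="price">$10</span>'
LABELS = {"title": "Kettle", "price": "$10"}
OTHER_PAGE = '<h1>Toaster</h1><span class="price">$25</span>'

TEMPLATE = induce_template(EXAMPLE_PAGE, LABELS)
print(extract(TEMPLATE, OTHER_PAGE))
```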
While an example has been provided of using templates to extract information from web pages, templates can also be used to extract information from electronic documents whose structure is not HTML. For example, templates can be used to extract information from documents structured in accordance with XML (eXtensible Markup Language).
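For an XML document, an extraction template can be pictured as a mapping from field names to element paths. The sketch below is an assumption-laden illustration: the catalog markup, the XML_TEMPLATE mapping, and the function extract_records are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: field name -> path relative to each record element.
XML_TEMPLATE = {"title": "./name", "price": "./price"}


def extract_records(xml_source, record_path, template):
    """Return one dictionary per record element found at record_path."""
    root = ET.fromstring(xml_source)
    records = []
    for node in root.findall(record_path):
        # findtext returns None when a field is missing from a record.
        records.append({field: node.findtext(path) for field, path in template.items()})
    return records


# Hypothetical XML document used only for illustration.
CATALOG = """<catalog>
  <product><name>Kettle</name><price>10.00</price></product>
  <product><name>Toaster</name><price>25.00</price></product>
</catalog>"""

print(extract_records(CATALOG, "./product", XML_TEMPLATE))
```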
Any approaches that may be described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.