World Wide Web—General
The Internet is a worldwide system of computer networks and is a public, self-sustaining facility that is accessible to tens of millions of people. The most widely used part of the Internet is the World Wide Web, often abbreviated “WWW” or referred to simply as “the web”. The web is an Internet service that organizes information through the use of hypermedia. The HyperText Markup Language (“HTML”) is typically used to specify the contents and format of a hypermedia document (e.g., a web page).
In this context, an HTML file is a file that contains the source code for a particular web page. A web page is the image or collection of images that is displayed to a user when a particular HTML file is rendered by a browser application program. Unless specifically stated, an electronic or web document may refer to either the source code for a particular web page or the web page itself. Each page can contain embedded references to images, audio, video or other web documents. The most common type of reference used to identify and locate resources on the Internet is the Uniform Resource Locator, or URL. In the context of the web, a user with a web browser browses for information by following the references that are embedded in each of the documents. The HyperText Transfer Protocol (“HTTP”) is the protocol used to access a web document, and references that are based on HTTP are referred to as hyperlinks (formerly, “hypertext links”).
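By way of illustration only, the embedded references in a page can be collected by parsing the HTML source and resolving each reference against the URL of the page. The following is a minimal sketch in Python; the names `ReferenceExtractor` and `extract_references`, and the choice of which tags to inspect, are assumptions made for the example and are not taken from any particular browser or crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Illustrative choice of tags: hyperlinks carry "href", media carry "src".
HREF_TAGS = {"a"}
SRC_TAGS = {"img", "audio", "video"}


class ReferenceExtractor(HTMLParser):
    """Collects absolute URLs for the references embedded in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.references = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in HREF_TAGS:
            target = attrs.get("href")
        elif tag in SRC_TAGS:
            target = attrs.get("src")
        else:
            target = None
        if target:
            # Relative references are resolved against the page's own URL.
            self.references.append(urljoin(self.base_url, target))


def extract_references(page_html, base_url):
    """Return the embedded references of a page, in document order."""
    parser = ReferenceExtractor(base_url)
    parser.feed(page_html)
    return parser.references
```

For example, given a page at `http://example.com/index.html` that contains `<a href="/jobs">` and `<img src="logo.png">`, the function returns `http://example.com/jobs` and `http://example.com/logo.png`.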
Static web content generally refers to web content that is fixed and not capable of action or change. A web site that is static can only supply information that is written into the HTML source code and this information will not change unless the change is written into the source code. When a web browser requests the specific static web page, a server returns the page to the browser and the user only gets whatever information is contained in the HTML code. In contrast, a dynamic web page contains dynamically-generated content that is returned by a server based on a user's request, such as information that is stored in a database associated with the server. The user can request that information be retrieved from a database based on user input parameters.
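The distinction can be sketched as follows, purely as an illustration. The page contents, the `JOBS_DB` list standing in for a database, and the function names `serve_static` and `serve_dynamic` are all hypothetical.

```python
from html import escape

# A static page: the response is fixed text written into the source.
STATIC_PAGE = "<html><body><h1>Office hours</h1><p>9am to 5pm</p></body></html>"

# Hypothetical stand-in for a database associated with the server.
JOBS_DB = [
    {"title": "Nurse", "city": "Austin"},
    {"title": "Engineer", "city": "Austin"},
    {"title": "Teacher", "city": "Boston"},
]


def serve_static():
    """Every request receives exactly the same HTML."""
    return STATIC_PAGE


def serve_dynamic(params, db=JOBS_DB):
    """The HTML is generated per request from user input and stored data."""
    city = params.get("city", "").lower()
    matches = [job for job in db if job["city"].lower() == city]
    items = "".join(f"<li>{escape(job['title'])}</li>" for job in matches)
    return f"<html><body><ul>{items}</ul></body></html>"
```

Here `serve_dynamic({"city": "Austin"})` and `serve_dynamic({"city": "Boston"})` return different pages, whereas `serve_static()` returns the same page on every call. A crawler that never supplies the `city` parameter never sees the listings.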
The most common mechanisms for providing input for a dynamic web page in order to retrieve dynamic web content are HTML forms and JavaScript links. HTML forms are described in Section 17 (entitled “Forms”) of the W3C Recommendation entitled “HTML 4.01 Specification”, available from the W3C® organization; the content of which is incorporated by this reference in its entirety for all purposes as if fully disclosed herein.
Search Engines
Through the use of the web, individuals have access to millions of pages of information. However, a significant drawback of using the web is that, because there is so little organization to the web, it can at times be extremely difficult for users to locate the particular pages that contain the information that is of interest to them. To address this problem, a mechanism known as a “search engine” has been developed to index a large number of web pages and to provide an interface that can be used to search the indexed information by entering certain words or phrases to be queried. These search terms are often referred to as “keywords”.
Indexes used by search engines are conceptually similar to the normal indexes that are typically found at the end of a book, in that both kinds of indexes comprise an ordered list of information accompanied by the location of the information. An “index word set” of a document is the set of words that are mapped to the document in an index. For example, an index word set of a web page is the set of words that are mapped to the web page in an index. For documents that are not indexed, the index word set is empty.
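As a minimal sketch of these two notions, assuming a simple word-to-URL mapping and a naive tokenizer (both chosen only for illustration, as are the function names), an index and the index word set of a document might be computed as follows.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to the set of document URLs in which it appears.

    `documents` is a dict of URL -> plain text. Tokenization here is a
    simplifying assumption: lowercase runs of letters and digits.
    """
    index = defaultdict(set)
    for url, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def index_word_set(index, url):
    """Return the set of words that the index maps to the given document."""
    return {word for word, urls in index.items() if url in urls}
```

With documents `{"http://a.example/": "Web crawler", "http://b.example/": "web index"}`, the index word set of `http://a.example/` is `{"web", "crawler"}`, and the index word set of any URL that was never indexed is the empty set.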
Although there are many popular Internet search engines, they are generally constructed using the same three common parts. First, each search engine has at least one, but typically more than one, “web crawler” (also referred to as a “crawler”, “spider”, or “robot”) that “crawls” across the Internet in a methodical and automated manner to locate web documents around the world. Upon locating a document, the crawler stores the document's URL, and follows any hyperlinks associated with the document to locate other web documents. Second, each search engine contains an indexing mechanism that indexes certain information about the documents that were located by the crawler. In general, index information is generated based on the contents of the HTML file associated with the document. The indexing mechanism stores the index information in large databases that can typically hold an enormous amount of information. Third, each search engine provides a search tool that allows users, through a user interface, to search the databases in order to locate specific documents that contain information that is of interest to them, along with the locations of those documents on the web (e.g., URLs).
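The crawling step described above can be sketched as a breadth-first traversal. The sketch below is illustrative only: the `fetch` callable is a hypothetical stand-in for an HTTP client (it returns the HTML for a URL, or `None` if the document cannot be retrieved), and `FAKE_WEB` is invented data so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefParser(HTMLParser):
    """Collects the raw href values of anchor tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(seed_url, fetch, max_pages=100):
    """Visit pages breadth-first from seed_url; return {url: html}."""
    seen = {seed_url}
    queue = deque([seed_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        page_html = fetch(url)
        if page_html is None:
            continue  # document could not be retrieved
        pages[url] = page_html  # store the located document under its URL
        parser = HrefParser()
        parser.feed(page_html)
        for href in parser.hrefs:
            link = urljoin(url, href)
            if link not in seen:  # follow each hyperlink at most once
                seen.add(link)
                queue.append(link)
    return pages


# Invented in-memory "web" used in place of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a><a href="/b">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
    "http://example.com/b": '<a href="/missing">gone</a>',
}

if __name__ == "__main__":
    for url in crawl("http://example.com/", FAKE_WEB.get):
        print(url)
```

The stored pages would then be handed to the indexing mechanism. Note that this traversal only follows links that appear literally in the HTML source.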
The search engine interface allows users to specify their search criteria (e.g., keywords) and, after a search has been performed, displays the search results. Typically, the search engine orders the search results prior to presenting them to the user. The order usually takes the form of a “ranking”, where the document with the highest ranking is the document considered most likely to satisfy the interest reflected in the search criteria specified by the user. Once the matching documents have been determined, and the display order of those documents has been determined, the search engine sends to the user that issued the search a “search results page” that presents information about the matching documents in the selected display order.
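The ordering step can be illustrated with a deliberately simple scoring rule. The sketch below scores each document by how many of its words match the keywords; this rule is an assumption made for the example and is not the ranking method of any actual search engine, which would typically weigh many more signals.

```python
import re


def tokenize(text):
    """Lowercase runs of letters and digits (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return URLs of matching documents, highest score first.

    `documents` is a dict of URL -> plain text. The score is the number
    of word occurrences in the document that match a query keyword.
    Ties are broken by URL so that the order is deterministic.
    """
    keywords = set(tokenize(query))
    scored = []
    for url, text in documents.items():
        score = sum(1 for word in tokenize(text) if word in keywords)
        if score > 0:  # documents with no matching keyword are omitted
            scored.append((score, url))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]
```

The returned list corresponds to the display order of the search results page.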
Web Crawlers
There are many web crawlers that crawl and store content from the web. The web is becoming more dynamic by the day, and a larger share of the content is only accessible from behind Flash (a vector-graphic animation technology), HTML forms, JavaScript links, etc. There is no readily available technique for a crawler to get past HTML forms, which are meant primarily for real users, and JavaScript content, which is written with browsers in mind, in order to access the dynamic web content behind the HTML forms and JavaScript. Consequently, a basic crawler gets only the static content of the web, but fails to crawl dynamic content, also referred to as the “deep web” or the “invisible web”.
For domain-specific crawlers (also referred to as “vertical crawlers”) to access dynamic content, the crawlers typically must have some mechanism to fill out forms and follow JavaScript links. For instance, in the jobs domain, most job postings are requested by submitting HTML forms. Possible approaches to identifying and submitting forms, for vertically crawling a given web site, include manual approaches, in which a human supplies the information for the crawler to use to fill in the forms used by the web site. The human examines each web site that requires form-filling, and provides information in a script or configuration file, instructing the crawler how to fill each form on the site. Manual approaches are labor intensive and not easily scalable.
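A manual approach of this kind might look like the following sketch, in which a human-written, site-specific configuration supplies the values for a form found on a page. The configuration layout (`SITE_CONFIG`), the example URL, and the names `FormParser` and `build_submission` are hypothetical. Only `<input>` fields of the first form are handled, which is a simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Records the action, method, and default inputs of the first form."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Keep defaults (e.g., hidden fields) written into the page.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


# Hypothetical human-supplied configuration: page URL -> field values.
SITE_CONFIG = {
    "http://jobs.example.com/search": {"keywords": "nurse", "location": "Austin"},
}


def build_submission(page_url, page_html, site_config=SITE_CONFIG):
    """Return (url, body) for submitting the page's first form."""
    parser = FormParser()
    parser.feed(page_html)
    values = dict(parser.fields)
    values.update(site_config.get(page_url, {}))  # configured values win
    target = urljoin(page_url, parser.action)
    encoded = urlencode(values)
    if parser.method == "post":
        return target, encoded
    return target + "?" + encoded, None
```

Because `SITE_CONFIG` names specific field names for a specific page, it must be written per site and revisited whenever the form changes.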
Some shortcomings associated with the manual approaches are as follows.
(1) Scripts and configuration files are site specific. It would be a complex and expensive manual process to write these configurations for all the domain-specific sites on the web.
(2) Configuration files have to be rewritten if the site structure changes. That is, if the HTML form is changed or the web site itself has changed, these configuration files must be manually rewritten.
(3) JavaScript-based forms or links are extremely difficult to manually identify. For some web sites, execution of JavaScript functions is necessary to submit forms or to generate the link to the next page. These JavaScript functions could involve a considerable amount of code, which makes it difficult to manually identify and interpret the code for the form submission logic.
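The difficulty noted in item (3) can be made concrete with a short sketch that separates links a crawler can follow directly from links whose target is only produced by running script code. The class and function names, and the rule used to detect script-dependent links (a `javascript:` URL or an `onclick` attribute), are assumptions for illustration and do not cover every way a page can generate links.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkClassifier(HTMLParser):
    """Splits anchors into plain hyperlinks and script-dependent links."""

    def __init__(self):
        super().__init__()
        self.plain_links = []
        self.script_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        href = attrs.get("href") or ""
        is_script_url = urlparse(href).scheme.lower() == "javascript"
        if is_script_url or "onclick" in attrs:
            # Record the script fragment; the real target is unknown
            # until the function is executed.
            self.script_links.append(attrs.get("onclick") or href)
        elif href:
            self.plain_links.append(href)


def classify_links(page_html):
    """Return (plain_links, script_links) for one page."""
    parser = LinkClassifier()
    parser.feed(page_html)
    return parser.plain_links, parser.script_links
```

For the links in `script_links`, the destination URL does not appear in the HTML source at all, so neither a basic crawler nor a human reading the markup can determine it without interpreting the referenced functions.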
Based on the foregoing, there is a need for improved techniques for crawling dynamic web content.
Any approaches that may be described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.