The Internet is a worldwide system of computer networks and a public, self-sustaining facility that is accessible to tens of millions of people. The most widely used part of the Internet is the World Wide Web, often abbreviated “WWW” or simply referred to as “the Web.” The Web organizes information through the use of hypermedia. The HyperText Markup Language (“HTML”) is typically used to specify the contents and format of a hypermedia document (e.g., a web page).
A web page is the image or collection of images that is displayed to a user when the web page's HTML file is rendered by a browser application program. Each web page can contain embedded references to resources such as images, audio, video, documents, or other web pages. On the Web, the most common type of reference used to identify and locate resources is the Uniform Resource Locator, or URL. A user of a web browser can reach the resources referenced in the web page being browsed by selecting “hyperlinks” or “links” on the web page, which identify the resources by their URLs.
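As a minimal sketch of how embedded references can be read out of a page's HTML, the following Python example collects the URLs found in anchor and image tags and resolves them against the page's own URL. The class name `LinkExtractor`, the function `extract_links`, and the sample markup are assumptions introduced for illustration only; real pages reference resources through many other tags and attributes.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs of resources referenced by a page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only two reference types are handled in this sketch:
        # <a href="..."> for hyperlinks and <img src="..."> for images.
        wanted = {"a": "href", "img": "src"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative references against the page's URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the referenced URLs in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Illustrative markup, not taken from any real site.
SAMPLE_HTML = (
    '<html><body><a href="/jobs/1">Job</a>'
    '<img src="logo.png">'
    '<a href="http://other.example/x">X</a></body></html>'
)
```

Calling `extract_links(SAMPLE_HTML, "http://example.com/index.html")` yields one absolute URL per reference, which is the form a browser or crawler would use to retrieve each resource.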
A web page can be static or dynamic. Static web content generally refers to web content that is fixed and not capable of action or change. A static web site can supply only the information that is written into its HTML source code, and this information does not change unless the source code itself is changed. In contrast, a dynamic web page contains dynamically generated content that is returned by a server based on a user's request, such as information that is stored in a database associated with the server. The user can request that information be retrieved from a database based on user input parameters. The most common mechanisms for providing input for a dynamic web page in order to retrieve dynamic web content are HTML forms and JavaScript links.
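The relationship between form input and dynamically generated content can be sketched as follows. This is an illustration under stated assumptions, not a description of any particular server: the in-memory `JOB_DATABASE`, the functions `build_form_url` and `render_dynamic_page`, the `location` field, and the `jobs.example` host are all hypothetical, and the form is assumed to be submitted with the GET method.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical database records associated with the server.
JOB_DATABASE = [
    {"title": "Software Engineer", "location": "Boston"},
    {"title": "Data Analyst", "location": "Boston"},
    {"title": "Software Engineer", "location": "Denver"},
]


def build_form_url(action_url, fields):
    """Encode form fields into a GET request URL, as a browser would on submit."""
    return action_url + "?" + urlencode(fields)


def render_dynamic_page(request_url, database):
    """Generate HTML at request time from the query parameters in the URL."""
    params = parse_qs(urlsplit(request_url).query)
    location = params.get("location", [""])[0]
    matches = [job for job in database if job["location"] == location]
    items = "".join(f"<li>{job['title']}</li>" for job in matches)
    return f"<html><body><ul>{items}</ul></body></html>"


# The same page template returns different content for different input.
url = build_form_url("http://jobs.example/search", {"location": "Boston"})
page = render_dynamic_page(url, JOB_DATABASE)
```

A static page, by contrast, would return the same HTML regardless of any parameters in the request.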
Because the Web provides access to millions of pages of information that are often poorly organized, it can be difficult for users to locate the particular web pages that contain the information of interest to them. To address this problem, a mechanism known as a “search engine” has been developed to index a large number of web pages and to provide an interface for searching the indexed information by entering words or phrases (keywords) as a query. Although there are many popular Internet search engines, they generally include a “web crawler” (also referred to as a “crawler”, “spider”, or “robot”) that “crawls” across the Internet in a methodical and automated manner to locate web pages around the world.
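The indexing and keyword lookup described above can be illustrated with a small inverted index. This is a simplified sketch, assuming whitespace tokenization and a query that must match every keyword; the function names `build_index` and `search` and the sample pages are hypothetical, and production search engines use far more elaborate indexing and ranking.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every keyword in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Copy the first posting set so the index itself is never modified.
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical crawled pages: URL -> extracted text.
PAGES = {
    "http://a.example/jobs": "software engineer jobs in Boston",
    "http://a.example/news": "company news and press releases",
    "http://b.example/careers": "engineer careers and jobs",
}
INDEX = build_index(PAGES)
```

With this index, `search(INDEX, "engineer jobs")` returns the two pages whose text contains both keywords.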
There are two common types of “crawling”. In free crawling, when a crawler locates a document, the crawler stores the document and the document's URL, and follows any and all links embedded within that document to locate other web pages. In focused crawling, the crawler tries to crawl only those web pages which contain a specific type of content, or “relevant” web pages.
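The two crawling modes can be contrasted with a short sketch that runs over a small in-memory link graph rather than the live Web. Everything named here is an assumption made for illustration: the `WEB` dictionary, the `crawl` function, and the keyword test `mentions_jobs` used as a stand-in for a relevance check. The focused mode shown is one simplified policy, in which an irrelevant page is neither stored nor used as a source of further links.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
WEB = {
    "a.example/": ("jobs portal", ["a.example/jobs", "a.example/about"]),
    "a.example/jobs": ("jobs: engineer, analyst", ["a.example/jobs/1"]),
    "a.example/jobs/1": ("jobs: engineer wanted", []),
    "a.example/about": ("company history", ["a.example/team"]),
    "a.example/team": ("meet the team", []),
}


def crawl(web, seed, is_relevant=None):
    """Breadth-first crawl starting at seed.

    With is_relevant=None this is a free crawl: every page is stored and
    every link is followed. With a predicate it is a focused crawl: pages
    that fail the test are not stored and their links are not followed.
    """
    stored = {}
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = web[url]
        if is_relevant is not None and not is_relevant(text):
            continue  # focused mode: discard the page and ignore its links
        stored[url] = text
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return stored


def mentions_jobs(text):
    """Toy relevance test: a page is relevant if its text mentions jobs."""
    return "jobs" in text


free_result = crawl(WEB, "a.example/")
focused_result = crawl(WEB, "a.example/", is_relevant=mentions_jobs)
```

In this graph the free crawl stores all five pages, while the focused crawl stores only the three pages that pass the relevance test.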
Although various methods exist for focused crawling, such as the techniques described in U.S. Pat. No. 6,418,433 (“System and Method for Focused Web Crawling”), the crawler may still crawl many irrelevant web pages or miss relevant web pages for a variety of reasons. One reason is that there is a great amount of diversity and variation among web pages in terms of design and structure. Thus, it is very difficult for a focused crawler, which determines which web pages to crawl based on a single set of logic or rules, to accurately identify the relevant web pages across a very broad spectrum of web pages. Another reason is that one basic assumption used in focused crawlers, namely that web pages containing a specific type of content are linked to each other, is often untrue. Based on this assumption, focused crawlers do not follow any links from a web page that does not contain the specific type of content, and as a result often fail to crawl relevant web pages that are located further along a chain of links. Forms on web pages also pose difficulties for a focused crawler. Often, it is necessary to fill out a form, such as a search form for job listings, in order to access the relevant web content, such as the job listings and job descriptions. However, due to the immense diversity of forms that exist on the Web, even focused crawlers that apply intelligence in filling out forms are limited and cannot retrieve all relevant content. Other problems with existing focused crawlers include lack of access to restricted content and crawling web pages in an unnatural or illogical order.
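The failure mode caused by the linking assumption can be made concrete with a three-page chain in which a relevant page is reachable only through an irrelevant one. This is a toy sketch, not a model of any actual crawler: the `WEB` graph, the helper `reachable`, the keyword test `is_relevant`, and the function `missed_relevant_pages` are all hypothetical names chosen for the example.

```python
from collections import deque

# Hypothetical chain: relevant page -> irrelevant page -> relevant page.
WEB = {
    "site/": ("jobs at our company", ["site/careers-intro"]),
    "site/careers-intro": ("why work with us", ["site/search"]),
    "site/search": ("jobs search results: engineer", []),
}


def reachable(web, seed, expand):
    """Return the URLs visited breadth-first from seed.

    Links are followed only from pages whose text satisfies expand().
    """
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        text, links = web[url]
        if not expand(text):
            continue  # do not follow links out of this page
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def is_relevant(text):
    """Toy relevance test standing in for a focused crawler's rules."""
    return "jobs" in text


def missed_relevant_pages(web, seed):
    """Relevant pages a free crawl reaches but the focused policy never visits."""
    everything = reachable(web, seed, lambda text: True)
    focused = reachable(web, seed, is_relevant)
    return sorted(url for url in everything - focused if is_relevant(web[url][0]))


missed = missed_relevant_pages(WEB, "site/")
```

Here the focused policy stops at `site/careers-intro`, which contains no job content, so the relevant page `site/search` further along the chain is never fetched.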
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.