1. Field
Embodiments of the invention relate to defining a web crawl space.
2. Description of the Related Art
The World Wide Web (also known as WWW or the “Web”) is a collection of Internet servers that support Web pages, which may include links to other Web pages. For example, a first Web page may contain a link to a second Web page. A Uniform Resource Locator (URL) indicates the location of a Web page. Each Web page may contain, for example, text, graphics, audio, and/or video content. Thus, the Web may be described as a series of interconnected Web pages, with links connecting Web pages from different web sites together. A web site may be described as a related set of Web pages.
A Web browser is a software application that is used to locate and display Web pages. Currently, there are billions of Web pages on the Web.
A web search engine uses a web crawler (sometimes known as a spider) to retrieve web pages from the web. The web search engine then indexes the content of the crawled web pages to make them searchable by users. Web search engines are used to retrieve Web pages on the Web based on some criteria (e.g., entered via the Web browser). That is, Web search engines are designed to return relevant Web pages given a keyword query. For example, the query “HR” issued against a company intranet search engine is expected to return relevant pages in the intranet that are related to Human Resources (HR). The Web search engine uses indexing techniques that relate search terms (e.g., keywords) to Web pages.
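As an illustration of how an index may relate search terms to Web pages, the following is a minimal sketch of an inverted index, which is one common indexing technique and not necessarily the one used by any particular search engine. The function names, the intranet URLs, and the page text are hypothetical and are chosen only to mirror the “HR” query mentioned above.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the pages matching the first term, then intersect.
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical crawled pages (URL -> extracted text).
pages = {
    "http://intranet.example/hr": "HR benefits and payroll",
    "http://intranet.example/it": "IT help desk",
}
index = build_index(pages)
hr_pages = search(index, "HR")
```

In this sketch, the query “HR” is lowercased and looked up directly in the index, so only the page whose text contains that term is returned. A production search engine would additionally rank the results by relevance.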
A website may be described as a domain or a subdomain of a domain. Most websites are specified by names that may be called domain names, subdomain names or hostnames.
A typical web search engine crawls a “web crawl space”. A web crawl space may be described as the Web or some portion of the Web. To define the web crawl space, the web search engine usually needs to learn from an administrator where to start a crawl (the starting point is often called a seed) and what the boundaries of the crawl are. That is, a web crawl space (also referred to as a webspace) is typically defined by what to crawl (described by allow rules), what not to crawl (described by deny rules), and a seed list, which is a list of seed names (e.g., domain names) with which to start the crawl.
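The three parts of such a definition can be sketched as a small data structure. The sketch below rests on several assumptions that are not stated in this description: rules are modeled as simple prefixes of the form "hostname/path", a deny rule takes precedence over an allow rule, and a URL that matches no allow rule is outside the crawl space. The class name, the method names, and the host news.example.org are hypothetical.

```python
from dataclasses import dataclass, field
from urllib.parse import urlsplit


@dataclass
class CrawlSpace:
    seeds: list                                  # seed names with which to start the crawl
    allow: list = field(default_factory=list)    # "host/path" prefixes to crawl
    deny: list = field(default_factory=list)     # "host/path" prefixes not to crawl

    @staticmethod
    def _key(url):
        """Reduce a URL or a bare seed name to 'hostname/path' for matching."""
        parts = urlsplit(url if "://" in url else "//" + url)
        return (parts.hostname or "") + (parts.path or "/")

    def in_scope(self, url):
        """Assumed semantics: any matching deny rule wins; otherwise an allow rule must match."""
        key = self._key(url)
        if any(key.startswith(prefix) for prefix in self.deny):
            return False
        return any(key.startswith(prefix) for prefix in self.allow)


# Hypothetical crawl space: crawl one host, but skip its archives.
space = CrawlSpace(
    seeds=["news.example.org"],
    allow=["news.example.org/"],
    deny=["news.example.org/archives"],
)
```

With this definition, `space.in_scope("http://news.example.org/sports/a.html")` is true, while URLs under `news.example.org/archives` or on any other host are rejected. The trailing slash in the allow prefix keeps a lookalike host such as `news.example.org.other.test` from matching.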
For example, the web site myexample.com may have the following structure:
    en.myexample.com for English webpages,
    zh.myexample.com for Chinese webpages,
    es.myexample.com for Spanish webpages.
In this example, www.myexample.com acts as a homepage that presents the user with a choice of language and directs the user to the appropriate subdomain. Furthermore, en.myexample.com may have the following structure:
    en.myexample.com/archives for outdated stories,
    en.myexample.com/current for current stories,
    en.myexample.com/sports for sports related stories,
    en.myexample.com/entertainment for the latest in Hollywood.
The following Example (1) describes a webspace that is defined by an allow rule and a deny rule, with a seed list of one seed name, www.myexample.com: