An internet (including, but not limited to, the Internet, intranets, extranets and similar networks) is a network of computers, with each computer being identified by a unique address. The addresses are logically subdivided into domains or domain names (e.g. vistaprint.com, vistaprint.co.uk, uspto.gov, etc.) which allow a user to reference the various addresses. A web (including, but not limited to, the World Wide Web (WWW)) is a group of these computers accessible to each other via common communication protocols, or languages, including but not limited to Hypertext Transfer Protocol (HTTP). Resources on the computers in each domain are identified with unique addresses called Uniform Resource Locator (URL) addresses (e.g. http://www.uspto.gov/forms/index.jsp). A web site is any destination on a web. It can be an entire individual domain, multiple domains, or even a single URL.
Resources can be of many types. Resources with a “.htm” or “.html” URL suffix are text files, or pages, formatted in a specific manner called Hypertext Markup Language (HTML). HTML is a collection of tags used to mark blocks of text and assign meaning to them. A specialized computer application called a browser can decode the HTML files and display the information contained within. Often, an HTML file will contain references to image or document files which are stored on a computer connected to the internet and which are to be loaded and displayed within the page presented to the user within the browser. For example, a company logo may be stored as an image file on a company server. Web pages of the company web site may include HTML image tags that specify the location of an image to be displayed (<img src="pix.logo.jpg">). Preferably, the image tags also include attribute information containing dimensional information about the image to allow the browser to accurately allocate space for the image when rendering the page on the user's display. Typically, the text on the page is rendered first, and referenced sources such as images and documents are then downloaded and rendered on the display by the browser. If the dimensional attributes are not specified, the browser may have to shift text around after the image loads in order to accommodate the image, which is an undesirable effect from the user's point of view. An image tag may also include an “alt” attribute which can be used to define a name or other identifying information for the image. In some browsers, when a user hovers over the image or image placeholder, a popup appears containing the name or identifying information.
A hyperlink is a navigable reference in any resource to another resource on the Internet.
An internet Search Engine is a web application that includes a crawler program which visits resources on the internet (by following every link on a site or from a beginning URL) and extracts data about the visited resources into a Resource Repository. Some search engines store the entire resource along with information about the resource in the Resource Repository. Others store only part of the content of a visited page. An indexer program processes the Resource Repository and generates an index to allow faster and easier retrieval of search query results. A Search Engine also includes a Query Engine which receives queries (typically text or boolean queries), examines the index, and returns a set of search results which the Search Engine determines to be the best match for the query.
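The indexer and Query Engine roles described above can be illustrated with a minimal sketch. It assumes a Resource Repository held as an in-memory mapping from URL to stored text; the names `tokenize`, `build_index`, and `query`, the example.com URLs, and the simple boolean-AND matching are all hypothetical choices made for illustration and are not taken from any particular search engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(repository):
    """Indexer: map each word to the set of URLs whose stored text contains it."""
    index = defaultdict(set)
    for url, text in repository.items():
        for word in tokenize(text):
            index[word].add(url)
    return index


def query(index, terms):
    """Query engine: return the URLs containing every query term, sorted."""
    words = tokenize(terms)
    if not words:
        return []
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)


# Hypothetical repository contents, for illustration only.
repository = {
    "http://example.com/cards": "Order business cards online",
    "http://example.com/contact": "Contact us about your order",
}
index = build_index(repository)
```

With this repository, a query for "business cards" matches only the first URL, while a query for "order" matches both. A production engine would add relevance ranking on top of this lookup, which the sketch omits.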
A search engine crawler is a program that travels over the internet and accesses remote resources. The crawler inspects the text of resources on web sites and can follow the hyperlinks contained in a resource to other resources. The process of following hyperlinks to other resources, which are then indexed, and in turn following the hyperlinks contained within each newly visited resource, is called crawling.
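The crawling process described above can be sketched as a breadth-first traversal. This is a simplified illustration under stated assumptions: the `crawl` function, the `LinkExtractor` class, and the `PAGES` dictionary standing in for the network are hypothetical, and a real crawler would fetch resources over HTTP rather than from memory.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Visit start_url, then follow hyperlinks breadth-first.

    `fetch` is any callable returning the HTML for a URL, or None if the
    resource is unavailable. Returns the URLs visited, in visit order.
    """
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # skip resources that could not be retrieved
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited


# Hypothetical three-page site held in memory instead of on a server.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">Home</a> <a href="/missing.html">X</a>',
}
order = crawl("http://example.com/", PAGES.get)
```

The `seen` set keeps the crawler from revisiting a resource when pages link to each other in a cycle, as the home page and `b.html` do here.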
The main purpose of an internet search engine is to provide users the ability to query the database of internet content to find content that is relevant to them. A user can visit the search engine web site with a browser and enter a query into a form (or page), including but not limited to an HTML form or an ASPX form, provided for the task. The query may take several different forms, the most common being words, phrases, or questions. The query data is sent to the search engine through a standard interface, including but not limited to the Common Gateway Interface (CGI). The CGI is a means of passing data between a client (a computer requesting data or processing) and a program or script on a server (a computer providing data or processing). The combination of form and script is hereinafter referred to as a script application. The search engine will inspect its index for the URLs of resources most likely to relate to the submitted query. The list of URL results is returned to the user, with the format of the returned list varying from engine to engine. Usually the search results will consist of ten or more hyperlinks per search engine page, where each hyperlink is described and ranked for relevance by the search engine by means of various information such as the title, summary, language, and age of the resource. The returned hyperlinks are typically sorted by relevance, with the highest rated resources near the top of the list.
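As a rough illustration of how query data from a form can travel from client to server, the sketch below encodes a query into a URL query string and decodes it again on the receiving side. The parameter name `q`, the host search.example.com, and the helper names are assumptions made for this example; the text above does not specify them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(base_url, terms):
    """Client side: encode the user's query terms into the request URL."""
    return base_url + "?" + urlencode({"q": terms})


def read_query(url):
    """Server side: recover the query terms from the request URL."""
    params = parse_qs(urlsplit(url).query)
    return params.get("q", [""])[0]


# Hypothetical search script location, for illustration only.
url = build_search_url("http://search.example.com/cgi-bin/search", "business cards")
```

Here `url` becomes `http://search.example.com/cgi-bin/search?q=business+cards`, and `read_query(url)` returns the original text `business cards`.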
Depending on the query, the returned search results may or may not be considered highly relevant by the user. Often, web sites contain pages, and web pages contain elements, that have content that is not relevant to the purpose of the site or page. For example, many web sites include index pages that contain all of the key words on the site, yet the page itself contains no significant information as purportedly sought by the user via the query terms. The index page is not usually relevant to the purpose of the site, yet contains a keyword in the query terms and thus may appear in the search results as highly relevant to the user's search. The same problem may occur at the page level. On multi-page web sites, every page of the site typically includes one or more navigation menus with links to other pages of the site. The names of the links can be general or quite specific. If the link names are general, for example, “Contact Us”, the fact that the navigation menu is crawled on every page is generally not a problem—that is, since so many web pages contain this text, any given page having the term “Contact Us” will generally not rise any higher in the search results for a query that contains the term “Contact” than any other page that also contains the term. If the link names are specific, for example, “Business Cards”, then a search query containing the term “business card” may return multiple pages of the web site based on the navigation menu link name which do not actually contain any other connection with the term “business card”. In these instances, it would therefore be useful to be able to limit the types of pages and elements searched by the crawler.
U.S. Pat. No. 6,253,198 entitled “Process For Maintaining Ongoing Registration For Pages On A Given Search Engine” describes two methods of controlling the resource files that are added to a search engine database. The first method includes the use of a robots.txt file, which is a site-wide, search engine specific control mechanism. The second method includes the use of the ROBOTS META HTML tag, which is resource file specific, but not search engine specific. Most internet search engines respect both methods, and will not index a file if the robots.txt file, the ROBOTS META tag, or both inform the internet search engine not to index a resource. The robots.txt file, the ROBOTS META tag and other methods of search engine control are intended to allow a site administrator to control what, if any, of the web site content is crawled by outside Search Engines. For providing search capability of its own web site, the administrator may wish to allow more in-depth searching yet control the scope of the search on a global, page, and element basis. Furthermore, the site administrator may wish to apply different search rules to different specific pages and elements. Neither the robots.txt file nor the ROBOTS META tag allows this functionality.
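To make the site-wide, search engine specific nature of robots.txt concrete, the following sketch checks URLs against a small rule set using the `urllib.robotparser` module from the Python standard library. The robots.txt contents, the agent names `ExampleBot` and `OtherBot`, and the example.com URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule set for a named crawler, one for all others.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Return True if the rules permit `agent` to fetch `url`."""
    return parser.can_fetch(agent, url)
```

Under these rules, `allowed("ExampleBot", "http://example.com/private/x.html")` is false while the same URL is permitted for `OtherBot`. The rules operate on URL paths for a whole site; they offer no way to exclude an individual element within a page, which is the limitation noted above.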
It would be desirable to be able to perform a crawl on only particular areas (domains) of a web site and only particular types of pages and/or elements on a page. It would also be desirable to allow a user setting up a crawl to configure rules for the crawl, from the top level URL down to individual page elements. It would further be desirable to allow the user to set up rules on a per-domain, per-page, and per-element basis, and to allow rule inheritance.
The World Wide Web consists of thousands of domains and millions of pages of information. The indexing and cataloging of content on an Internet search engine takes large amounts of processing power and time to perform due to the sheer volume of information to retrieve and index, network delays, and page loading latencies. Accordingly, web crawlers are typically multi-threaded in order to crawl multiple areas of the web in parallel and to make best use of available CPU and memory. Each thread requests a single page at a time, but since multiple threads are spawned, crawlers are much more aggressive at fetching content than a regular user, and can process that content at a much faster rate.
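The multi-threaded fetching described above can be sketched with a thread pool. This is an illustrative outline only: `fetch_all`, `fake_fetch`, and the `max_threads` parameter are hypothetical names, and `fake_fetch` simulates network latency with a short sleep rather than issuing real requests.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_threads=4):
    """Fetch every URL using up to `max_threads` worker threads in parallel.

    Each worker handles one page at a time; the pool as a whole keeps
    several requests in flight at once.
    """
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        bodies = list(pool.map(fetch, urls))  # results keep the input order
    return dict(zip(urls, bodies))


def fake_fetch(url):
    """Stand-in for a network request (assumption: 10 ms of latency)."""
    time.sleep(0.01)
    return "content of " + url


# Hypothetical URLs, for illustration only.
urls = ["http://example.com/page%d.html" % i for i in range(8)]
pages = fetch_all(urls, fake_fetch, max_threads=4)
```

Raising `max_threads` increases the number of simultaneous requests a server receives, which is why the thread count is a natural setting to expose when the target is a single site.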
It may occasionally be desirable to provide search capability for a single web site or area of the web. For example, it may be desirable for a company to provide search capability on the content of its web site to allow visitors to the web site to easily locate pages and/or products of interest. Existing multi-threaded search engines are designed to crawl the World Wide Web and therefore must be aggressive by nature in order to crawl the Web in an amount of time that is reasonably short given the size of the task. For crawling small areas of the web, for example a company web site, such search engines may be too powerful in that they may overwhelm the server hosting the web site by bombarding it with multiple crawling threads. This has the undesired effect of rendering the server slow or even non-responsive to visitors or users of the web site.
It would therefore further be desirable to provide a mechanism for allowing a user to configure the speed and parallelism of a crawl to suit crawls of different scope, from a single web site to larger areas of the web.