An internet (including, but not limited to, the Internet, intranets, extranets and similar networks), is a network of computers, with each computer being identified by a unique address. The addresses are logically subdivided into domains or domain names (e.g. ibm.com, pbs.org, and oranda.net) which allow a user to reference the various addresses. A web, (including, but not limited to, the World Wide Web (WWW)) is a group of these computers accessible to each other via common communication protocols, or languages, including but not limited to Hypertext Transfer Protocol (HTTP). Resources on the computers in each domain are identified with unique addresses called Uniform Resource Locator (URL) addresses (e.g.http://www.ibm.com/products/laptops.htm). A web site is any destination on a web. It can be an entire individual domain, multiple domains, or even a single URL.
Resources can be of many types. Resources with a ".htm" or."html" URL suffix are text files, or pages, formatted in a specific manner called Hypertext Markup Language (HTML). HTML is a collection of tags used to mark blocks of text and assign meaning to them. A specialized computer application called a browser can decode the HTML files and display the information contained within. A hyperlink is a navigable reference in any resource to another resource on the internet.
An internet Search Engine is a web application consisting of
1. Programs which visit and index the web pages on the internet. PA1 2. A database of pages that have been indexed PA1 3. Mechanisms for a user to search the database of pages. PA1 A. A resource that is modified or removed by its owner after it is indexed by a search engine could be incorrectly listed in that search engine for months until an agent visits the site to register the change. PA1 B. A resource may be modified since the last time it was indexed, in which case a user may never be directed to the new content, or incorrectly directed to content that is no longer present. PA1 C. Deleted resources can create the impression for a search engine user that a whole web site has shut down, that the information the user is looking for is removed, or that the web site is not being maintained, when the resources may have simply been moved to another location on the site as part of regular site maintenance. PA1 D. Automated tools such as search engines apply their own criteria in order to determine the relevancy of a particular resource for a particular query. These automated criteria can lead to the search engine returning spurious, misleading, or irrelevant results to a particular query. For example, a recent search for the nursery rhyme "Rub a dub dub, three men in a tub" on a particular search engine resulted in the top ten search results containing discussions of various issues among consenting males. PA1 E. Automated agents are not always able to understand the context of the pages they index, as illustrated by the example above. As such, their one-dimensional capabilities allow web masters to create the impression that the resources on a particular site contain information they do not. This is done to direct traffic to sites by providing incorrect or misleading information, a process called spamming. PA1 F. Most automated agents are incapable of processing the content of resources that are binary in nature, such as applications written in the programming language Java. These applications can display text data, but do not use text or HTML files to do so. Instead, the information is encoded in binary form in the application. As such, an agent cannot determine the content of a resource coded in this manner. PA1 1. The search engine can maintain perfect ongoing registration and indexing of pages by re-indexing at a set interval, as frequently as the web site owner chooses. PA1 2. The search engine can maintain an intelligent database, not limited by the conditions that automated agents have imposed on them and not easily corruptible by web site owners with less ethical practices. PA1 3. The search engine provides a guarantee of integrity to all users, ultimately providing a more valuable service to both users and web site owners. PA1 A. The search engines the owner of the web site would like the resource to be indexed by. PA1 B. The date and time of the last index by each search engine. PA1 C. The date and time a resource was last modified according to the local indexing engine. PA1 D. Flags to indicate whether a specific resource requires updating, inclusion, or removal from a particular search engine database. PA1 A. A resource is marked as deleted if the resource is listed in the current database, but cannot be retrieved. PA1 B. A resource is marked as modified if the date and time of last modification in the current database is earlier than the date and time of last modification provided by the web server for the resource. PA1 C. A resource is added and marked as added if it is present on the web server, but not yet in the database and the web site manager has opted to add it either manually or automatically. PA1 1. The task of spidering the web site has been distributed to the web site owner (see FIG. 1, Box 205c). PA1 2. The web site owner has the capability to protect brand image from being injured by a search engine pointing potential visitors to deleted, irrelevant, or incorrect resource information. PA1 3. The search engine owner has a higher degree of database integrity. Less information storage space is wasted on spurious, nonexistent or incorrect data. PA1 4. The web site owner can directly indicate the keywords and other descriptions that are most appropriate for each resource in the site, as opposed to using the cumbersome HTML `Meta` tag to specify the keywords for the agent. Keywords are words that are particularly relevant to a particular resource and might be used on a search engine to locate that resource. PA1 5. The search engine can create a reverse index of keywords that the individual site owners have identified for each resource. For example, a user could query for a list of all web sites that have listed `dog` as an appropriate keyword. PA1 6. The internet search engine could be used by users to query the content of a particular web site, as opposed to requiring a web site based search engine to index the content. This saves administration effort and computing resources at the web site.
Agents are programs that can travel over the internet and access remote resources. The internet search engine uses agent programs called Spiders, Robots, or Worms, among other names, to inspect the text of resources on web sites. Navigable references to other web resources contained in a resource are called hyperlinks. The agents can follow these hyperlinks to other resources. The process of following hyperlinks to other resources, which are then indexed, and following the hyperlinks contained within the new resource, is called spidering.
The main purpose of an internet search engine is to provide users the ability to query the database of internet content to find content that is relevant to them. A user can visit the search engine web site with a browser and enter a query into a form (or page), including but not limited to an HTML form, provided for the task. The query may be in several different forms, but most common are words, phrases, or questions. The query data is sent to the search engine through a standard interface, including but not limited to the Common Gateway Interface (CGI). The CGI is a means of passing data between a client, a computer requesting data or processing and a program or script on a server, a computer providing data or processing. The combination of form and script is hereinafter referred to as a script application. The search engine will inspect its database for the URLs of resources most likely to relate to the submitted query. The list of URL results is returned to the user, with the format of the returned list varying from engine to engine. Usually it will consist of ten or more hyperlinks per search engine page, where each hyperlink is described and ranked for relevance by the search engine by means of various information such as the title, summary, language, and age of the resource. The returned hyperlinks are typically sorted by relevance, with the highest rated resources near the top of the list.
The World Wide Web consists of thousands of domains and millions of pages of information. The indexing and cataloging of content on an Internet search engine takes large amounts of processing power and time to perform. With millions of resources on the web, and some of the content on those resources changing rapidly (by the day, or even minute), a single search engine cannot possibly maintain a perfect database of all Internet content. Spiders and other agents are continually indexing and re-indexing WWW content, but a single World Wide Web site may be visited by an agent once, then not be visited again for months as the queue of sites the search engine must index grows. A site owner can speed up the process by manually requesting that resources on a site be re-indexed, but this process can get unwieldy for large web sites and is in fact, a guarantee of nothing.
Many current internet search engines support two methods of controlling the resource files that are added to their database. These are the robots.txt file, which is a site-wide, search engine specific control mechanism, and the ROBOTS META HTML tag which is resource file specific, but not search engine specific. Most internet search engines respect both methods, and will not index a file if robots.txt, ROBOTS META tag, or both informs the internet search engine to not index a resource. The use of robots.txt, the ROBOTS META tag and other methods of index control is advocated for the purposes of the present invention.
Commonly, when an internet search engine agent visits a web site for indexing, it first checks the existence of robots.txt at the top level of the site. If the search agent finds robots.txt, it analyses the contents of the file for records such as:
User-agent: * PA0 Disallow: /cgi-bin/SRC PA0 Disallow: /stats PA0 User-agent: Scooter PA0 Disallow: PA0 User-agent: * PA0 Disallow: /avstuff PA0 &lt;META NAME="ROBOTS" CONTENT="NOINDEX, NO FOLLOW"&gt; PA0 &lt;META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"&gt; PA0 &lt;META NAME="ROBOTS" CONTENT="INDEX, NO FOLLOW"&gt; PA0 &lt;META NAME="ROBOTS" CONTENT="INDEX, FOLLOW"&gt;
The above example would instruct all agents not to index any file in directories named /cgi-bin/SRC or /stats. Each search engine agent has its own agent name. For example, AltaVista (currently the largest Internet search engine) has an agent called Scooter. To allow only AltaVista access to directory Iavstuff, the following robots.txt file would be used:
The ROBOTS META tag is found in the file itself. When the internet search engine agent indexes the file, it will look for a HTML tag like one of the following:
INDEX and NOINDEX indicate to all agents whether or not the file should be indexed by that agent. FOLLOW and NOFOLLOW indicate to all agents whether or not they should spider hyperlinks in this document.
For current internet search engines, the present invention process uses the CGI program(s) provided by the search engine in order to add, modify and remove files from the search engine index. However, the process can generally only remove a file from the search engine index if the file no longer exists or if the site owner (under the direction of the process) has configured the site, through the use of robots.txt,the ROBOTS META tag or other methods of index control, so that the search engine will remove the file from its index.
The duration of time between the first time a site is indexed and the next time that information is updated has led to several key problems:
The present invention provides a mechanism for search engine and web site managers to maintain as perfect a registration of web site content as is possible. By augmenting or replacing existing agents and manual registration methods with specialized tools on the local web site (and, when feasible, at the search engine), the current problems with search engine registration and integrity can be eliminated.