There are more than a billion documents available on the World Wide Web (“Web”) over the Internet and this number continues to rapidly increase. These documents (“Web pages”) are stored as files on Web servers. Each of these Web pages has a unique Web address. These addresses are also called Uniform Resource Locators (URLs) or Universal Resource Locators (URLs). URLs are more fully explained in RFC 1738, “Uniform Resource Locators (URL),” by Berners-Lee, Masinter & McCahill.
Static Web Pages and Static Addressing
An Internet device, such as a computer using a Web browser, typically accesses a specific Web page by providing its unique Web address (e.g., a URL). That Web page is a static file stored on a Web server. The file is simply copied without change to the requesting Internet device. Every device accessing the static file sees the same results. The stored file remains unchanged until an authorized user actively modifies the file. These types of Web pages are typically called “static.” A typical URL for a static Web page looks like this:
http://domain.name.com/pagename.htm
The “http://” is the value of the scheme field and it identifies the protocol scheme being used to transmit over the Internet. For the Web, the protocol scheme typically is HyperText Transfer Protocol (HTTP). The “domain.name.com” is the value of the hostname field and it identifies the domain (or the Web server) that hosts the Web page addressed by the static URL. The actual format of this field depends upon the domain name conventions observed. Typically, the format includes a domain name and an extension (e.g., microsoft.com).
The “pagename” is the value of the path field and/or the file-name field. It may include a path to the specific Web page. It includes the file name of the specific Web page. The “.htm” is the value of the file-extension field and it identifies the format of the file. In this example, the format of the static file is the most common format for a Web page: HyperText Markup Language (HTML).
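The field breakdown described above can be illustrated with a minimal Python sketch using the standard `urllib.parse` and `posixpath` modules. The function name `split_static_url` and the dictionary keys are hypothetical labels chosen to mirror the field names in the text. Note that the standard parser reports the scheme as `http`, without the “://” separator.

```python
from urllib.parse import urlsplit
import posixpath


def split_static_url(url):
    """Split a static URL into the fields named in the text (illustrative only)."""
    parts = urlsplit(url)
    # Separate the directory portion of the path from the file name.
    directory, filename = posixpath.split(parts.path)
    # Separate the file name from its extension (the extension keeps its dot).
    name, extension = posixpath.splitext(filename)
    return {
        "scheme": parts.scheme,
        "hostname": parts.hostname,
        "path": directory,
        "file_name": name,
        "file_extension": extension,
    }


fields = split_static_url("http://domain.name.com/pagename.htm")
for field, value in fields.items():
    print(f"{field}: {value}")
```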
Dynamic Web Pages and Dynamic Addressing
The opposite of a static Web page is a “dynamic” Web page. A dynamic Web page is one that is created the moment the page is accessed and it is usually created based upon data in a database. Unlike a static Web page, a dynamic Web page that a viewer sees is not stored intact on a Web server. Instead, a dynamic Web page is generated anew each time it is accessed.
A dynamic Web page is generated based upon a stored file containing instructions and an associated database. Therefore, each instance of a generated dynamic Web page may be different from a previously generated page using the same address. There are many different implementations of dynamic Web pages. The implementations differ from one another in the set of instructions used in the stored file on the Web server and the type of database accessed. Examples of such implementations include Active Server Pages (ASP) by the Microsoft Corporation and “JavaBeans” Activation Framework (JAF).
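The idea of generating a page anew from stored instructions plus a database can be sketched as follows. This is not ASP or JAF. It is a small Python stand-in that assumes an in-memory SQLite database, a hypothetical `inventory` table, and a hypothetical `PAGE_TEMPLATE` playing the role of the stored instruction file.

```python
import sqlite3
from html import escape
from string import Template

# Hypothetical stand-in for the stored file of page-generation instructions.
PAGE_TEMPLATE = Template(
    "<html><body><h1>$name</h1><p>In stock: $quantity</p></body></html>"
)


def generate_page(connection, item_id):
    """Build the page from current database contents each time it is requested."""
    row = connection.execute(
        "SELECT name, quantity FROM inventory WHERE id = ?", (item_id,)
    ).fetchone()
    if row is None:
        return "<html><body><p>Item not found</p></body></html>"
    name, quantity = row
    return PAGE_TEMPLATE.substitute(name=escape(name), quantity=quantity)


# Hypothetical database with a single inventory item.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE inventory (id INTEGER PRIMARY KEY, name TEXT, quantity INTEGER)"
)
connection.execute("INSERT INTO inventory VALUES (1, 'widget', 5)")

first = generate_page(connection, 1)
# The database changes between requests, so the same address yields a new page.
connection.execute("UPDATE inventory SET quantity = 4 WHERE id = 1")
second = generate_page(connection, 1)
print(first)
print(second)
```

The two requests use the same item identifier yet produce different pages, which is the behavior the paragraph above describes.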
A typical URL for a dynamic Web page may look like this:
http://domain.name.com/pagename.asp?parm1=val1&parm2=val2
This example uses an ASP implementation. The protocol scheme, hostname, path, and filename fields are the same as those fields in the static URL. However, there are fields in a dynamic address that are different from fields in a static address.
The extension “.asp” is a value of a file-extension field and identifies the format of the dynamic-page-generation instructions. The extension “.asp” indicates that the page is formatted as an Active Server Page (ASP). The “?” symbol is a signal that the URL points to a dynamic page and it separates the portion of the dynamic URL referring to a specific file from the portion of the URL containing parameters.
The “parm1=” and “parm2=” elements identify the names of categorized parameters. The values of these parameters are used to generate the dynamic Web page. “val1” and “val2” are the values of the parameters. The values are typically used to access items in a database. A parameter consists of a parameter name and its associated value. There can be a series of many parameters. The “&” symbol separates each parameter from the other parameters.
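As a minimal illustration of the “?” and “&” separators described above, the following Python sketch uses the standard `urllib.parse` module to split the example dynamic URL into its file portion and its parameters. The function name `split_dynamic_url` is a hypothetical helper, not part of any implementation discussed here.

```python
from urllib.parse import urlsplit, parse_qsl


def split_dynamic_url(url):
    """Return the file portion and the (name, value) parameter pairs of a URL."""
    parts = urlsplit(url)
    # parts.path is the portion before "?"; parts.query is the portion after it.
    # parse_qsl splits the query on "&" and each parameter on "=".
    return parts.path, parse_qsl(parts.query)


file_part, parameters = split_dynamic_url(
    "http://domain.name.com/pagename.asp?parm1=val1&parm2=val2"
)
print(file_part)
for name, value in parameters:
    print(f"{name} = {value}")
```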
Web Search Engines and Spiders
No central bibliographic authority exists to catalog the information found on the tens of millions of Web sites on the Internet. Generally, two basic approaches are available for finding the proverbial needle in this immense Web haystack: a subject directory or a search engine.
Subject directories, such as “Snap” and “MSN”, catalog Web pages and organize them by subject. Each Web page is manually (or automatically) analyzed and categorized. Users can browse through the various categories and subcategories in the subject directories to find a Web site on a particular topic. Typically, Web pages are categorized and added to the directory by professional Web searchers or by user submissions.
A search engine provides a searchable database of indexed keywords. A search engine examines Web pages for specified keywords and returns a list of the Web pages where the keywords were found. Although search engines are a general class of programs, the term is often used to specifically describe systems like “Alta Vista” and “Excite” that enable users to search for Web pages on the Web.
A search engine includes two main parts: an index searcher and an index generator. An index searcher includes a database of indexing keywords of Web pages and logic for searching that database. An index generator includes a “spider” for gathering Web pages and an “indexer” for generating an index into those pages.
Typically, a search engine works by sending out the spider to fetch as many pages as possible. The indexer then reads these pages and creates an index based on the words contained in each page. Each search engine typically uses a proprietary algorithm to create its indices such that, ideally, only meaningful results are returned for each query.
Spiders are sometimes referred to as “Web-spiders”, “robots”, “Web wanderers”, “crawlers”, “Web-crawlers”, “ants”, or “worms.” These alternative names refer to programs that share the same basic functionality: visiting Web sites by requesting documents from them.
A spider will “crawl” a Web page by following links found on the page. Normal Web browsers (e.g., “Internet Explorer”) are not spiders, because they are operated by humans, and don't automatically retrieve referenced documents.
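The link-following behavior described above can be sketched in Python as a breadth-first traversal. To keep the example self-contained, it makes no network requests. The dictionary `FAKE_WEB`, its example.com addresses, and the `fetch` callable are all hypothetical stand-ins for real page retrieval.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every anchor tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Visit pages breadth-first by following links; return URLs in visit order."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        visited.append(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site used in place of real HTTP requests.
FAKE_WEB = {
    "http://example.com/index.htm": '<a href="a.htm">A</a> <a href="b.htm">B</a>',
    "http://example.com/a.htm": '<a href="index.htm">home</a>',
    "http://example.com/b.htm": '<a href="missing.htm">gone</a>',
}

visited = crawl("http://example.com/index.htm", FAKE_WEB.get)
print(visited)
```

The `seen` set keeps the sketch from revisiting a page when links form a cycle, as they do between the index page and `a.htm`.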
Provided with a page by a spider, an indexer parses the document and inserts selected keywords into the database with references back to the original location of the source page. How this is accomplished depends on the indexer. Some indexers index the titles of the Web pages or the first few paragraphs. Some parse the entire contents and index all words. Some parse the meta-tag or other special hidden tags.
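One of the strategies mentioned above, indexing all words of a page, can be sketched as a simple inverted index. This is only an illustration. The page addresses and text are hypothetical, the pages are assumed to be plain text already stripped of markup, and real indexers use their own proprietary selection and ranking rules.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each keyword to the set of page addresses where it appears."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, keyword):
    """Return the sorted addresses of pages containing the keyword."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical pages as a spider might have delivered them to the indexer.
pages = {
    "http://example.com/a.htm": "Blue widgets for sale",
    "http://example.com/b.htm": "Red widgets and red gadgets",
}
index = build_index(pages)
print(search(index, "widgets"))
print(search(index, "red"))
```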
Meta-tags are special HTML tags that provide information about a Web page. Unlike normal HTML tags, meta-tags do not affect how the page is displayed. Instead, they provide information such as who created the page, how often it is updated, what the page is about, and which keywords represent the page's content. Many search engines use this information when building their indices.
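A minimal sketch of how an indexer might read meta-tags is shown below, using Python's standard `html.parser` module. The class name `MetaTagReader` and the sample page are hypothetical. The sketch assumes the common `name`/`content` attribute form of the meta-tag.

```python
from html.parser import HTMLParser


class MetaTagReader(HTMLParser):
    """Collect name/content pairs from the meta-tags of a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


# Hypothetical page; the meta-tags are not displayed, only the body text is.
SAMPLE_PAGE = (
    "<html><head>"
    '<meta name="author" content="Jane Doe">'
    '<meta name="keywords" content="widgets, gadgets">'
    "</head><body>Hello</body></html>"
)

reader = MetaTagReader()
reader.feed(SAMPLE_PAGE)
keywords = [word.strip() for word in reader.meta.get("keywords", "").split(",")]
print(reader.meta["author"])
print(keywords)
```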
When visiting a Web site, most spiders will check a file called the “robots.txt” file. This file informs the spider whether the spider is authorized to search the site and if so authorized, which pages on the site to retrieve.
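The robots.txt check can be sketched with Python's standard `urllib.robotparser` module. The file contents, the spider name `ExampleSpider`, and the example.com addresses below are hypothetical. The rules are parsed from a string rather than fetched, so the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every spider may crawl anything except /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Ask whether a spider identifying itself as "ExampleSpider" may fetch each page.
allowed = parser.can_fetch("ExampleSpider", "http://example.com/catalog.htm")
blocked = parser.can_fetch("ExampleSpider", "http://example.com/private/data.htm")
print(allowed, blocked)
```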
Single-destination Web sites called “portals” are often a combination of a “subject directory” and a “search engine.” These portals include a search engine (with its spider and indexer) or are closely associated with a third-party search engine. These portals often include an organized and customized subject directory.
The Invisible Web
The Invisible Web is made up of information stored in Web databases. Unlike pages on the visible Web, information in databases is generally inaccessible to the spiders that search engines use to compile their indices.
Search engines typically index the Web by visiting Web pages and indexing their content. In particular, the spiders use the links found on pages to find new Web pages. The links include static URLs.
Most spiders tend to ignore the content of a dynamic Web address and thus, the contents of the referenced dynamic Web page. These dynamic Web pages are often ignored because the format of their dynamic URL is different from the URL format of a static Web page. Spiders are often specifically programmed to ignore dynamic addresses because of the complexity of navigating through dynamic pages.
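The text does not specify how a spider recognizes a dynamic address. As one plausible sketch, a spider might treat any URL carrying a parameter portion after the “?” symbol as dynamic and drop it from the list of links to follow. The function name `looks_dynamic` and the sample links are hypothetical.

```python
from urllib.parse import urlsplit


def looks_dynamic(url):
    """Assumed heuristic: a URL with a non-empty query portion is dynamic."""
    return bool(urlsplit(url).query)


# Hypothetical links found on a crawled page.
links = [
    "http://domain.name.com/pagename.htm",
    "http://domain.name.com/pagename.asp?parm1=val1&parm2=val2",
]

# Only static addresses are kept, so the dynamic page is never visited.
links_to_follow = [url for url in links if not looks_dynamic(url)]
print(links_to_follow)
```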
The information found in the databases of dynamic Web sites is not indexed by search engines. Therefore, these dynamic Web sites are not found by those using search engines to search the Web. This huge, unmapped region of the Internet is called the “Invisible Web.”
E-commerce sites with on-line shopping catalogs typically use dynamic Web pages because the inventory in their databases is changing constantly. These sites wish to be indexed by search engines to help bring users to their site.
Conventional Solution
To allow search engines to index their sites, dynamic sites (such as e-commerce sites with inventory) periodically generate “snapshots” of their dynamic Web pages. These snapshots are static Web pages, each generated from the corresponding dynamic Web page at a particular moment in time.
However, there are several significant drawbacks to the “snapshot” approach. In a short period of time, the snapshots no longer represent the current inventory. Periodically generating the snapshots consumes processing and storage resources.
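The snapshot approach and its staleness problem can be sketched as follows. This is an illustration under stated assumptions, not any site's actual method: the `inventory` dictionary stands in for the site's database, `render_dynamic` stands in for dynamic page generation, and the snapshot files are written to a temporary directory.

```python
import tempfile
from pathlib import Path

# Hypothetical inventory standing in for the site's database.
inventory = {"widget": 5, "gadget": 2}


def render_dynamic(item):
    """Generate the page from the current inventory, as a dynamic page would."""
    return (
        f"<html><body><h1>{item}</h1>"
        f"<p>In stock: {inventory[item]}</p></body></html>"
    )


def write_snapshots(directory):
    """Store one static .htm file per item, frozen at the moment of the call."""
    paths = {}
    for item in inventory:
        path = Path(directory) / f"{item}.htm"
        path.write_text(render_dynamic(item), encoding="utf-8")
        paths[item] = path
    return paths


with tempfile.TemporaryDirectory() as directory:
    snapshots = write_snapshots(directory)
    inventory["widget"] = 4  # the inventory changes after the snapshot is taken
    stale = snapshots["widget"].read_text(encoding="utf-8")
    current = render_dynamic("widget")

print(stale)
print(current)
```

The stored snapshot still reports the old quantity while the dynamically generated page reports the new one. Each snapshot run also writes one file per item, which reflects the processing and storage cost noted above.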
Although the snapshot approach does allow a search engine to index the dynamic Web site, the URLs stored by the search engine are static URLs. Therefore, the search engine ultimately directs a user to the snapshot pages rather than to the preferable dynamic pages. Dynamic sites would prefer users to use their dynamic page to take full advantage of the dynamic nature of the site. If the users are using the snapshot pages, then the information seen by the user may not be accurate.