World Wide Web-General
The Internet is a worldwide system of computer networks and is a public, self-sustaining facility that is accessible to tens of millions of people worldwide. The most widely used part of the Internet is the World Wide Web, often abbreviated “WWW” or referred to simply as “the Web”. The web is an Internet service that organizes information through the use of hypermedia. The HyperText Markup Language (“HTML”) is typically used to specify the contents and format of a hypermedia document (e.g., a web page).
In this context, an HTML file is a file that contains the source code for a particular web page. A web page is the image or collection of images that is displayed to a user when a particular HTML file is rendered by a browser application program. Unless specifically stated, an electronic or web document may refer to either the source code for a particular web page or the web page itself. Each page can contain embedded references to images, audio, video or other web documents. The most common type of reference used to identify and locate resources on the Internet is the Uniform Resource Locator, or URL. In the context of the web, a user, using a web browser, browses for information by following references that are embedded in each of the documents. The HyperText Transfer Protocol (“HTTP”) is the protocol used to access a web document and the references that are based on HTTP are referred to as hyperlinks (formerly, “hypertext links”).
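As an illustration of how references embedded in an HTML file can be located, the following is a minimal sketch using Python's standard `html.parser` module. The class name `ReferenceExtractor`, the helper `extract_references`, and the example URL are hypothetical names chosen for this sketch. It assumes that hyperlinks appear as `<a href>` and that embedded media appear as `src` attributes on `<img>`, `<audio>`, and `<video>` tags.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ReferenceExtractor(HTMLParser):
    """Collects hyperlinks and embedded resource references from HTML source."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hyperlinks = []  # targets of <a href="...">
        self.embedded = []    # targets of <img>, <audio>, <video> src="..."

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            # Resolve relative references against the URL of the document.
            self.hyperlinks.append(urljoin(self.base_url, attr_map["href"]))
        elif tag in ("img", "audio", "video") and attr_map.get("src"):
            self.embedded.append(urljoin(self.base_url, attr_map["src"]))


def extract_references(html_source, base_url):
    """Returns (hyperlinks, embedded_resources) found in html_source."""
    parser = ReferenceExtractor(base_url)
    parser.feed(html_source)
    return parser.hyperlinks, parser.embedded


page = '<html><body><a href="/about.html">About</a><img src="logo.png"></body></html>'
links, media = extract_references(page, "http://example.com/index.html")
print(links)
print(media)
```

A browser or crawler would follow the entries in `links` to reach other web documents, while the entries in `media` identify resources rendered as part of the page itself.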
Search Engines
Through the use of the web, individuals have access to millions of pages of information. However, a significant drawback of using the web is that, because there is so little organization to the web, it can at times be extremely difficult for users to locate the particular pages that contain the information that is of interest to them. To address this problem, a mechanism known as a “search engine” has been developed to index a large number of web pages and to provide an interface that can be used to search the indexed information by entering certain words or phrases to be queried. These search terms are often referred to as “keywords”.
Indexes used by search engines are conceptually similar to the normal indexes that are typically found at the end of a book, in that both kinds of indexes comprise an ordered list of information accompanied by the location of the information. An “index word set” of a document is the set of words that are mapped to the document in an index.
For example, an index word set of a web page is the set of words that are mapped to the web page, in an index. For documents that are not indexed, the index word set is empty.
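The notion of an index word set can be sketched with a small inverted index. This is only an illustrative sketch: the function names `build_index`, `index_word_set`, and `search`, the whitespace tokenization, and the example URLs are all assumptions made here, not details taken from any particular search engine.

```python
from collections import defaultdict


def build_index(documents):
    """Maps each word to the set of document locations (URLs) containing it."""
    index = defaultdict(set)
    for location, text in documents.items():
        for word in text.lower().split():
            index[word].add(location)
    return index


def index_word_set(index, location):
    """Returns the set of words that the index maps to the given document."""
    return {word for word, locations in index.items() if location in locations}


def search(index, keywords):
    """Returns the locations of documents that contain every keyword."""
    postings = [index.get(keyword.lower(), set()) for keyword in keywords]
    return set.intersection(*postings) if postings else set()


documents = {
    "http://example.com/a": "dead links on the web",
    "http://example.com/b": "search engines index the web",
}
index = build_index(documents)
print(index_word_set(index, "http://example.com/a"))
print(index_word_set(index, "http://example.com/unindexed"))
print(search(index, ["the", "web"]))
```

For the document that was never indexed, `index_word_set` returns an empty set, matching the definition above.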
Although there are many popular Internet search engines, they are generally constructed from the same three parts. First, each search engine has at least one, and typically more than one, “web crawler” (also referred to as a “crawler”, “spider”, or “robot”) that “crawls” across the Internet in a methodical and automated manner to locate web documents around the world. Upon locating a document, the crawler stores the document and the document's URL, and follows any hyperlinks associated with the document to locate other web documents. Feature extraction engines then process the crawled and locally stored documents to extract structured information from the documents. In response to a search query, some structured information that satisfies the query (or documents that contain the information that satisfies the query) is usually displayed to the user along with a link pointing to the source of that information. For example, search results typically display a small portion of the page content and have a link pointing to the original page containing that information.
Second, each search engine contains an indexing mechanism that indexes certain information about the documents that were located by the crawler. In general, index information is generated based on the contents of the HTML file associated with the document. The indexing mechanism stores the index information in large databases that can typically hold an enormous amount of information. Third, each search engine provides a search tool that allows users, through a user interface, to search the databases in order to locate specific documents, and their location on the web (e.g., a URL), that contain information that is of interest to them.
The search engine provides an interface that allows users to specify their search criteria (e.g., keywords) and, after a search is performed, an interface for displaying the search results. Typically, the search engine orders the search results prior to presenting the search results interface to the user. The order usually takes the form of a “ranking”, where the document with the highest ranking is the document considered most likely to satisfy the interest reflected in the search criteria specified by the user. Once the matching documents have been determined, and the display order of those documents has been determined, the search engine sends to the user that issued the search a “search results page” that presents information about the matching documents in the selected display order.
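The ordering step can be illustrated with a deliberately simple ranking. The sketch below scores each document by how often the keywords occur in it, which is an assumption made purely for illustration; actual search engines use far more elaborate ranking signals. The function name `rank` and the example documents are hypothetical.

```python
def rank(documents, keywords):
    """Returns matching document URLs, highest-scoring first.

    The score is the total number of keyword occurrences in the document,
    a stand-in for whatever ranking function a real search engine applies.
    """
    terms = [keyword.lower() for keyword in keywords]
    scored = []
    for url, text in documents.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score > 0:  # keep only documents that match at least one keyword
            scored.append((score, url))
    # Sort by descending score; break ties by URL so the order is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]


documents = {
    "http://example.com/a": "dead links and more dead links",
    "http://example.com/b": "dead ends",
    "http://example.com/c": "unrelated content",
}
results = rank(documents, ["dead", "links"])
print(results)
```

The resulting list corresponds to the display order of a search results page: each entry would be presented with a link back to its source document.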
Dead Links
Dead links are a significant problem on the web. Many Information Integration Systems (IIS) on the Web, such as search engines, job portals, shopping search sites, travel search sites, applications that include syndication feeds (e.g., Really Simple Syndication (RSS) based applications), and many more, usually gather and display information collected from all over the Web and attempt to serve as a primary resource for all information needs for users. The information initially rendered to the user is concise and relevant and almost always has a link or URL pointing to the actual source of the information that, at one time, contained the requested information in its entirety.
Because the entire crawling and indexing process takes a considerable amount of time, and because Web pages are ever changing, there is always a high probability that (1) a link or URL is now “dead” (the page is gone), (2) the user is redirected to another page which does not have the expected information, or (3) the content itself has been modified by the Web master and, consequently, no longer contains the expected information. It detracts from a Web user's experience to click on a link looking for some information, only to find that the page has disappeared or that the page does not have the information claimed by the search results. Hence, it is significantly beneficial for Information Integration Systems to detect such dead links and prevent them from being shown to the user, thereby preventing a bad user experience.
Dead links on the Web can occur for different reasons and, therefore, can be classified into different types, as follows. (1) The Web page is no longer locatable using the link. This may happen if the page or Web site has been moved or removed (i.e., is dead). (2) The Web page is still alive, but the content has changed and no longer has the information that is expected to be found there. (3) The URL of the Web page requires HTTP “POST” data to be supplied in order to view the page contents, because the expected content may be dynamically generated by a script based on the POST data supplied. (4) Visiting that Web page requires a cookie to be set from another page and, therefore, the page that sets the cookie needs to be visited before the desired page. Some existing systems attempt to resolve one or more of the problems that result in what manifests to the user as dead links. However, such systems have their shortcomings, some of which are as follows.
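The four types listed above can be captured in a small data structure. The sketch below is only one possible encoding: the names `DeadLinkType`, `LinkObservation`, and `classify`, and the idea that the relevant facts about a link are already known as booleans, are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DeadLinkType(Enum):
    NOT_LOCATABLE = 1       # page or site moved or removed
    CONTENT_CHANGED = 2     # page is alive but lacks the expected information
    REQUIRES_POST_DATA = 3  # content is generated from supplied POST data
    REQUIRES_COOKIE = 4     # a cookie from another page must be set first


@dataclass
class LinkObservation:
    """Hypothetical facts gathered when a user follows the link directly."""
    locatable: bool
    has_expected_content: bool
    needs_post_data: bool = False
    needs_cookie: bool = False


def classify(observation: LinkObservation) -> Optional[DeadLinkType]:
    """Returns the dead link type, or None if the link behaves as expected."""
    if not observation.locatable:
        return DeadLinkType.NOT_LOCATABLE
    if observation.has_expected_content:
        return None
    if observation.needs_post_data:
        return DeadLinkType.REQUIRES_POST_DATA
    if observation.needs_cookie:
        return DeadLinkType.REQUIRES_COOKIE
    return DeadLinkType.CONTENT_CHANGED


print(classify(LinkObservation(locatable=False, has_expected_content=False)))
print(classify(LinkObservation(locatable=True, has_expected_content=False, needs_cookie=True)))
```

In practice, determining whether a page needs POST data or a prerequisite cookie is the hard part; this sketch only shows how the outcome might be recorded once those facts are available.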
Type 1 dead links are detectable by simply reviewing the header timestamp of a response to an HTTP request, or a connection failure. This manner of detecting dead links is used in almost all, if not all, existing systems.
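A check of this kind might look like the following sketch, which uses Python's standard `urllib`. The text above mentions only the response header and connection failures; treating the HTTP status codes 404 and 410 as the signal that a page is gone is an assumption of this sketch, as are the function name `is_type1_dead` and the injectable `opener` parameter (added so the check can be exercised without network access).

```python
import urllib.error
import urllib.request


def is_type1_dead(url, opener=urllib.request.urlopen, timeout=10):
    """Returns True if the page cannot be located using the URL.

    Only the response header is requested (HTTP HEAD); the body is not fetched.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status >= 400
    except urllib.error.HTTPError as error:
        # Assumption: 404 (Not Found) and 410 (Gone) mean the page is dead.
        # Other error codes (e.g., 403, 500) are not treated as Type 1 here.
        return error.code in (404, 410)
    except (urllib.error.URLError, OSError):
        # Connection failure: host unreachable, DNS failure, timeout, etc.
        # Treating every such failure as "dead" is a simplification.
        return True
```

Calling `is_type1_dead("http://example.com/some-page")` would issue a single HEAD request. Note that such a check says nothing about Type 2, 3, or 4 dead links, for which the request succeeds even though the user does not see the expected content.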
One approach to managing dead links in an information integration system is to refresh the crawl quite frequently. For a crawl that is directed to a relatively small subset of the Web, the crawl may be refreshed every couple of days or so. However, for a crawl directed to a relatively large subset of the Web, or the entire Web, the crawl refresh rate is more on the order of every 30 days or so. Given the high volume of pages from the Web, and the dynamic nature of the Web, it is impractical, if not impossible, to use the refresh crawl approach effectively when the goal is to index as much of the entire Web as possible.
One approach to managing dead links in an information integration system employs a program that checks the validity of all the links by visiting each of the indexed pages that were previously crawled. Even though this approach might prevent dead links from appearing in such systems, this approach results in a significant number of live pages appearing as dead links to these programs. For example, type 4 dead links appear valid because during the crawl, and even during a refresh, the cookies to reach these links are set in the previous pages. However, these links appear as dead links when the users try to visit the page directly from the link (e.g., a link presented in a search result), in which case the cookie is not set.
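The cookie problem described above can be reproduced with a toy, in-memory site. Everything in this sketch is hypothetical: the paths `/search` and `/listing`, the cookie name `session`, and the use of status 403 for a request that lacks the cookie are assumptions chosen only to show why a crawler and a user can observe different results for the same link.

```python
def toy_site(path, cookies):
    """Simulates a site; returns (status, body, cookies_to_set)."""
    if path == "/search":
        # Visiting the search page sets the cookie that /listing depends on.
        return 200, "search results", {"session": "abc"}
    if path == "/listing":
        if cookies.get("session") == "abc":
            return 200, "listing details", {}
        return 403, "session required", {}
    return 404, "not found", {}


def visit(paths, cookies=None):
    """Visits paths in order with one cookie jar; returns the last status."""
    jar = dict(cookies or {})
    status = None
    for path in paths:
        status, _body, cookies_to_set = toy_site(path, jar)
        jar.update(cookies_to_set)
    return status


# A crawler reaches /listing by way of /search, so the cookie is already set.
crawler_status = visit(["/search", "/listing"])
# A user clicking a search result goes straight to /listing with no cookie.
user_status = visit(["/listing"])
print(crawler_status, user_status)
```

The link checker that revisits pages in crawl order sees a successful response, while the user who arrives directly from a search result does not.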
Systematically checking the validity of all the indexed pages requires bombarding Web sites with multiple requests (i.e., one per URL), which may result in what are effectively denial of service (DoS) attacks, and which may consequently prompt the Web masters to block the programs from accessing their web sites. Such a strategy also results in the dead link detection process itself taking too long (e.g., perhaps as long as the actual crawl) if the volume of pages requested is large, thereby rendering the process relatively ineffective.
One approach to managing dead links in an information integration system is to only fetch and check the header of a page, and to review a timestamp that is supposed to indicate the last time the page content was changed. However, many Web masters do not update this timestamp, so this process is also relatively ineffective.
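A header-based check of this kind could be sketched as follows, assuming the timestamp in question is the standard HTTP `Last-Modified` response header. The function name `changed_since` and the representation of the headers as a plain dictionary are assumptions of this sketch.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def changed_since(headers, indexed_at):
    """Compares the Last-Modified header with the time the page was indexed.

    Returns True if the page reports a modification after indexed_at, False
    if it does not, and None if the header is missing or cannot be parsed
    (in which case the header alone tells us nothing).
    """
    value = headers.get("Last-Modified")
    if value is None:
        return None
    try:
        modified_at = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if modified_at.tzinfo is None:
        # A date without a usable UTC offset is interpreted as UTC here.
        modified_at = modified_at.replace(tzinfo=timezone.utc)
    return modified_at > indexed_at


indexed_at = datetime(2015, 10, 1, tzinfo=timezone.utc)
print(changed_since({"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}, indexed_at))
print(changed_since({}, indexed_at))
```

The `None` result reflects the weakness noted above: when the timestamp is absent or not maintained, the check cannot distinguish a changed page from an unchanged one.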
One approach to managing dead links in an information integration system is to use a dead link detection tool, based on an extraction tool, through which prerequisite URLs are manually detected and provided to the system on a per-site basis. This approach is manual and can be labor-intensive and error-prone.
Based on the foregoing, there is a need for improved techniques for detecting actual dead links, and detecting them in an efficient and site-friendly manner.
Any approaches that may be described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.