1. Field of the Invention
The present invention is related to handling redirects in a search engine.
2. Description of the Related Art
The World Wide Web (also known as WWW or the “Web”) is a collection of Internet servers that support Web pages, which may include links to other Web pages. For example, a first Web page may contain a link to a second Web page. A Uniform Resource Locator (URL) indicates the location of a Web page. Each Web page may contain, for example, text, graphics, audio, and/or video content.
A Web browser is a software application that is used to locate and display Web pages. Currently, there are billions of Web pages on the Web.
Web search engines are used to retrieve Web pages on the Web based on some criteria (e.g., entered via the Web browser). That is, Web search engines are designed to return relevant Web pages given a keyword query. For example, the query “HR” issued against a company intranet search engine is expected to return relevant pages in the intranet that are related to Human Resources (HR). The Web search engine uses indexing techniques that relate search terms (e.g., keywords) to Web pages.
Some Web pages do not contain content, but, instead, contain a “redirect” to another Web page. For example, if a given Web page A (i.e., a source) redirects to another Web page B (i.e., a target), the Web browser shows Web page B whenever a request for Web page A is received. There are several ways of implementing redirects, including Hyper Text Transfer Protocol (HTTP) redirects (e.g., with HTTP return codes 301 and 302), the use of a META REFRESH tag in Hyper Text Markup Language (HTML), and scripting languages such as JavaScript.
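As an illustration only, the three redirect mechanisms named above can be recognized with a short Python sketch. The function name `find_redirect_target`, its parameters, and both regular expressions are assumptions made for this example and are not taken from any particular search engine. The patterns are rough heuristics: the META REFRESH pattern expects the `http-equiv` attribute to precede `content`, and the JavaScript pattern only catches simple assignments to `location`.

```python
import re
from typing import Mapping, Optional

# Heuristic: <meta http-equiv="refresh" content="0; url=http://example.com/b">
META_REFRESH = re.compile(
    r'<meta[^>]*http-equiv\s*=\s*["\']?refresh["\']?[^>]*'
    r'content\s*=\s*["\']?\s*\d+\s*;\s*url\s*=\s*([^"\'>\s]+)',
    re.IGNORECASE,
)

# Heuristic: window.location = "..." or location.href = '...'
JS_REDIRECT = re.compile(
    r'(?:window\.)?location(?:\.href)?\s*=\s*["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def find_redirect_target(
    status: int, headers: Mapping[str, str], body: str
) -> Optional[str]:
    """Return the redirect target of a fetched page, or None if there is none."""
    # HTTP redirect: return codes 301 and 302 carry the target in Location.
    if status in (301, 302):
        for name, value in headers.items():
            if name.lower() == "location":
                return value
        return None
    # HTML-level redirect: META REFRESH tag first, then a script assignment.
    match = META_REFRESH.search(body) or JS_REDIRECT.search(body)
    return match.group(1) if match else None
```

A page for which the function returns a URL would be treated as a source, and the returned URL as its target.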
Redirects are a challenge to Web search engines because the content of a target page should be used to index the source page. For instance, if Web page A redirects to Web page B, then the URL of Web page A should be indexed with the content of Web page B because Web page A has no content, just the redirect (e.g., the JavaScript code that performs the redirect). Moreover, redirects may form chains (e.g., Web page A redirects to Web page B, which in turn redirects to Web page C), in which case the transitive closure relationship should be resolved so that each source page is associated with the final target of its chain. Additionally, redirect chains may have cycles (e.g., Web page A redirects to Web page B, which redirects to Web page C, which redirects to Web page A), in which case these Web pages should not be indexed because the Web browser cannot display them. Conventional search engines do not handle redirects well. Additionally, conventional search engines handle redirects when “crawling” (i.e., retrieving Web pages), and so they lose the ability to use redirect information in conjunction with, for example, ranking, duplicate detection, and anchor text processing.
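The chain and cycle cases described above can be sketched in Python as follows. This is a minimal sketch, not the method of any particular search engine: it assumes the redirects have already been collected into a dictionary mapping each source URL to its immediate target, and the names `resolve_redirect` and `redirects` are hypothetical.

```python
from typing import Dict, Optional


def resolve_redirect(url: str, redirects: Dict[str, str]) -> Optional[str]:
    """Follow a redirect chain from url to its final target.

    Returns the final target URL (url itself if it does not redirect),
    or None if the chain contains a cycle and so should not be indexed.
    """
    seen = {url}
    current = url
    while current in redirects:
        current = redirects[current]
        if current in seen:
            return None  # cycle: the chain returns to a page already visited
        seen.add(current)
    return current


# Chain: A redirects to B, which redirects to C.
chain = {"A": "B", "B": "C"}
print(resolve_redirect("A", chain))  # C

# Cycle: A redirects to B, B to C, and C back to A.
cycle = {"A": "B", "B": "C", "C": "A"}
print(resolve_redirect("A", cycle))  # None
```

In the chain case, the URL of Web page A would be indexed with the content of Web page C; in the cycle case, none of the three pages would be indexed.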
Thus, there is a need for improved redirect processing.