1. Field of the Invention
The present invention is related to handling anchor text for information retrieval.
2. Description of the Related Art
The World Wide Web (also known as WWW or the “Web”) is a collection of Internet servers that support Web pages, which may include links to other Web pages. A Uniform Resource Locator (URL) indicates the location of a Web page. Each Web page may contain, for example, text, graphics, audio, and/or video content. A first Web page may also contain a link to a second Web page. When the link is selected in the first Web page, the second Web page is typically displayed.
A Web browser is a software application that is used to locate and display Web pages. Currently, there are billions of Web pages on the Web.
Web search engines are used to retrieve Web pages on the Web based on some criteria (e.g., entered via the Web browser). That is, Web search engines are designed to return relevant Web pages given a keyword query. For example, the query “HR” issued against a company intranet search engine is expected to return relevant pages in the intranet that are related to Human Resources (HR). The Web search engine uses indexing techniques that relate search terms (e.g., keywords) to Web pages.
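The relationship between search terms and Web pages described above is commonly realized with an inverted index. The following is a minimal sketch only, assuming whitespace tokenization and a small in-memory collection; the names `build_index` and `search` and the example URLs are hypothetical and are not taken from any particular search engine.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercase term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every term of the keyword query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the pages matching the first term, then intersect.
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical intranet pages, used only for illustration.
pages = {
    "http://intranet.example/hr": "HR benefits and payroll",
    "http://intranet.example/it": "IT help desk",
}
index = build_index(pages)
hr_pages = search(index, "HR")
```

In this sketch, the query “HR” returns only the page whose text contains that term. A production search engine would typically add ranking, stemming, and persistent storage, none of which is shown here.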
An anchor may be described as a link or path to a document (e.g., a URL). Anchor text may be described as text associated with a path or link (e.g., a URL) that points to a document. For example, anchor text may be text that labels or encloses hypertext links in Web documents. Anchor text is collected by Web search engines and is associated with target documents. Also, the anchor text and target documents are indexed together.
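The collection of anchor text and its association with target documents can be illustrated as follows. This is a sketch under stated assumptions: it uses the Python standard library `html.parser` module, treats the raw `href` value as the target identifier without resolving relative URLs, and the names `AnchorCollector` and `collect_anchor_text` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from the <a> elements of one page."""

    def __init__(self):
        super().__init__()
        self.anchors = []   # list of (href, anchor_text) tuples
        self._href = None   # href of the <a> element currently open, if any
        self._parts = []    # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._parts).split())
            self.anchors.append((self._href, text))
            self._href = None


def collect_anchor_text(source_pages):
    """Group anchor text from several source pages by target URL."""
    by_target = defaultdict(list)
    for html in source_pages:
        parser = AnchorCollector()
        parser.feed(html)
        parser.close()
        for href, text in parser.anchors:
            by_target[href].append(text)
    return dict(by_target)
```

Given several source pages that link to the same target, `collect_anchor_text` returns, for each target URL, the list of anchor texts pointing to it. That list could then be indexed together with the target document.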
Web search engines use context information (e.g., title, summary, language, etc.) to enrich search results. This provides a user with screened search results. Anchor text, however, may not be relevant for use as context information. For example, anchor text may be in a different language than the target document, and use of the anchor text without further processing may result in, for example, a Japanese title for an English document. Moreover, anchor text may not be related to the content of the document. For instance, anchor text may contain common words (e.g., “Click here”) that occur often and are used primarily for navigation, but which do not have any meaningful value as a title. Also, anchor text may be inaccurate or impolite, or may contain slang (e.g., an anchor to a “Network Security Guide” has anchor text “Looking for Trouble?”).
Moreover, generation of context information is especially difficult when the contents of a Web page cannot be retrieved (e.g., due to server outage, incompleteness of the retrieval of Web pages for processing by the search engine, or robots.txt prohibiting access) or when a document is retrieved but cannot be analyzed (e.g., because the file is a video/audio/multimedia file, is in an unknown or unsupported format, is ill-formed, or is password protected).
Most search engines display only a Uniform Resource Locator (URL) in the absence of content of a Web page. That, however, makes it hard for the user to judge the usefulness of a search result without looking at the Web page itself.
Thus, there is a need for improved document processing to provide context information for documents, such as Web pages.