1. Field
Embodiments of the invention relate to global anchor text processing.
2. Description of the Related Art
The World Wide Web (also known as WWW or the “Web”) is a collection of Internet servers that support Web pages, which may include links to other Web pages. A Uniform Resource Locator (URL) indicates a location of a Web page. Each Web page may contain, for example, text, graphics, audio, and/or video content. For example, a first Web page may contain a link to a second Web page. When the link is selected in the first Web page, the second Web page is typically displayed.
A Web browser is a software application that is used to locate and display Web pages. Currently, there are billions of Web pages on the Web.
Web search engines are used to retrieve Web pages on the Web based on some criteria (e.g., entered via the Web browser). That is, Web search engines are designed to return relevant Web pages given a keyword query. For example, the query “HR” issued against a company intranet search engine is expected to return relevant pages in the intranet that are related to Human Resources (HR). The Web search engine uses indexing techniques that relate search terms (e.g., keywords) to Web pages.
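By way of illustration only, one common indexing technique that relates search terms to Web pages is an inverted index. The following minimal sketch assumes whitespace tokenization and case-insensitive matching. The function name `build_index` and the example URLs are hypothetical and are not taken from any particular search engine.

```python
from collections import defaultdict


def build_index(pages):
    """Map each lowercased term to the set of URLs whose text contains it.

    `pages` is assumed to be a dict of URL -> page text (hypothetical input shape).
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


# Hypothetical intranet pages used only for illustration.
pages = {
    "http://intranet.example/hr": "HR benefits and payroll",
    "http://intranet.example/it": "IT help desk",
}
index = build_index(pages)
hr_results = index["hr"]  # URLs of pages whose text contains the term "hr"
```

In this sketch, a keyword query such as “HR” is lowercased and looked up directly in the index to obtain the matching Web pages.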
An anchor may be described as a link or path to a document (e.g., a URL). Anchor text may be described as text associated with a path or link (e.g., a URL) that points to a document. For example, anchor text may be text that labels or encloses hypertext links in Web documents. Anchor text is collected by Web search engines and is associated with target documents. Also, the anchor text and target documents are indexed together.
Anchor text may also be described as content found in one HyperText Markup Language (HTML) document (the “referring” document) that annotates a link to another document (the “target” document). Anchor text is contained lexically inside an anchor tag (<A> . . . </A>). Anchor text can improve search quality because it encodes a human editor's judgment about the area of relevance of the target document. To make anchor text searchable, though, the anchor text has to be indexed as if the anchor text were part of the target document's content, even though the anchor text actually enters the search system as part of other, referring documents.
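As one possible illustration of collecting anchor text from a referring document, the following sketch uses the `html.parser` module of the Python standard library to record each link target together with the text enclosed by its anchor tag. The class name `AnchorCollector` and the function name `extract_anchors` are hypothetical, and the sketch assumes that anchor tags are not nested.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (target URL, anchor text) pairs from one referring document."""

    def __init__(self):
        super().__init__()
        self.anchors = []    # list of (href, anchor text) pairs
        self._href = None    # href of the anchor currently open, if any
        self._parts = []     # text fragments seen inside the open anchor

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._parts = []

    def handle_data(self, data):
        if self._href is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._parts).strip()))
            self._href = None


def extract_anchors(html):
    """Return the (target URL, anchor text) pairs found in `html`."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return collector.anchors
```

Each returned pair names the target document first, which reflects the point made above: the anchor text is found in the referring document but is to be indexed with the target document.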
When Web search engines process documents in a corpus (e.g., retrieve and index documents), it is not possible to keep all the documents in memory until all cross links are known. Thus, the traditional solution is to catalog document content and anchor text separately, and then run an offline global-integration process to combine anchor text and document content for indexing.
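The traditional two-phase approach described above may be sketched as follows. This is a simplified, in-memory illustration only; a real system would keep both stores on disk because the corpus does not fit in memory. The function names `catalog` and `integrate` and the shape of the `crawled` input are assumptions made for the example.

```python
from collections import defaultdict


def catalog(crawled):
    """Phase 1: record document content and anchor text in separate stores.

    `crawled` is assumed to map URL -> (text, links), where `links` is a
    list of (target URL, anchor text) pairs found in that document.
    """
    content_store = {}
    anchor_store = defaultdict(list)  # target URL -> anchor texts from referring documents
    for url, (text, links) in crawled.items():
        content_store[url] = text
        for target, anchor_text in links:
            anchor_store[target].append(anchor_text)
    return content_store, anchor_store


def integrate(content_store, anchor_store):
    """Phase 2 (offline): combine each document's content with the anchor text that points to it."""
    combined = {}
    for url, text in content_store.items():
        combined[url] = " ".join([text] + anchor_store.get(url, []))
    return combined


# Hypothetical two-document corpus: "a" links to "b" with anchor text "Human Resources".
crawled = {
    "a": ("home page", [("b", "Human Resources")]),
    "b": ("benefits and payroll", []),
}
content_store, anchor_store = catalog(crawled)
combined = integrate(content_store, anchor_store)
```

In this sketch, the combined text for document "b" contains the phrase “Human Resources” only after `integrate` has run, even though that phrase entered the system as part of document "a".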
If integration is postponed until all of the corpus's content has been crawled (i.e., retrieved), then all the anchor text is available, and the combined index only needs to be constructed once. But if this is done, the content-only index cannot be created until the whole corpus has been crawled. Alternatively, the content-only index can be written and made available, but no anchor-text searching will be possible until after the integration phase, which requires updating the index.
Thus, there is a need in the art for improved global anchor text processing.