1. Field of the Invention
The present invention relates to a pipelined architecture for global analysis and index building.
2. Description of the Related Art
The World Wide Web (also known as WWW or the “Web”) is a collection of Internet servers that support Web pages. A Uniform Resource Locator (URL) indicates the location of a Web page. Each Web page may contain, for example, text, graphics, audio, and/or video content, and may include links to other Web pages. For example, a first Web page may contain a link to a second Web page. When the link is selected in the first Web page, the second Web page is typically displayed.
A Web browser is a software application that is used to locate and display Web pages. Currently, there are billions of Web pages on the Web.
Web search engines are used to retrieve Web pages based on search criteria (e.g., criteria entered via the Web browser). That is, Web search engines are designed to return relevant Web pages given a keyword query. For example, the query “HR” issued against a company intranet search engine is expected to return relevant pages in the intranet that are related to Human Resources (HR). The Web search engine uses indexing techniques that relate search terms (e.g., keywords) to Web pages.
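By way of illustration only, one common indexing technique that relates search terms to Web pages is an inverted index. The following is a minimal sketch, assuming whitespace tokenization and exact, case-insensitive term matching. The function names, the sample URLs, and the page text are hypothetical and are not taken from any particular search engine.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each lowercase term to the set of page URLs that contain it.

    `pages` is a dict of URL -> page text (a hypothetical input format).
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.lower().split():
            index[term].add(url)
    return index


def search(index, query):
    """Return, in sorted order, the URLs that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical intranet pages used only for this example.
pages = {
    "http://intranet.example/hr": "HR benefits and payroll",
    "http://intranet.example/it": "IT help desk",
    "http://intranet.example/jobs": "open positions posted by HR",
}
index = build_inverted_index(pages)
print(search(index, "HR"))
```

In this sketch, the query “HR” returns the two pages whose text contains that term. A practical search engine would additionally rank the matching pages rather than return them in URL order.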
Global analysis computations may be described as extracting properties from a global view of documents in a corpus (e.g., documents available on the Web). One example of a global analysis computation is the page rank computation. A page rank computation takes as input a directed graph in which every document in the corpus is a node and every hyperlink between documents is an edge. Then, the page rank computation produces as output a global rank for each document in the corpus. Other examples of global analysis computations are duplicate detection (i.e., the identification of pages with similar or the same content) and template detection (i.e., identification of which parts of a Web page are part of a site template).
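The page rank computation described above may be sketched as follows. This is an illustrative power-iteration formulation only. The damping factor of 0.85, the fixed iteration count, the even redistribution of rank from documents without outgoing links, and the sample graph are assumptions made for this example and are not requirements of any particular search engine.

```python
def page_rank(graph, damping=0.85, iterations=50):
    """Compute a global rank for every node of a directed graph.

    `graph` maps each document to the list of documents it links to.
    Every link target must also appear as a key of `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by documents with no outgoing links is spread evenly
        # over all documents (an assumption of this sketch).
        dangling = sum(rank[node] for node in nodes if not graph[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            targets = graph[node]
            for target in targets:
                # Each document shares its rank equally among its links.
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank


# Hypothetical four-document corpus: each key links to the listed documents.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = page_rank(graph)
for doc in sorted(ranks, key=ranks.get, reverse=True):
    print(doc, round(ranks[doc], 3))
```

Each iteration visits every node and every edge of the graph, so the running time of this sketch grows with the size of the corpus.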
Search engines that use global analysis computations typically need to have the output of these computations ready before indexing the corpus. For instance, rank values computed by page rank may be used to determine the order of documents in the index, and the results of the duplicate detection computation may be used to determine which documents should not be indexed. Having to perform all global analysis computations before the creation of the search indices is a problem in scenarios where freshness requirements impose constraints on the time allowed for index creation. In general, global computations are costly, since their computational time is proportional to the number of documents in the corpus, which in the case of the Web or some textual and biological databases is very large.
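The dependency described above, in which indexing waits on the outputs of the global analysis computations, may be illustrated with the following minimal sketch. It assumes that duplicates are detected only by hashing whitespace-normalized, lowercased text, which finds documents with the same content but not merely similar content. The function names, document identifiers, and rank values are hypothetical.

```python
import hashlib


def find_duplicates(docs):
    """Return the IDs of documents whose normalized text repeats an earlier one.

    Documents are visited in sorted ID order, so the first ID seen is kept.
    """
    seen = set()
    duplicates = set()
    for doc_id in sorted(docs):
        normalized = " ".join(docs[doc_id].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.add(doc_id)
        else:
            seen.add(digest)
    return duplicates


def indexing_order(docs, ranks):
    """Drop duplicate documents, then order the rest by descending rank."""
    duplicates = find_duplicates(docs)
    kept = [doc_id for doc_id in docs if doc_id not in duplicates]
    return sorted(kept, key=lambda doc_id: (-ranks[doc_id], doc_id))


# Hypothetical corpus and precomputed rank values.
docs = {
    "a": "Annual report",
    "b": "annual   REPORT",
    "c": "Holiday schedule",
}
doc_ranks = {"a": 0.2, "b": 0.5, "c": 0.3}
print(indexing_order(docs, doc_ranks))
```

In this sketch, neither the filtering step nor the ordering step can begin until both the duplicate set and the rank values are available for the entire corpus.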
Additionally, conventional index structures designed for large-scale search engines are not well tuned for incremental updates. Consequently, incrementally updating an index is expensive in conventional systems.
Thus, there is a need for improved global analysis and index building.