1. Field of the Invention
The present invention is related generally to an architecture for an indexer and more particularly to an architecture for an indexer that indexes tokens that may be variable-length, where variable-length data may be attached to at least one token occurrence.
2. Description of the Related Art
The World Wide Web (also known as WWW or the “Web”) is a collection of Internet servers that support Web pages. A Uniform Resource Locator (URL) indicates a location of a Web page. Each Web page may contain, for example, text, graphics, audio, and/or video content, and may include links to other Web pages. For example, a first Web page may contain a link to a second Web page.
A Web browser is a software application that is used to locate and display Web pages. Currently, there are billions of Web pages on the Web.
Web search engines are used to retrieve Web pages on the Web based on some criteria (e.g., entered via the Web browser). That is, Web search engines are designed to return relevant Web pages given a keyword query. For example, the query “HR” issued against a company intranet search engine is expected to return relevant pages in the intranet that are related to Human Resources (HR). The Web search engine uses indexing techniques that relate search terms (e.g., keywords) to Web pages.
An important problem today is searching large data sets, such as the Web, large collections of text, genomic information, and databases. The underlying operation needed for searching is the quick and efficient creation of large indices of tokens. These indices, also called inverted files, contain a mapping of tokens, which may be terms in text or more abstract objects, to their locations, where a location may be a document, a page number, or some other more abstract notion of location. The indexing problem is well known, and nearly all solutions are based on sorting all tokens in the data set. However, many conventional sorting techniques are inefficient.
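By way of illustration only, a sort-based construction of an inverted file may be sketched as follows. The sketch is not the technique of the present invention; it assumes that tokens are lowercase whitespace-delimited terms and that a location is a document identifier. The function name `build_inverted_index` and the sample documents are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each token to the sorted list of document ids in which it occurs.

    Assumption: `documents` is a dict of {document id: text}, and a token
    is a lowercase, whitespace-delimited term.
    """
    # Emit one (token, location) pair per token occurrence.
    pairs = []
    for doc_id, text in documents.items():
        for token in text.lower().split():
            pairs.append((token, doc_id))

    # Sort all pairs so that occurrences of the same token become adjacent.
    pairs.sort()

    # Group adjacent pairs into posting lists, skipping repeated locations.
    index = defaultdict(list)
    for token, doc_id in pairs:
        postings = index[token]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return dict(index)


# Hypothetical sample data, used only to exercise the sketch.
documents = {
    1: "human resources policy",
    2: "resources for new hires",
    3: "policy for human resources",
}
index = build_inverted_index(documents)
print(index["resources"])  # [1, 2, 3]
print(index["policy"])     # [1, 3]
```

In this sketch, the sort over every token occurrence is the most expensive step, and its cost grows with the size of the data set.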
Thus, there is a need for improved indexing techniques.