Much of the information we use today is stored on computers or other devices such as PDAs that use computer technology. E-mail messages, word processing documents, and digital photographs are just a few common types of content items that are available to users. Because of the large sizes of these collections of content items, it can be difficult for users to find the information for which they are looking. A search engine is a software system used to facilitate the process of finding information in a collection of content items. Users express their information need in the form of a query (which typically comprises one or more query terms), and the search engine returns a list of content items that match the query in some manner.
Some search engines work by scanning each content item at the time that the user issues a given query, looking for the pattern of text described by the query. Many tools that search for files on a user's local machine use this type of search, often called a “direct file” or “stream” search. For example, the search utility in Microsoft Windows XP and the grep command that is part of the set of utilities provided with UNIX-based operating systems such as Linux both use direct file search. As the number of content items increases, however, directly searching through content items becomes too slow. Instead, many search engines actually search in a concise representation of the contents of one or more content items called an “inverted index,” or simply an “index.”
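The direct file search described above can be illustrated with a minimal sketch. The function name `direct_search` and the in-memory `docs` dictionary are hypothetical stand-ins for files on disk; the point is only that every content item is scanned at query time, so the cost grows with the size of the collection.

```python
import re

def direct_search(documents, pattern):
    """Scan every document at query time and return the ids of those that match."""
    regex = re.compile(pattern, re.IGNORECASE)
    # Each query touches the full text of every content item.
    return [doc_id for doc_id, text in documents.items() if regex.search(text)]

# Hypothetical collection standing in for files on a local machine.
docs = {
    "a.txt": "Search engines build an inverted index.",
    "b.txt": "The grep command scans files directly.",
}
matches = direct_search(docs, r"grep")
```

Here `matches` holds the identifiers of the matching items. No work is saved between queries, which is the weakness an index addresses.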
In order to create an inverted index, a given content item, such as an HTML document, is first broken into a list of words, a process known as tokenization. The complexity of the tokenization problem depends on the language in which the content item was written. For example, tokenizing Chinese text is more difficult than tokenizing English text, since word boundaries in Chinese text are not marked with spaces. After tokenization, words may be normalized to a standard form. For example, plural endings and other suffixes may be removed, a process known as “stemming” or “morphological analysis.” Again, this process may be more complex for highly inflected languages. In addition, certain very common words known as “stop words” may be omitted. Finally, each occurrence of each word is recorded in the inverted index. The entire process of transforming the content item from its original form into a set of entries in an inverted index is known as “indexing.”
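The tokenization, normalization, and stop word steps can be sketched as follows for English text. This is an illustration under stated assumptions, not a production analyzer: the stop word list is a small hypothetical sample, and `naive_stem` is a toy suffix stripper rather than a real stemming algorithm.

```python
import re

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are"}

def tokenize(text):
    """Split English text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def naive_stem(word):
    """Toy normalization: strip a trailing 'ing' or 's' if at least 3 letters remain."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def analyze(text):
    """Tokenize, drop stop words, then normalize each remaining token."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

terms = analyze("The engines are indexing the pages of a site")
```

The resulting `terms` list is what would be recorded in the index. A language without space-delimited words would need a different `tokenize` step entirely.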
The inverted index is a data structure consisting of a table of lists. Each entry in the table is accessed by a unique word, and each item in the list for a given word indicates a content item in which that word occurred. These items are called “postings,” and the lists are called “posting lists.” A posting contains an identifier for the content item containing the word, and may also include additional information about how often or where the word appeared in the content item.
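One possible in-memory layout for the table of posting lists is sketched below. The `Posting` class and `build_index` function are hypothetical names, and splitting on whitespace is a simplifying assumption. Each posting stores a content item identifier plus the word positions, from which the frequency of occurrence follows.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Posting:
    doc_id: int                                     # identifier of the content item
    positions: list = field(default_factory=list)   # where the word occurred

    @property
    def frequency(self):
        # How often the word appeared in the content item.
        return len(self.positions)

def build_index(documents):
    """Map each unique word to its posting list, ordered by doc_id."""
    index = defaultdict(list)
    for doc_id, text in sorted(documents.items()):
        positions_by_word = {}
        for position, word in enumerate(text.lower().split()):
            positions_by_word.setdefault(word, []).append(position)
        for word, positions in positions_by_word.items():
            index[word].append(Posting(doc_id, positions))
    return index

docs = {1: "web search engine", 2: "search the web and search again"}
index = build_index(docs)
```

Looking up `index["search"]` returns one posting per content item containing the word, so a query never has to rescan the original text.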
When a user provides a query to a search engine that employs an inverted index, the system breaks the query into words in much the same way that the system processes content items. The system then looks in the table to find the posting list for each word. Each posting list represents the set of content items containing the word. If the user's query is interpreted as a Boolean AND, then the intersection of the sets for each word is computed. If it is interpreted as a Boolean OR, then the union of the sets is computed. In most search engines, a relevance score is computed for each candidate content item in the result set, and only the top-scoring candidates are retrieved. A variety of factors may determine the relevance score, including the frequency of occurrence of the query words, their statistical distinctiveness, and properties of the content item such as its modification date.
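Query evaluation against an inverted index can be sketched as set operations followed by scoring. The tiny `INDEX` below (word to content item id to term frequency) is hypothetical data, and the scoring formula, term frequency weighted by a logarithmic rarity factor, is one common illustrative choice rather than the ranking function of any particular search engine.

```python
import math

# Hypothetical index: word -> {doc_id: term frequency}.
INDEX = {
    "web": {1: 1, 2: 1, 3: 2},
    "search": {1: 1, 2: 2},
    "engine": {1: 1},
}
NUM_DOCS = 3

def candidates(terms, mode="AND"):
    """Intersect (AND) or union (OR) the sets of content items for each query word."""
    sets = [set(INDEX.get(t, {})) for t in terms]
    if not sets:
        return set()
    return set.intersection(*sets) if mode == "AND" else set.union(*sets)

def score(doc_id, terms):
    """Sum term frequency times a rarity weight for each matching query word."""
    total = 0.0
    for t in terms:
        postings = INDEX.get(t, {})
        if doc_id in postings:
            rarity = math.log(1 + NUM_DOCS / len(postings))  # rarer words weigh more
            total += postings[doc_id] * rarity
    return total

def search(query, mode="AND", top_k=2):
    """Return the top_k candidates, highest score first (ties broken by doc_id)."""
    terms = query.lower().split()
    ranked = sorted(candidates(terms, mode), key=lambda d: (-score(d, terms), d))
    return ranked[:top_k]
```

For instance, `search("web search")` considers only the items containing both words, whereas `search("web search", mode="OR", top_k=3)` also admits items containing either one. Additional factors such as modification date could be folded into `score` in the same way.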
The Internet is a worldwide network of interconnected networks, which includes both public and private local and wide area networks of computers. One of the most commonly used Internet services is the World Wide Web (“WWW”), which consists of billions of content items, known as web pages, interconnected by hypertext links that allow users to navigate from a “source” page (the page containing the link) to a “target” page (the page pointed to by the link). Each page on the Web has a unique address known as a Uniform Resource Locator (“URL”). Hypertext links on the Web contain two pieces of information: the URL of the target page, and a short piece of text, known as “anchor text,” that describes the target page.
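The two pieces of information carried by a hypertext link can be extracted with the standard library HTML parser, as in the sketch below. The `LinkExtractor` class name and the sample page are hypothetical, and nested or malformed links are not handled.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (target URL, anchor text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # target URL of the link currently open, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical source page containing one link.
page = '<p>See the <a href="http://example.com/docs">project documentation</a> for details.</p>'
parser = LinkExtractor()
parser.feed(page)
parser.close()
```

After parsing, `parser.links` holds one pair per link on the source page: the URL of the target page and the text that describes it.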
Due to the large scale of the Web and the unique nature of the interlinked pages, web search engines typically employ more complex relevance ranking functions. In addition to the ranking features used in traditional search engines, web search engines also rely on information based on the connectivity of the page, such as the number of pages linking to it, in determining the relevance score of a search result.
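A connectivity signal such as the number of pages linking to a page can be combined with a text-based score, as in the following sketch. The additive formula, the logarithmic damping, and the `link_weight` parameter are assumptions chosen for illustration; real web ranking functions are considerably more elaborate.

```python
import math
from collections import Counter

def inlink_counts(links):
    """Count distinct source pages linking to each target, ignoring self-links."""
    return Counter(target for source, target in set(links) if source != target)

def combined_score(text_score, inlinks, link_weight=0.5):
    """Hypothetical blend: text relevance plus a damped bonus for incoming links."""
    return text_score + link_weight * math.log1p(inlinks)

# Hypothetical link graph as (source page, target page) pairs.
links = [("a", "c"), ("b", "c"), ("a", "b"), ("a", "c"), ("c", "c")]
counts = inlink_counts(links)

# Two pages with equal text scores are separated by connectivity alone.
text_scores = {"b": 1.0, "c": 1.0}
ranked = sorted(text_scores, key=lambda p: -combined_score(text_scores[p], counts[p]))
```

Duplicate links and self-links are discarded before counting so that a page cannot raise its own score simply by repeating a link.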
Unfortunately, existing indexes used by search engines may not capture the precise wording of a user's query, which raises questions about the relevance of the content items in a result set. In addition, users are increasingly presented with disinformation when attempting to locate content items on the Internet. Because shortcomings in existing search algorithms can be exploited, users are confronted with issues of trust regarding the content items that they locate on the Internet, including the content contained within such content items.
Therefore, new sources of information on which to base searches, as well as methods of using the same, are needed. Furthermore, new sources of information on which to base the ranking of content items in a result set are needed, as well as techniques of using the same, which may be used alone or in conjunction with existing searching and ranking techniques known in the art. Additional sources of information provide new ways to index and rank content items and the content contained therein, leading to more reliable search results for users. Furthermore, new techniques for indexing this information are needed.