Existing internet search engines rely on preprocessing webpage information before a user-specified web search is performed. During preprocessing, nearly the entire content of the WWW is crawled by a ‘spider’ module (web crawler), which logs and retrieves webpages, while an indexer module analyzes the word and syntactic content of each webpage in order to index and store that content in various datasets for rapid access during a user query. Words occurring in a webpage can be represented as word_IDs (word identifiers) which can be linked (using a lexicon hash table, for example) to doc_IDs (document identifiers) that represent the webpage documents in which those words occur. The doc_IDs may be stored in a doclist index containing additional information which identifies the total number of occurrences of a word within a webpage and the context of each occurrence. The web search engine can then retrieve and rank webpages in part by matching user-queried keywords to the respective word_IDs and following pointers (i.e., links) into the doclist index, which contains word hitlists providing the number and context of occurrences of each keyword within each webpage document that is a hit for (i.e., contains) that keyword. The higher the number of occurrences and the more significant the context of each occurrence of a keyword in a webpage, the higher the relevancy score computed for the webpage, which can be referred to as an Information Retrieval (IR) score. Also, webpages that contain hits for a greater number of the user's query keywords receive a higher IR score than those that hit on fewer keywords. While the term webpage is used, the above and following concepts apply more broadly to web items that may not be webpages, such as indexes, data files and other documents. The term ‘web items’ refers to data contents of the internet and WWW.
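The relationship between the lexicon, word_IDs, doc_IDs and hitlists described above can be illustrated with a minimal sketch. The class name TinyIndex, its methods, the whitespace tokenization and the tie-breaking rule used in search are all assumptions made for illustration; they are not taken from any particular search engine.

```python
from collections import defaultdict


class TinyIndex:
    """Toy lexicon plus doclist index; every name here is illustrative."""

    def __init__(self):
        # Lexicon: word -> word_ID.
        self.lexicon = {}
        # Doclist index: word_ID -> doc_ID -> hitlist of (context, position).
        self.doclist = defaultdict(lambda: defaultdict(list))

    def word_id(self, word):
        # Assign the next integer ID the first time a word is seen.
        return self.lexicon.setdefault(word, len(self.lexicon))

    def add_document(self, doc_id, fields):
        # fields maps a context name ("title", "body", ...) to its text.
        for context, text in fields.items():
            for position, word in enumerate(text.lower().split()):
                hitlist = self.doclist[self.word_id(word)][doc_id]
                hitlist.append((context, position))

    def hits(self, keyword):
        # Return {doc_ID: hitlist} for one keyword, or {} if it is unknown.
        wid = self.lexicon.get(keyword.lower())
        if wid is None:
            return {}
        return dict(self.doclist[wid])

    def search(self, query):
        # Per document: which distinct keywords hit, and the total hit count.
        matched = defaultdict(set)
        total = defaultdict(int)
        for keyword in set(query.lower().split()):
            for doc_id, hitlist in self.hits(keyword).items():
                matched[doc_id].add(keyword)
                total[doc_id] += len(hitlist)
        # Assumed ordering: more distinct keywords first, then more hits.
        return sorted(
            matched,
            key=lambda d: (len(matched[d]), total[d]),
            reverse=True,
        )
```

In this sketch a query is answered entirely from the prebuilt structures: each keyword is mapped to its word_ID, the word_ID leads to the documents that contain it, and each hitlist supplies the number and context of occurrences that a scoring step could then weigh.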
One prominent internet search engine design can store a lexicon dataset representing millions of words using word_IDs and a hash table of pointers indicating which webpage documents each of the words occurs in. The search engine has access to forward index and inverted index datasets, which record the total number of occurrences of each of the words in the respective webpages, as well as hitlist datasets, which contain context information indicating the type of word occurrence in addition to the number of hits. The type of occurrence includes information such as whether the word occurs in the URL, title, body, or anchor hypertext of a particular webpage, as well as the position, font style, and relative font size of each occurrence of the word on the webpage. These context attributes are incorporated into a computation of a type-weight for each occurrence of a word. The type-weights make up a vector that is indexed by type. Also, the search engine counts the number of hits (i.e., number of occurrences) of each type in the hitlist and then converts every count into a count-weight. Count-weights increase linearly with counts at first but quickly taper off, so that beyond a certain point increasing counts no longer contribute to the count-weight. The IR score for the document is computed as the dot product between the vector of count-weights and the vector of type-weights.
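The dot-product computation described above can be sketched as follows. The particular type-weight values, the cap of 5, and the use of min(count, cap) as the tapering function are assumptions chosen only to make the example concrete; the description above does not specify them.

```python
from collections import Counter

# Occurrence types named in the text; the order fixes the vector layout.
TYPES = ("url", "title", "anchor", "body")

# Assumed type-weights (illustrative values only).
TYPE_WEIGHTS = {"url": 4.0, "title": 3.0, "anchor": 2.0, "body": 1.0}

# Assumed saturation point for count-weights.
COUNT_CAP = 5


def count_weight(count, cap=COUNT_CAP):
    # Linear in the count at first, then flat once the cap is reached.
    return float(min(count, cap))


def ir_score(hit_types):
    """Score one document for one keyword from the types of its hits."""
    counts = Counter(hit_types)
    count_vec = [count_weight(counts[t]) for t in TYPES]
    type_vec = [TYPE_WEIGHTS[t] for t in TYPES]
    # Dot product of the count-weight vector and the type-weight vector.
    return sum(c * w for c, w in zip(count_vec, type_vec))
```

Under these assumed values, one title hit and two body hits score 1 x 3.0 + 2 x 1.0 = 5.0, and fifty body hits score the same as five, which reflects the tapering behavior of the count-weights.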
In addition to an IR score, the above search engine can compute a page ranking score using an algorithm which evaluates the quantity and quality of inbound hyperlinks of each webpage. The higher the quality and quantity of the inbound hyperlinks pointing to a webpage, the higher the page ranking score will be for that webpage. The search engine combines the hyperlink-based page ranking score with the IR score to derive a final rank for a webpage which determines whether that webpage will be listed in the Search Engine Results Page (SERP), and where in the listing it will appear based on its rank relative to other webpages listed in the SERP.
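One way to picture a hyperlink-based score and its combination with the IR score is the sketch below. It uses a PageRank-style power iteration as a stand-in for the link-analysis algorithm, with an assumed damping factor of 0.85, and it combines the two scores with a hypothetical weighted sum controlled by alpha. The text above does not state how the scores are actually combined, so final_rank in particular is an assumption.

```python
def link_scores(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    score = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split across its outbound links,
                # so links from high-scoring pages are worth more.
                share = damping * score[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Page with no outbound links: spread its score evenly.
                for t in pages:
                    new[t] += damping * score[page] / n
        score = new
    return score


def final_rank(ir_scores, link, alpha=0.5):
    """Hypothetical combination: weighted sum of the normalized scores."""
    if not ir_scores:
        return []
    max_ir = max(ir_scores.values()) or 1.0
    max_link = max(link.values(), default=0.0) or 1.0
    combined = {
        doc: alpha * ir_scores[doc] / max_ir
        + (1.0 - alpha) * link.get(doc, 0.0) / max_link
        for doc in ir_scores
    }
    # Highest combined score first, as in a results listing.
    return sorted(combined, key=combined.get, reverse=True)
```

In this sketch a page's link score depends on both how many pages link to it and how highly those pages score themselves, which mirrors the quantity-and-quality criterion described above. The order returned by final_rank corresponds to the position a webpage would take in the listing.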