Many search engine services, such as Google and Overture, provide for searching information that is accessible via the Internet. These search engine services allow users to search for display pages, such as web pages, that may be of interest to users. After a user submits a search request (i.e., a query) that includes search terms (i.e., query terms), the search engine service identifies web pages that may be related to those search terms. To quickly identify related web pages, the search engine services may maintain a mapping of keywords to web pages. This mapping may be generated by “crawling” the web (i.e., the World Wide Web) to identify the keywords of each web page. To crawl the web, a search engine service may use a list of root web pages to identify all web pages that are accessible through those root web pages. The keywords of any particular web page can be identified using various well-known information retrieval techniques, such as identifying the words of a headline, the words supplied in the metadata of the web page, the words that are highlighted, and so on. The search engine service may generate a relevance score to indicate how relevant the information of the web page may be to the search request based on the closeness of each match, web page importance or popularity, and so on. The search engine service then displays to the user links to those web pages in an order that is based on a ranking determined by their relevance.
Search engines can more generally be used to search a corpus of documents, with a web page being one type of document in the corpus. The other types of documents may include articles published in journals, dissertations, technical reports, patents, and so on. With such a corpus, it may be desirable to present the documents ranked based on their relevance to the query. One common technique for ranking the relevance of a document to a query is based on term frequency and inverse document frequency. Term frequency refers to the number of occurrences of a query term within a document, and inverse document frequency refers to the inverse of the number of documents that contain that query term. Generally, a document with more occurrences of a query term tends to be more relevant, and a query term that occurs in fewer documents is a more important term. One approach for combining term frequency and inverse document frequency into a relevance score for a document is given by the following equation:
$$\sum_{t \in Q} w^{(1)} \cdot \frac{(k_1 + 1) \cdot \mathit{tf}}{K + \mathit{tf}} \tag{1}$$
where $t$ is a query term of query $Q$, $\mathit{tf}$ is the term frequency of $t$ within the document, $k_1$ is a constant, and $K$ and $w^{(1)}$ are defined by the following equations. $K$ is represented by the following equation:
$$K = k_1 \cdot \left[ (1 - b) + b \cdot \frac{l}{\mathit{avdl}} \right] \tag{2}$$
where $l$ is the document length, $\mathit{avdl}$ is the average document length in the corpus, and $b$ is a constant. $w^{(1)}$ is a Robertson/Sparck Jones weight represented by the following equation:
$$w^{(1)} = \log \frac{N - n + 0.5}{n + 0.5} \tag{3}$$
where $N$ is the number of documents within the corpus and $n$ is the number of documents containing the query term $t$ within the corpus. Equation 3 is based on inverse document frequency. Thus, the score of relevance given by Equation 1 is based on term frequency, inverse document frequency, and document length.
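The computation described by Equations 1 through 3 can be illustrated with a minimal Python sketch. The function names, the representation of a document as a list of terms, and the default values k1 = 1.2 and b = 0.75 are assumptions made for illustration only; the text states only that k1 and b are constants.

```python
import math


def rsj_weight(num_docs, doc_freq):
    """Equation 3: Robertson/Sparck Jones weight for one query term."""
    return math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def length_norm(doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Equation 2: K, which scales k1 by the relative document length."""
    return k1 * ((1 - b) + b * doc_len / avg_doc_len)


def unigram_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
                  k1=1.2, b=0.75):
    """Equation 1: sum of per-term scores over the query terms.

    doc_freqs maps a term to the number of corpus documents containing it
    (a hypothetical input; the text does not say how it is stored).
    """
    K = length_norm(len(doc_terms), avg_doc_len, k1, b)
    score = 0.0
    for t in query_terms:
        tf = doc_terms.count(t)  # term frequency of t within the document
        w = rsj_weight(num_docs, doc_freqs.get(t, 0))
        score += w * (k1 + 1) * tf / (K + tf)
    return score
```

A query term that does not occur in the document has tf = 0 and therefore contributes nothing to the sum, which matches the form of Equation 1.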
The relevance of Equation 1 considers each query term independently. It is well known that the proximity of one query term to another query term affects relevance. For example, if the query is “home buying,” then a document that contains the phrase “home buying” may be more relevant than a document that contains the words “home” and “buying” separated by 100 words. One approach for factoring the proximity of query terms into relevance uses relevance derived from “adjacent” pairs of query terms. Query terms are considered adjacent when the only intervening terms are non-query terms. For example, if the document contains the phrase “at the home page, you can select the buying option for tips” and the query is “home buying tips,” then “home” and “buying” are adjacent query terms that are separated by five non-query terms, a distance of five. However, “home” and “tips” are not adjacent, because the query term “buying” is between them. The relevance of adjacent pairs of query terms is represented by the following equation:
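The notion of adjacency described above can be sketched in Python as follows. This is an illustrative sketch only: the function name, the simple lowercase tokenization, and the choice to skip consecutive occurrences of the same query term are assumptions not stated in the text. The distance is counted as the number of intervening non-query terms, following the example in the text.

```python
import re


def adjacent_query_pairs(doc_terms, query_terms):
    """Return (first_term, second_term, distance) for adjacent query terms.

    Two query-term occurrences are adjacent when only non-query terms lie
    between them; distance is the count of those intervening terms.
    """
    query = set(query_terms)
    pairs = []
    prev_term, prev_pos = None, None
    for pos, term in enumerate(doc_terms):
        if term not in query:
            continue
        # Assumption: a repeat of the same query term does not form a pair.
        if prev_term is not None and term != prev_term:
            pairs.append((prev_term, term, pos - prev_pos - 1))
        prev_term, prev_pos = term, pos
    return pairs


text = "at the home page, you can select the buying option for tips"
doc_terms = re.findall(r"[a-z]+", text.lower())
pairs = adjacent_query_pairs(doc_terms, ["home", "buying", "tips"])
print(pairs)  # [('home', 'buying', 5), ('buying', 'tips', 2)]
```

Because only consecutive query-term occurrences are paired, “home” and “tips” never appear together in the result, consistent with the example.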
$$\sum_{(t_i, t_j) \in S} \min\left(w_i^{(1)}, w_j^{(1)}\right) \cdot \frac{(k_1 + 1) \cdot \sum_{occ(t_i, t_j)} tpi(t_i, t_j)}{K + \sum_{occ(t_i, t_j)} tpi(t_i, t_j)} \tag{4}$$
where $t_i$ and $t_j$ represent a pair of adjacent query terms and $tpi$ is represented by the following equation:
$$tpi(t_i, t_j) = \frac{1.0}{d(t_i, t_j)^2} \tag{5}$$
where $d(t_i, t_j)$ is the distance between the query terms $t_i$ and $t_j$. The relevance of a document based on query term pairs (i.e., bigrams) is then combined with the relevance based on single query terms (i.e., unigrams) to give the overall relevance of a document.
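As an illustration of Equations 4 and 5, the following Python sketch computes the pair-based relevance from distances that are assumed to have been measured already. The function names, the dictionary inputs, and the default k1 = 1.2 are hypothetical. The text does not say how a distance of zero is handled, so the sketch simply rejects non-positive distances.

```python
def tpi(distance):
    """Equation 5: proximity contribution of one occurrence of a pair."""
    if distance <= 0:
        # Assumption: the text does not define tpi for a zero distance.
        raise ValueError("distance must be positive")
    return 1.0 / distance ** 2


def bigram_score(pair_distances, weights, K, k1=1.2):
    """Equation 4: relevance from adjacent pairs of query terms.

    pair_distances maps a pair (t_i, t_j) to a list of distances, one per
    occurrence of the pair in the document. weights maps each term to its
    w(1) value, and K is the length normalization of Equation 2.
    """
    score = 0.0
    for (ti, tj), distances in pair_distances.items():
        tpi_sum = sum(tpi(d) for d in distances)  # sum over occurrences
        w = min(weights[ti], weights[tj])
        score += w * (k1 + 1) * tpi_sum / (K + tpi_sum)
    return score
```

Because tpi falls off with the square of the distance, a pair separated by five terms contributes 0.04 per occurrence, whereas a pair at distance one contributes 1.0.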
A disadvantage of combining the unigram relevance and bigram relevance into document relevance is that it is difficult to estimate what their relative contributions should be. Moreover, a linear combination of these relevance scores may be inconsistent with the non-linear nature of traditional term frequency and inverse document frequency metrics.