Searching a body of documents for specific information is increasingly important, in an increasing number of systems. This introduction examines searching and ranking techniques for responding to a specific search request. Typically, providing best available information calls for scoring information in each document and then by ranking the information or the documents according to relevance. A variety of techniques are used to complete such tasks.
Custom built solutions typically offer acceptable performance for searching large bodies of data. Examples include those available for searching the Internet. Some of the commercially available solutions include biases that either speed the search or qualify data.
Experience has shown that it is typically a substantial challenge to meet performance expectations within the confines of certain systems. Examples of systems where challenges arise include general purpose database management systems and content management systems. For developers of searching algorithms, challenges to meeting expectations in such systems include balancing the concepts of ranking and approximation, as well as providing for a generality of purpose. These, and other concepts, are discussed in more detail to provide some perspective.
Ranking and approximation specify what to return when there are too many or too few results. One may consider these concepts to be at different ends of a single continuum. In order to provide desired searching capabilities, it is considered preferable that typical database systems should incorporate ranking into the generic and extensible architecture of the database engine. Typical database systems do not integrate the concepts of ranking and approximation. New and different ranking criteria and ranking functions should be easily incorporated into a query processing runtime. Preferably, database systems should not use a biased ranking method.
For more perspective, consider the following aspects of ranking text in databases. Note that information retrieval (IR) literature contains many ranking heuristics. A few of these heuristics, to which later reference will be made, include the Term Frequency Inverted Document Frequency (TFIDF) function, Static Rank functions, Searching by Numbers, Lexical Affinities (LA) and Salience Levels (SL), as well as other functions.
One common method for ranking text is by use of the TFIDF score. This is calculated as:
      TFIDF    ⁡          (              q        ,        d            )        =            ∑              t        ∈        q              ⁢                  φ                  t          ,          d                            Γ        t            
Here, q represents the query, φt,d is the number of times term t occurs in document d, divided by the total number of terms in the document d. Γt is the number of documents which contain term t. A discussion of TFDIF is provided in the reference “Managing Gigabytes,” I. H. Witten, A. Moffat, and T. C. Bell. Morgan Kaufman, San Francisco, 1999.
One example of a search engine that uses the Static Rank is the GOOGLE search engine, which uses PageRank. Typically, Static Ranks are used in combination with query dependent ranks such as TFIDF. As an example, scoring where the combination is used can be accomplished using a metric such as:COMBIN(q, d)=αSTATIC(d)+TFIDF(q, d)
The combination presumes that some documents are generally better than others, and therefore, should be favored during retrieval. A discussion of Static Ranks is presented in the publication by S. Brim and L. Page, and entitled “The Anatomy of a Large-Scale Hypertextual Web Search Engine,” as published in Proceedings of the 7th International World Wide Web Conference (WWW7), 1998. This publication also discusses PageRank.
As an example of keyword based querying of structured datasets, consider the following example of Searching by Numbers. In the example, a user enters a few numbers into a search bar, for instance, “1 GHz, 256M” and the search system translates the query automatically to something like:(processorSpeed≈1 GHz).and. (memoryCapacity≈256MB)
In addition to automated translation to a structured form, results are ranked based on a discrepancy between the requested value and actual value returned for the two parameters of interest. A discussion of Searching by Numbers is presented in the publication by R. Agrawal and R. Srikant, entitled “Searching with Numbers,” published in the Proceedings of the 2002 International World Wide Web Conference (WWW2002), Honolulu, Hi., May 2002.
Lexical Affinities and Salience Levels are described as score boosting heuristics. In the case of Lexical Affinities (LA), a score is boosted when two terms in the query appear within a small window of each other. In the case of Salience Levels (SL), the score is boosted when a query term appears with increased prominence such as in the title, a paragraph heading, or with bold and/or italicized text. Score boosting methods such as the use of LA and SL are commonly used in modern information retrieval systems. A discussion of Lexical Affinities and Salience Levels is provided in the publication by Y. Maarek and F. Smadja, and entitled “Full text indexing based on lexical relations: An application: Software libraries,” appearing in the Proceedings of the Twelfth International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 198-206, Cambridge, Mass., June 1989. Further examples are provided in the publication by E. M. Voorhees and D. K. Harman, and entitled “Overview of the Tenth Text Retrieval Conference (TREC-10),” appearing in the Proceedings of the Tenth Text Retrieval Conference (TREC-10), National Institute of Standards and Technology, 2001.
Another popular scoring function, referred to as OKAPI, is discussed in the publication by S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, and M. Gatford, entitled “Okapi at TREC-3,” appearing in Proceedings of the Third Text REtrieval Conference (TREC-3), pages 109-126. National Institute of Standards and Technology. NIST, 1994.
Presently, there is a debate by those skilled in the art over the choice of “term at a time” (TAAT) search strategies versus “document at a time” (DAAT) search strategies. One example of the various perspectives on these strategies is provided in the publication by H. Turtle and J. Flood, entitled “Query evaluation: Strategies and optimizations,” appearing in Information Processing and Management, 31(6):831-850, 1995.
Typically, a TAAT search engine maintains a sparse vector spanning the documents. The TAAT search engine iterates through the query terms and updates the vector with each iteration. The final state of the vector becomes the score for each document. TAAT search engines are relatively easy to program and new ranking functions are easily included in TAAT runtimes. Conversely, a DAAT search engines make use of document indices. Typically, a DAAT runtime search engine iterates through documents subject to the search and scores a document before proceeding to the next one. A heap maintains the current top l documents identified.
In the context of large data sets, it is considered by some that the index based DAAT runtime search engine outperforms the vector based TAAT search engine during query execution. However, DAAT runtimes are hard to implement. For example, each DAAT ranking engine is typically built as a custom system, rather than being implemented on top of a general purpose platform such as a database system. Typically, this is due to the fact that commercial database indices have little or no support for the ranking heuristics used in the text.
To address this issue, DAAT engines have typically been built using a two layer architecture. The user's query would first be translated into a Boolean query. A lower stage performs retrieval based on the Boolean query (or near Boolean query, that is a Boolean query with a “near” operator) which is then passed to a ranking stage for a complete evaluation. Thus, the Boolean stage acts as a filter which eliminates documents which have little or no relevance to the query or are otherwise unlikely to be in the result set.
From a runtime optimization perspective, two layer DAAT architecture has two potential problems. First, there is the need for a middle layer which translates a query into the Boolean form. This can be a complicated process. For example, translating to a Boolean “AND” of all the query terms may not return a potential hit, while translating to a Boolean “OR” may be an ineffective filter. Thus, depending on how effective the Boolean filters are, the DAAT search may end up performing a significant amount of extra input and output operations. Second, effective translations can lead to complicated intermediate Boolean queries. Consequently, the filters associated with even simple scoring functions such as TFIDF or COMBIN can present daunting optimization problems.
For further reference, the merge operator and the zig-zag join operators are described in the publication by Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garcia-Molina, entitled “Building a Distributed Full-Text Index for the Web,” appearing in ACM Transactions on Information Systems, 19(3):217-241, July 2001; and the publication by H. Garcia-Molina, J. Ullman, and J. Widom, entitled “Database System Implementation,” Prentice-Hall, 2000.
From a functional perspective, a scoring function uses more information than the Boolean filters. For instance, a TFIDF requires determination of the quantity φt,d, which requires more resources than determining if a term is present in a document. Per document scores (such as STATIC) and per term statistics (such as Γt) are used in scoring. Scoring in Searching by Numbers requires use of the numerical values in addition to indices. Heuristics such as LA and SL require information about where in the document and in what context any term occurred. In short, a typical information retrieval engine may use many heuristics and combine them in complicated ways. Examples are provided in two further publications. Consider the publication by David Carmel, Doron Cohen, Ronald Fagin, Eitan Farchi, Michael Herscovici, Yoelle S. Maarek, and Aya Soffer, entitled “Static index pruning for information retrieval systems,” appearing in the Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 41-50, New Orleans, La., September 2001. Such schemes often require data to be provided to support the ranking decisions.
In a typical modem day search engine, the scoring function and the runtime engine are co-designed. This prior art arrangement is depicted in FIG. 1.
Referring to FIG. 1, a prior art search engine 5 receives a query 8 and provides results 9. This process calls for use of parser 3, which builds a token table 4 from a base table 2. Typically, the base table 8 is a collection of documents 7. The token table 4 includes tokens 13, which may be considered as segments or compilations of relevant information from a document 7. From the token table 4, an index table 6 is built. Using the input query 8, the prior art search engine 5 typically includes an embedded scoring function to provide scoring and ranking of information in the index table 6.
While this arrangement usually means that runtime optimization performs well, such designs come at the cost of versatility. For example, some text search engines have been customized to the extent of including special purpose hardware to speed up critical operators. This has made sense for certain applications, especially in the case where there are few scoring functions which are of concern. In such instances, there is no need for versatility as understanding the scoring function allows for better scheduling of the runtime operators. However, such engines are typically not very useful in contexts other than those for which they were developed.
It is important that a generic search engine provide a generic interface to support the varied search functions. In particular, the scoring function used to rank results should be “plug and play.” That is, what is needed is a runtime search engine for text search and ranking where the scoring function is queried as a “black box.”