The goal of a document search is to take a query, compare the query to a set of known documents, and determine which documents best match the query. The quality of a match—i.e., the decision as to how well a document matches a query—is generally determined by giving each document a “score.” The score is a number that indicates how well the document matches the query.
Scores are typically computed as follows. Given a query, a database that contains information about known documents is searched. For each document that matches the query in some respect (e.g., each document that has at least one word in common with the query), a score data structure is created, which contains a list of values. Each value represents some aspect of how the document compares to the query (e.g., the number of nonstopwords matched, whether the exact query phrase is found in the document, whether the matching of query words in the document required “stemming” (e.g., removing a suffix such as “-ing” or “-ed”), etc.). A scalar value called the “score” is created from the information contained in the score data structure. Because each score is a scalar, the scores of any two documents can be compared directly and, thus, it is possible to determine which document, from among several documents, is the best match. Search results are typically provided in order of the document scores. Thus, the document with the highest score is listed first in the results (since, if the scoring method has done its job, that document should be the best match with the query), the next document listed is the document with the second highest score, and so on.
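By way of illustration only, the flow described above (build a score data structure for each matching document, reduce it to a scalar, and list results in descending score order) might be sketched as follows. The names ScoreData, build_score_data, rank and STOPWORDS, the choice of only two fields, and the whitespace tokenization are all assumptions made for this sketch and are not drawn from any particular search system; the function that reduces the structure to a scalar is left as a parameter.

```python
from dataclasses import dataclass

# Hypothetical stopword list; what counts as a stopword is a design choice.
STOPWORDS = {"and", "the", "a", "of", "in"}


@dataclass
class ScoreData:
    """Illustrative score data structure: one field per aspect of the match."""
    nonstopwords_matched: int
    exact_phrase_found: bool


def build_score_data(query, document):
    """Return ScoreData if the document shares at least one word with the
    query; otherwise return None (the document is not a candidate)."""
    query_words = query.lower().split()
    doc_words = set(document.lower().split())
    if not set(query_words) & doc_words:
        return None
    matched = {w for w in query_words if w not in STOPWORDS and w in doc_words}
    return ScoreData(
        nonstopwords_matched=len(matched),
        # Naive substring test, used here only to keep the sketch short.
        exact_phrase_found=query.lower() in document.lower(),
    )


def rank(query, documents, to_scalar):
    """Return (name, score) pairs for matching documents, highest score first.

    `documents` maps a document name to its text; `to_scalar` is any function
    that reduces a ScoreData instance to a single number.
    """
    scored = []
    for name, text in documents.items():
        data = build_score_data(query, text)
        if data is not None:
            scored.append((name, to_scalar(data)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

For instance, rank could be called with a placeholder reduction such as lambda d: d.nonstopwords_matched + (1 if d.exact_phrase_found else 0); that particular reduction is arbitrary and serves only to make the sketch executable.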
The creation of a scalar score from a score data structure is performed by applying a formula to the information in the score data structure. For example, if the score data structure contains the number of nonstopwords matched, and the aggregate rarity index of each word matched, then a scalar score can be computed using the formula: score = 0.85 * nonstopwords + 0.65 * rarity_index. In this example, 0.85 and 0.65 are arbitrary constants to be multiplied by values in the score data structure. (“Nonstopwords” generally include those words that are of significance in distinguishing one document from another. “Nonstopwords” are in contradistinction from “stopwords,” which generally include very common words such as “and,” “the,” “a,” etc. What constitutes a “stopword” or a “nonstopword” in a given search system is a choice made by the system's designers.)
To describe the above example in greater generality, if the score data structure contains n values numbered 0 through n−1, then the score may be computed by the formula: $\text{score} = c_0 v_0 + \cdots + c_{n-1} v_{n-1}$, where $v_0, \ldots, v_{n-1}$ are the values in the score data structure, and $c_0, \ldots, c_{n-1}$ are the respective constants by which those values are to be multiplied to arrive at the score. The constants $c_0, \ldots, c_{n-1}$ essentially represent a judgment about the relative importance of each value in arriving at the score.
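The weighted-sum formula above can be illustrated with a short sketch. The function name scalar_score and the sample input values (3 nonstopwords matched, a rarity index of 2.0) are assumptions chosen for illustration; only the constants 0.85 and 0.65 come from the example given earlier.

```python
def scalar_score(values, constants):
    """Compute c_0*v_0 + ... + c_{n-1}*v_{n-1} for equal-length sequences."""
    if len(values) != len(constants):
        raise ValueError("values and constants must have the same length")
    return sum(c * v for c, v in zip(constants, values))


# The two-value example: score = 0.85 * nonstopwords + 0.65 * rarity_index.
nonstopwords = 3      # hypothetical value for illustration
rarity_index = 2.0    # hypothetical value for illustration
example_score = scalar_score([nonstopwords, rarity_index], [0.85, 0.65])
```

Changing the relative importance of a value amounts to changing the corresponding constant; in this sketch the constants are passed as an argument, but they could equally be written directly into the function body.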
In a typical system, the formula that is used to compute the score from the score data structure is hard-coded into the scoring software. Thus, in order to change the formula used to compute the score, the scoring software must be rewritten and recompiled. As a result, it is difficult for existing scoring software to accommodate changes in the way that scores are computed, or tests of new scoring strategies. Some such existing scoring software also has the disadvantage that permitting any changes to the scoring formula requires that the operator of the scoring software have access to the source code and the ability to recompile it. This arrangement may give broader access to the source code than the source code's owner might desire, and also has the disadvantage that encouraging frequent modifications to the source code, no matter how minor, creates the opportunity to introduce errors and bugs into the code. Other systems can be “trained” and thus do not require recompiling in order to change the formula; however, the training process is generally slow, and therefore expensive in terms of machine time.
In view of the foregoing, there is a need for a system that overcomes the drawbacks of the prior art.