In a text document search, a user typically enters a query into a search engine. The search engine evaluates the query against a database of indexed documents and returns a ranked list of documents that best satisfy the query. For each document, the search engine algorithmically generates a score representing a measure of how well that document satisfies the query. Commonly used scoring algorithms split the query into search terms and use statistical information about the occurrence of individual terms in the body of text documents to be searched. The documents are listed in rank order according to their corresponding scores so that the user sees the best-matching search results at the top of the search results list.
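As an illustration only, the term-statistics scoring described above can be sketched with a simple TF-IDF-style weighting. This is a minimal sketch, not the scoring function of any particular search engine; the function names `tokenize` and `rank`, the whitespace tokenizer, and the smoothed logarithmic weighting are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def rank(query, documents):
    """Return (document_index, score) pairs sorted from best to worst."""
    # Per-document term counts, treating each document as one flat string.
    doc_terms = [Counter(tokenize(d)) for d in documents]
    n_docs = len(documents)
    scores = []
    for idx, counts in enumerate(doc_terms):
        score = 0.0
        for term in tokenize(query):
            # Document frequency: number of documents containing the term.
            df = sum(1 for c in doc_terms if term in c)
            if df == 0:
                continue  # term occurs nowhere, so it contributes nothing
            # Rarer terms get a larger weight (assumed smoothing).
            idf = math.log(1 + n_docs / df)
            score += counts[term] * idf
        scores.append((idx, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank("cat", docs)` on a small list of strings returns the document indices ordered by score, with the documents that mention the term most often listed first.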
Many such scoring algorithms assume that each document is a single, undifferentiated string of text. The search terms of the query are applied to the text string (or, more accurately, to the statistics generated from the undifferentiated text string that represents each document). However, documents often have internal structure (e.g., titles, section headings, and metadata fields), and reducing such documents to an undifferentiated text string loses any searching benefit that the structural information would provide.
Some existing approaches attempt to incorporate the internal structure of documents into a search by generating statistics for individual document fields and generating scores for individual fields. The score for an individual document is then computed as a weighted sum of scores for its fields. However, in such existing approaches, the weights applied to individual fields of different documents do not adequately consider the influence of document length, field lengths, and the possible combinations of term frequencies of different query terms in different fields on the overall score for a given document.
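The weighted-sum approach described above can be sketched as follows. This is a hypothetical illustration: the names `field_score` and `document_score`, the raw term-count field score, and the example weights are assumptions, not details of any specific existing system.

```python
def field_score(query_terms, field_text):
    # Assumed per-field score: raw count of query-term occurrences.
    tokens = field_text.lower().split()
    return float(sum(tokens.count(term) for term in query_terms))


def document_score(query, fields, weights):
    """Combine per-field scores as a weighted sum.

    fields  -- mapping of field name to field text
    weights -- mapping of field name to a fixed weight (missing names get 0)
    """
    terms = query.lower().split()
    return sum(
        weights.get(name, 0.0) * field_score(terms, text)
        for name, text in fields.items()
    )


# Example: a document with a title field and a body field.
doc = {
    "title": "field weighted ranking",
    "body": "ranking documents by field",
}
weights = {"title": 3.0, "body": 1.0}
score = document_score("ranking field", doc, weights)
```

In this sketch the weights are fixed per field name, so the combined score takes no account of document length, field length, or how the frequencies of different query terms are distributed across fields, which is the limitation noted above.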