The present invention generally relates to information retrieval from documents and, more particularly, to the calculation of the relevance of the documents or portions thereof obtained from an automated search.
Many methods are currently available to search databases for documents relevant to a query posed by a user seeking information. The most common type of query-based search methodology capable of locating portions of documents requires the ranking of arbitrary sub-documents contained within a group of documents. Such a methodology involves arbitrarily defining groups of sub-documents, processing a query on the sub-documents to generate scores and retrieving sub-documents having the best (i.e., highest) scores. Scores are determined by the total number and value of each query term contained within a sub-document.
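By way of illustration only, the conventional sub-document methodology described above can be sketched as follows. The fixed window size, the function names, and the per-term weight table are hypothetical and are not taken from any particular prior-art system; the sketch assumes that a sub-document's score is simply the sum of the weights of all query-term occurrences it contains.

```python
from typing import Dict, List, Tuple


def split_fixed(tokens: List[str], size: int) -> List[List[str]]:
    """Arbitrarily cut a tokenized document into fixed-size sub-documents."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def score_subdocument(sub: List[str], weights: Dict[str, float]) -> float:
    """Sum the weight of every query-term occurrence (assumed scoring rule)."""
    return sum(weights.get(tok, 0.0) for tok in sub)


def rank_subdocuments(tokens: List[str], weights: Dict[str, float],
                      size: int) -> List[Tuple[float, int]]:
    """Return (score, sub-document index) pairs, highest score first."""
    subs = split_fixed(tokens, size)
    scored = [(score_subdocument(s, weights), i) for i, s in enumerate(subs)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

In this sketch the sub-document boundaries are fixed before the query is seen, and every repeated occurrence of a term adds its full weight to the score.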
A problem associated with conventional document search methods is that they are not capable of locating regions within documents that are relevant to a search query. Instead, conventional search methodologies only rank “sub-documents,” whose relevance is then determined arbitrarily at a later time.
Another problem associated with conventional document search methods is that they only rank pre-defined sub-documents. They are not capable of defining regions of documents based on the location of query search terms contained within a document.
Yet another problem associated with conventional document search methods is that they are not capable of taking the repetition of identical search terms in a particular document region into account when determining the relevance of a document. Instead, conventional search methods simply increase the score of a sub-document based on the total number or value of all terms contained in a sub-document.
Yet another problem associated with conventional document search methods is that they are not capable of returning approximate phrase matches. Instead, conventional search methods are only capable of returning exact phrase matches, and are not capable of compensating for noisy speech or inaccurate transcripts.
The aforementioned and related drawbacks associated with conventional document retrieval analysis and methods are substantially reduced or eliminated by the present invention. The present invention is directed to a method for defining regions of documents based on the location of query search terms contained within a document. The method includes ranking regions of documents, taking the repetition of identical search terms as well as the relative importance of distinct search terms within a region into account when determining the relevance of a region. According to an exemplary embodiment, the method for finding and ranking portions of documents responsive to a query comprises the steps of: providing at least one database containing documents; providing a query relating to information sought by a user, the query containing at least one search term; assigning a value to each document responsive to at least one search term contained in the query indicating the relevance of the document; defining regions within the document, each region corresponding to a selected one of the search terms; and assigning a value to each region responsive to at least one search term contained in the query indicating the relevance of the region. After the relevant documents have been assigned a value, they are provided to the user in order of relevance.
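As a minimal, non-limiting sketch of the region-based approach summarized above, the following assumes that each region is a window of hypothetical width `radius` tokens on either side of an occurrence of a search term, that the relative importance of each distinct term is given by a weight table, and that repeated occurrences of the same term contribute geometrically diminishing amounts controlled by a `damping` factor. The window shape, the damping rule, and all names are assumptions made for illustration and are not asserted to be the scoring actually used by the invention.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def find_regions(tokens: List[str], query_terms: Set[str],
                 radius: int) -> List[Tuple[int, int, str]]:
    """Define one region per search-term occurrence: (start, end, term)."""
    regions = []
    for pos, tok in enumerate(tokens):
        if tok in query_terms:
            start = max(0, pos - radius)
            end = min(len(tokens), pos + radius + 1)
            regions.append((start, end, tok))
    return regions


def score_region(tokens: List[str], start: int, end: int,
                 weights: Dict[str, float], damping: float = 0.5) -> float:
    """Score a region; the k-th repeat of a term adds weight * damping**k."""
    counts = Counter(t for t in tokens[start:end] if t in weights)
    score = 0.0
    for term, n in counts.items():
        score += weights[term] * sum(damping ** k for k in range(n))
    return score


def rank_regions(tokens: List[str], weights: Dict[str, float], radius: int,
                 damping: float = 0.5) -> List[Tuple[float, int, int, str]]:
    """Return (score, start, end, term) tuples, most relevant region first."""
    regions = find_regions(tokens, set(weights), radius)
    scored = [(score_region(tokens, s, e, weights, damping), s, e, t)
              for s, e, t in regions]
    return sorted(scored, key=lambda r: (-r[0], r[1]))
```

Unlike a fixed sub-document scheme, the region boundaries here follow from where the search terms actually occur in the document, and a second occurrence of the same term within a region raises the score by less than the first.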
An advantage provided by the present invention is that it is capable of locating regions within documents that are relevant to a search query.
Another advantage provided by the present invention is that it is capable of defining regions of documents based on the location of query search terms contained within a document.
Yet another advantage provided by the method of the present invention is that it is capable of taking the repetition of identical search terms in a region into account when determining the relevance of a document, or a particular region thereof.
Yet another advantage provided by the method of the present invention is that it is capable of returning approximate phrase matches, in order to compensate for noisy speech or inaccurate transcripts.