It is common in a variety of settings to search free text data to identify those data records that satisfy a predefined query. These searches may be conducted over various data sources, including document collections, databases or other data sources, such as those available over the Internet. Regardless of the data source, a search may be conducted to identify the data records that include one or more search terms identified by the query. The data records that are returned from a search may then be reviewed, for example, to learn more about the subject of the search.
The quality of a search may be defined by its recall and its precision. Recall relates to the number or percentage of correct answers that are returned relative to all of the correct answers within the data source(s) that are searched. Searches that identify a greater percentage of all of the correct answers have a greater recall. Precision relates to the number or percentage of the returned answers that are correct. Thus, searches for which a greater percentage of the returned answers are correct have a greater precision.
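The two measures defined above can be illustrated with a minimal sketch. The function name, the record identifiers and the convention for an empty denominator are assumptions made for this illustration only and do not come from the text.

```python
def recall_and_precision(returned, relevant):
    """Return (recall, precision) for a single search.

    returned: identifiers of the data records the search returned
    relevant: identifiers of all correct data records in the data source
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # correct answers that were actually returned
    # Assumption: an empty denominator is scored as 1.0 by convention.
    recall = len(hits) / len(relevant) if relevant else 1.0
    precision = len(hits) / len(returned) if returned else 1.0
    return recall, precision


# Hypothetical data: four correct records exist; the search returns
# three records, two of which are correct.
recall, precision = recall_and_precision(returned={1, 2, 9}, relevant={1, 2, 3, 4})
print(f"recall={recall:.2f} precision={precision:.2f}")
```

In this hypothetical case the search finds half of the correct records (recall) while two of its three returned records are correct (precision).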
Typically, there is a tradeoff between recall and precision and, depending upon the purpose of a search, it may be desirable for the search to have a greater recall, a greater precision or both. For example, it is generally desirable for the searches conducted by engineers and scientists to have a relatively high recall since the engineers and scientists are interested in all of the results from the data source that satisfy the query and not just some of them. In a conventional Internet search, multiple pages of search results may be returned, with users typically reviewing only a few of the data records identified by the search, such as the data records from the first page or two of the search results. In contrast, an engineer or scientist is more likely to review each, or at least a much greater percentage, of the data records identified by the search since the engineer or scientist is frequently trying to consider all of the relevant information and not just a small subset of it.
The quality of search results may be limited, however, in instances in which the free text data is noisy. In this regard, data may be noisy in instances in which terms within a data record are abbreviated, misspelled or represented by an acronym. Data may also be noisy in instances in which the authors of different data records utilize different terms to represent the same or similar concepts. Moreover, users conducting a search, such as subject matter experts conducting research, may not anticipate all of the variations of a search term that may exist and may not be accustomed to constructing the complex queries that would be required in order to return all of the data records that include the search term or terms related to the search term. Thus, the recall of a search of free text data may not be as high as desired in instances in which the search is not structured so as to identify both the initial search terms and related terms.
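The effect of noisy data on recall can be illustrated with a small sketch that compares a search on the initial term alone with a search that also includes related terms. The records, the table of related terms and the function names below are all hypothetical and were invented for this illustration; the sketch is not a description of any particular method of identifying related terms.

```python
import re

# Hypothetical records containing an abbreviation, a misspelling and a
# different term used for a similar concept.
RECORDS = {
    1: "Hydraulic pump failed during taxi.",
    2: "Replaced hyd pmp after leak found.",
    3: "Hydralic pump pressure low.",
    4: "Fluid transfer unit inoperative.",
    5: "Cabin light flickering.",
}

# Hypothetical, hand-built table of related terms (assumption: in practice
# such a table would have to be curated or derived from the data).
RELATED_TERMS = {
    "hydraulic pump": ["hyd pmp", "hydralic pump", "fluid transfer unit"],
}


def search(records, terms):
    """Return ids of records containing any of the terms, ignoring case."""
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return {rid for rid, text in records.items() if pattern.search(text)}


def expanded_search(records, term, related=RELATED_TERMS):
    """Search for the initial term together with its related terms."""
    return search(records, [term] + related.get(term, []))


relevant = {1, 2, 3, 4}  # records assumed to concern the searched concept
plain = search(RECORDS, ["hydraulic pump"])
expanded = expanded_search(RECORDS, "hydraulic pump")
plain_recall = len(plain & relevant) / len(relevant)
expanded_recall = len(expanded & relevant) / len(relevant)
print(plain_recall, expanded_recall)
```

With this invented data, the search on the initial term alone returns only the one record that spells the term exactly, while the search that includes the related terms returns all four records assumed to be relevant.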