The present invention relates generally to the field of information retrieval, and more particularly to query expansion of terms of a search, based on data used for other purposes.
Query expansion (QE) is the process of reformulating a seed query to improve retrieval performance in information retrieval operations. In the context of computer-based searches, query expansion involves evaluating a user's input and expanding the search query to generate additional document matches. Query expansion techniques include finding synonyms and various morphological forms of the words in a query, and including those synonyms and morphological forms in the search query.
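As a non-limiting illustration, the expansion of a seed query with synonyms and morphological forms may be sketched as follows. The lookup tables `SYNONYMS` and `MORPHOLOGICAL_FORMS` and the function `expand_query` are hypothetical names introduced only for this example; a practical system would likely draw such terms from a thesaurus, a stemmer or lemmatizer, or a knowledge base.

```python
# Hypothetical lookup tables, assumed only for this illustration.
SYNONYMS = {
    "car": ["automobile"],
    "repair": ["fix"],
}
MORPHOLOGICAL_FORMS = {
    "car": ["cars"],
    "repair": ["repairs", "repaired", "repairing"],
}


def expand_query(seed_query):
    """Return the seed terms plus their synonyms and morphological forms."""
    expanded = []
    for term in seed_query.lower().split():
        candidates = (
            [term]
            + SYNONYMS.get(term, [])
            + MORPHOLOGICAL_FORMS.get(term, [])
        )
        for candidate in candidates:
            # Keep the first occurrence of each term and preserve order.
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded_terms = expand_query("car repair")
print(expanded_terms)
# ['car', 'automobile', 'cars', 'repair', 'fix', 'repairs', 'repaired', 'repairing']
```

A term that appears in neither table is passed through unchanged, so the expanded query always contains at least the terms of the seed query.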
In information retrieval, precision is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. Both precision and recall are therefore based on relevance with respect to the set of criteria used in a search query. For a given number of search results, high precision means that the search returns substantially more relevant results than irrelevant ones. High recall means that the search returns most of the relevant results. Information retrieval systems, such as a criteria validation system, benefit from both high precision and high recall. Including all related terms from a knowledge base may yield higher recall; however, the overall precision of the results may suffer, and both aspects are important in information retrieval.
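As a non-limiting illustration, the two measures defined above may be computed as follows. The function name `precision_recall` and the document identifiers are assumptions made only for this example; the counts are invented to show how broadening a query can raise recall while lowering precision.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved results."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved instances that are relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data: four documents are relevant to the criteria.
relevant_docs = {"d1", "d2", "d3", "d4"}

# A narrow query retrieves only relevant documents, but misses half of them.
narrow = precision_recall({"d1", "d2"}, relevant_docs)

# A broadly expanded query retrieves every relevant document,
# together with four irrelevant ones.
broad = precision_recall(
    {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}, relevant_docs
)

print(narrow)  # (1.0, 0.5)
print(broad)   # (0.5, 1.0)
```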
A criteria validation system refers to a system in which a set of unstructured text criteria is validated or evaluated against unstructured data content to determine whether the condition of the criteria is “met” or “not met” in the unstructured data. The unstructured data is often text content, and the unstructured criteria often include and/or exclude particular words or phrases. The validation examines and analyzes the unstructured data content, which is sometimes referred to as evidence, to determine whether the conditions of the criteria are found in the content; if the conditions are met, the content is considered a match to the criteria.
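As a non-limiting illustration, a highly simplified validation of criteria against evidence may be sketched as follows. The dictionary layout with `include` and `exclude` phrase lists, the function name `validate`, and the literal, case-insensitive substring matching are all assumptions made for this example only; an actual criteria validation system would typically apply more sophisticated text analysis.

```python
def validate(criteria, evidence):
    """Return "met" if the evidence satisfies the criteria, else "not met"."""
    text = evidence.lower()
    # Every phrase in the include list must appear in the evidence.
    has_required = all(
        phrase.lower() in text for phrase in criteria.get("include", [])
    )
    # No phrase in the exclude list may appear in the evidence.
    has_excluded = any(
        phrase.lower() in text for phrase in criteria.get("exclude", [])
    )
    return "met" if has_required and not has_excluded else "not met"


# Illustrative criteria and evidence, invented for this example.
criteria = {"include": ["warranty"], "exclude": ["expired"]}

print(validate(criteria, "The device is covered by a two-year warranty."))
# met
print(validate(criteria, "The warranty expired last year."))
# not met
print(validate(criteria, "The device is covered by a guarantee."))
# not met
```

In the third case the evidence expresses the required concept with a different word, and literal matching against the criteria terms reports “not met”.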
Situations in which the criteria of a query include limited information, or are specified in a particular manner, may make it difficult to decide whether text content matches the criteria. A query based on the limited criteria terms (or phrases) may have reduced recall, because it may exclude alternative labels or descriptions of the criteria, or reject text content containing alternative expressions of the criteria.