1. Field
This disclosure relates to computer readable document and file content analysis and retrieval of computer readable documents and files as well as information contained therein.
2. Description of the Related Art
There are several approaches to information retrieval. In general, the goal is to find documents that best address the user's information need as expressed in a query. A Boolean approach matches documents if they satisfy the truth conditions of a Boolean statement (e.g., budget AND proposal NOT loss).
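By way of illustration only, the Boolean statement above might be evaluated as in the following minimal sketch. The function names, the whitespace tokenization, and the three sample documents are assumptions made for this example and are not taken from any particular retrieval system.

```python
def tokenize(text):
    """Lowercase a text and split it on whitespace into a set of terms."""
    return set(text.lower().split())


def matches(text, required=(), excluded=()):
    """Return True if the text has every required term and no excluded term."""
    terms = tokenize(text)
    has_required = all(term in terms for term in required)
    has_excluded = any(term in terms for term in excluded)
    return has_required and not has_excluded


# Hypothetical sample documents, invented for this illustration.
docs = {
    "d1": "budget proposal for next year",
    "d2": "budget proposal showing a loss",
    "d3": "meeting notes",
}

# Evaluate "budget AND proposal NOT loss" against each document.
hits = [
    name
    for name, text in docs.items()
    if matches(text, required=("budget", "proposal"), excluded=("loss",))
]
```

In this sketch only the first document satisfies the statement; the second is rejected because it contains the excluded term, and the third lacks the required terms.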
A vector-space model represents documents and queries as vectors, in which each position may stand for a specific word. An element in the vector is set to be nonzero if the corresponding word is in the text. The same element is set to 0 if the word is not in the text. In this approach, the relevance of a document to a query is usually measured as the cosine of the angle between the two vectors, but other measures, such as Euclidean distance, are also possible. Neural network approaches may also employ a form of the vector space model.
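A minimal sketch of the binary vector representation and the cosine measure described above follows. The four-word vocabulary and the sample texts are assumptions chosen for illustration; practical systems typically use much larger vocabularies and weighted, rather than binary, elements.

```python
import math


def to_vector(text, vocabulary):
    """Build a binary vector: 1 if the vocabulary word is in the text, else 0."""
    terms = set(text.lower().split())
    return [1 if word in terms else 0 for word in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical vocabulary and texts, invented for this illustration.
vocabulary = ["budget", "proposal", "loss", "meeting"]
query = to_vector("budget proposal", vocabulary)
doc_a = to_vector("budget proposal loss", vocabulary)
doc_b = to_vector("meeting", vocabulary)

score_a = cosine(query, doc_a)  # shares two words with the query
score_b = cosine(query, doc_b)  # shares no words with the query
```

A document that shares words with the query receives a higher cosine score than a document that shares none, which in this representation always scores zero.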
There has been a great deal of activity over the last several years in conceptual search. Conceptual search attempts to find documents that are conceptually related to a search query, rather than just those that contain the query terms.
One approach to conceptual search uses a knowledge structure, such as a thesaurus, taxonomy, or ontology. These knowledge structures are typically created by human knowledge engineers and may include descriptions of how words are related to one another. A thesaurus specifies the synonyms for each entry word. A taxonomy describes a hierarchy of classes, where each lower-level item is a subclass of a higher-level item. An ontology also describes categorical relations, but these do not have to be strictly hierarchical. When a user enters a query, the information structure provides additional terms that the designers have determined are conceptually related to the query terms.
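The query expansion step described above can be sketched as follows, assuming the knowledge structure is a simple thesaurus stored as a mapping from entry words to synonyms. The thesaurus entries and the function name are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical hand-built thesaurus: each entry word maps to related terms.
THESAURUS = {
    "budget": {"forecast", "allocation"},
    "proposal": {"bid", "offer"},
}


def expand_query(terms, thesaurus):
    """Return the query terms plus any related terms listed in the thesaurus."""
    expanded = set(terms)
    for term in terms:
        # Terms without a thesaurus entry contribute only themselves.
        expanded |= thesaurus.get(term, set())
    return expanded


expanded = expand_query(["budget", "loss"], THESAURUS)
```

Here the term with a thesaurus entry is supplemented by its listed synonyms, while the term without an entry passes through unchanged.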
Other approaches may use machine learning or statistical inference to determine how concepts are related to query terms, and thus how documents containing those concepts are related to queries. Two of these technologies are based on the vector space model.
TABLE 1
Mathematical approaches to concept learning

                 Vector space                Probabilistic
Word-document    Latent Semantic Indexing    Naïve Bayesian Classifiers
Word-word        Neural network              Present language model system
A well-known mathematical approach to concept-related retrieval is latent semantic indexing (LSI), also called latent semantic analysis. Latent semantic indexing starts with a matrix of word-document associations. It employs the statistical procedure of singular value decomposition to reduce the dimensionality of these associations and capture the regularities that give the system its conceptual ability. Words that are used in similar documents are represented along similar dimensions related to their meaning.
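The dimensionality reduction used in latent semantic indexing is commonly written as a truncated singular value decomposition. The notation below is an assumption introduced for illustration: A is the m-by-n matrix of word-document associations, k is the chosen number of retained dimensions, and q is a query expressed as a vector over the m words.

```latex
% Rank-k approximation of the word-document matrix A (m words, n documents)
A \approx A_k = U_k \Sigma_k V_k^{\mathsf{T}},
\qquad U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

% A query vector q is commonly mapped into the same k-dimensional space
\hat{q} = \Sigma_k^{-1} U_k^{\mathsf{T}} q
```

Under this formulation, each document corresponds to a row of V_k, and a query may be compared with documents in the reduced space, for example by cosine similarity.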
A second vector-related approach to conceptual search involves the use of a neural network. This approach starts with a matrix of word-word associations. It uses a neural network process related to the statistical procedure of principal component analysis to reduce the dimensionality of the word-word associations and to capture the regularities that give the system its conceptual ability. Words that are used in the same context as other words are represented along similar dimensions related to their meaning.
A fourth commonly used approach to conceptual search uses the statistics of probability to derive conceptual meaning. Probabilistic approaches estimate the likelihood that a document is relevant to a specific query. Generally, these models focus on computing the probability that a document is relevant to a specific category of documents. These estimates are derived from a collection of documents that are known to be relevant to each of the categories involved.
A widely used probabilistic approach is a Bayesian classifier. These classifiers use Bayesian probability theory and a set of categorized examples to determine the likelihood that each word in a document is indicative of one category or another. They then use this category information to provide conceptual information to the user about his or her query. This Bayesian approach limits its conceptual “knowledge” to just those categories that were used in calculating the probabilities.
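As a hedged illustration of such a classifier, the following sketch trains a naive Bayes model on a handful of categorized examples and assigns a new text to the most likely category. The category names, the training texts, the add-one smoothing, and the decision to skip words absent from the training vocabulary are all assumptions made for this example.

```python
import math
from collections import Counter


def train(examples):
    """Count documents and words per category from (category, text) pairs."""
    doc_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for category, text in examples:
        words = text.lower().split()
        doc_counts[category] += 1
        word_counts.setdefault(category, Counter()).update(words)
        vocabulary.update(words)
    return doc_counts, word_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log posterior for the text."""
    doc_counts, word_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, float("-inf")
    for category in doc_counts:
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # assumption: words never seen in training are skipped
            # Add-one smoothing keeps unseen word/category pairs nonzero.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical categorized examples, invented for this illustration.
examples = [
    ("finance", "budget proposal revenue"),
    ("finance", "quarterly budget loss"),
    ("sports", "team won the match"),
    ("sports", "match score and team ranking"),
]
model = train(examples)
label = classify("budget loss report", model)
```

Consistent with the limitation noted above, the sketch can only ever return one of the categories present in its training examples.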
Another approach to conceptual search also exploits the statistics of probability. This approach involves the use of language models to capture the conceptual relationships. A specific version of this language-modeling approach, which introduces a number of innovations, is described herein.
Throughout this description, elements appearing in figures are assigned three-digit reference designators, where the most significant digit is the figure number and the two least significant digits are specific to the element.