1. Field of the Invention
The present invention relates generally to data processing and more specifically to document searching. More particularly, the present invention relates to using a representational semantic space for comparing the similarity of document representations.
2. Description of Related Art
Information is being created and made available to the public at a greater rate than ever before, due for the most part to advances in information technology and electronic communications systems. There is so much information that it is simply not possible for an individual person to read it all, much less remember its content, context or the semantic concepts that are associated with it. Still, an enterprise often relies on certain information items reaching certain enterprise members. Even before the recent advances in information technology and electronic communications systems, it was understood that there was a need for efficient and accurate information retrieval. The result was a myriad of information retrieval processes and strategies that worked relatively well under a narrowly defined set of searching conditions.
Probably the simplest of all information retrieval techniques is word or term searching. Term searching involves a user querying a corpus of documents for a specific term or word. A resultant solution set of documents is then identified as containing the search term. Single term searching is extremely fast and efficient because little computational effort is involved in the query process, but it can result in a relatively large solution set of unranked documents being returned. Many of the documents may not be relevant to the user because, although the term occurs in the solution document, it is out of context with the user's intended meaning. This often prompts the user to perform a secondary search for a more relevant solution set of documents. Even if all of the resultant documents use the search term in the proper context, the user has no way of judging which documents in the solution set are more relevant than others. Additionally, a user must have some knowledge of the subject matter of the search topic for the search to be efficient (e.g., a less relevant word in a query could produce a solution set of less relevant documents and cause more relevant documents in the solution set to be ranked lower than less relevant documents).
Another pitfall of traditional term searching is that it overlooks many potentially relevant documents because of case and because of inflectional prefixes and suffixes added to a root word as that word is used in a document, even though those words are used in context with the user's intent (e.g., “walking,” “walks,” “walker” have in common the root “walk” and the affixes -ing, -s, and -er). Searching for additional affixes of a root or base word is called “word stemming.” Word stemming is often incorporated in a search tool as an automated function, but some tools require the user to identify the manner in which a query term should be stemmed, usually by manually inserting term “truncators” or “wildcards” into the search query. Word stemming merely increases the possibility that a document will be listed as relevant, thereby further increasing the size of the solution set without addressing the other shortcomings of term searching.
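The effect of word stemming on the solution set can be illustrated with a minimal Python sketch. The suffix list, the `naive_stem` and `search` helpers, and the four sample documents are hypothetical and chosen only for illustration; practical stemmers apply far more careful rules than this one.

```python
# Hypothetical suffix list covering only the affixes named in the text.
SUFFIXES = ("ing", "er", "s")

def naive_stem(word: str) -> str:
    """Lowercase a word and strip at most one known suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def search(term: str, docs: dict, stem: bool) -> set:
    """Return the ids of documents containing the term, optionally stemmed."""
    normalize = naive_stem if stem else (lambda w: w)
    target = normalize(term)
    return {
        doc_id
        for doc_id, text in docs.items()
        if target in {normalize(w) for w in text.split()}
    }

# Illustrative corpus (an assumption, not taken from the source).
DOCS = {
    "d1": "she walks to work",
    "d2": "a walker on the trail",
    "d3": "Walking is healthy",
    "d4": "the walk was long",
}

exact = search("walk", DOCS, stem=False)   # only the literal, case-sensitive match
stemmed = search("walk", DOCS, stem=True)  # also matches walks, walker, Walking
print(sorted(exact), sorted(stemmed))
```

In this sketch the exact search finds one document while the stemmed search finds all four, which shows how stemming enlarges the solution set without ranking it.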
A logical improvement to single term searching is multiple, simultaneous term searching using “Boolean term operators,” or simply, Boolean operators. A Boolean retrieval query passes through a corpus of documents by linking search terms together with Boolean operators such as AND, OR and NOT. The solution set of documents is generally smaller than that of a single term search, and all returned documents are normally ranked equally with respect to relevance. The Boolean method of term searching might be the most widely used information retrieval process and is often used in search engines available on the Internet because it is fast, uncomplicated and easy to implement in a remote online environment. However, the Boolean search method carries with it many of the shortcomings of term searching. The user must have some knowledge of the search topic for the search to be efficient and to avoid relevant documents being treated as non-relevant and vice versa. Furthermore, since the returned documents are not ranked, the user may be tempted to reduce the size of the solution set by including more Boolean-linked search terms. However, increasing the number of search terms in the query narrows the scope of the search and thereby increases the risk that a relevant document is missed. Even then, all documents in the solution set are still ranked equally.
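Boolean retrieval as described above can be sketched in Python with an inverted index and set operations. The corpus, the `build_index` and `term` helpers, and the query terms are assumptions made for illustration and do not represent any particular search engine.

```python
from collections import defaultdict

# Illustrative corpus (an assumption, not taken from the source).
DOCS = {
    "d1": "jaguar speed in the wild",
    "d2": "jaguar car top speed",
    "d3": "car engine repair",
}

def build_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

INDEX = build_index(DOCS)

def term(word):
    """Documents containing the word; an empty set if the word is unknown."""
    return INDEX.get(word, set())

or_hits = term("jaguar") | term("car")    # jaguar OR car
and_hits = term("jaguar") & term("car")   # jaguar AND car
not_hits = term("jaguar") - term("car")   # jaguar AND NOT car

# Adding a further AND-linked term narrows the result.
two_terms = term("jaguar") & term("speed")
three_terms = two_terms & term("wild")
print(sorted(or_hits), sorted(and_hits), sorted(not_hits))
print(sorted(two_terms), sorted(three_terms))
```

Each result is a plain set, so every returned document is equally ranked, and the third AND-linked term drops a document that also matched the first two terms.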
Other information retrieval processes were devised that extended and refined the Boolean term searching method in an attempt to rank the resultant documents, although not necessarily to reduce the size of the solution set. One such approach weights the query terms and/or weights the occurrence of terms in the solution set of documents by frequency. Expanded term weighting operations make ranking of documents possible by assigning weights to search terms used in the query. Documents that are returned with higher ranking search terms are themselves ranked higher in relevance. A more useful variation, however, weights terms by their occurrence frequency in the resultant documents, thereby allowing the documents in the solution set to be ranked. A higher term occurrence frequency in a resultant document is taken as indicative of relevance. Boolean searching variants that allow for term frequency based document ranking have a number of drawbacks, the most obvious of which is that longer documents have a higher probability of being ranked higher in the solution set without a corresponding increase in relevance. Additionally, because the occurrence of the term is not context related, higher ranking documents are not necessarily more relevant to the user. Moreover, if the user does not have an understanding of the subject matter being searched, the combination of Boolean logic and term weighting may exclude some of the most relevant documents from the solution set and simultaneously under-rank others. Additionally, certain techniques, for instance word stemming and thesaurus lookup, are language dependent.
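The length bias of term frequency ranking can be seen in a short Python sketch. The `tf_score` and `rank` helpers and the two sample documents are hypothetical; the score is a raw occurrence count, which is only one of many possible weighting schemes.

```python
from collections import Counter

def tf_score(query_terms, text):
    """Sum the raw occurrence counts of the query terms in one document."""
    counts = Counter(text.lower().split())
    return sum(counts[t] for t in query_terms)

def rank(query_terms, docs):
    """Return (doc_id, score) pairs sorted by descending score."""
    scores = {doc_id: tf_score(query_terms, text) for doc_id, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Illustrative corpus (an assumption): a short, focused document and a
# longer document that merely repeats the query term among unrelated text.
DOCS = {
    "short": "semantic search explained",
    "long": "search tips search tricks search history and other "
            "unrelated filler text about search",
}

ranking = rank(["search"], DOCS)
print(ranking)
```

Here the longer document outranks the short one purely because the term occurs more often in it, not because it is more relevant to the user.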
Information retrieval methods have been devised that combine Boolean logic with other techniques such as content-based navigation, where shared terms from previously obtained documents are used to refine and expand the query. Additionally, Boolean operators have been replaced with fuzzy operators that recognize more than simple true and false values, and weighted query expansion has been accomplished using a thesaurus. A thesaurus or dictionary is a common way to expand queries and can be used to broaden the meaning of a term, to narrow it down, or simply to find related terms. The main problem with using a thesaurus is that terms have different meanings depending upon the subject. Thesauri are therefore often used in databases within a specialized field, such as pharmacological and biomedical databases, where they can be constructed manually. While each of the above described improvements provides some benefit to the user, the solution set of documents does not optimally convey the right information to a user if the user does not have an understanding of the subject matter being searched.
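Thesaurus-based query expansion, and the ambiguity problem it introduces, can be sketched in Python as follows. The hand-built `THESAURUS`, the `expand` and `search_any` helpers, and the sample documents are all hypothetical and serve only to illustrate the point.

```python
# Hypothetical hand-built thesaurus: "crane" has a bird sense and a machine sense.
THESAURUS = {
    "crane": {"heron", "hoist"},
}

def expand(query_terms, thesaurus):
    """Return the query terms plus every related term listed in the thesaurus."""
    expanded = set(query_terms)
    for t in query_terms:
        expanded |= thesaurus.get(t, set())
    return expanded

def search_any(terms, docs):
    """Return ids of documents containing at least one of the terms."""
    return {
        doc_id
        for doc_id, text in docs.items()
        if terms & set(text.lower().split())
    }

# Illustrative corpus (an assumption, not taken from the source).
DOCS = {
    "d1": "a heron stood in the marsh",
    "d2": "the hoist lifted the steel beam",
    "d3": "the crane lifted cargo",
}

plain = search_any({"crane"}, DOCS)
expanded_terms = expand(["crane"], THESAURUS)
broadened = search_any(expanded_terms, DOCS)
print(sorted(plain), sorted(broadened))
```

A user interested in lifting equipment receives the document about a heron after expansion, because the thesaurus does not know which sense of the term the user intended.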