1. Field of Invention
This invention relates to a method and system for searching text documents.
2. Description of Related Art
Differing communities of researchers use different words and terminology to express similar ideas or describe similar products. This is especially true of technical “jargon.” An electronics engineer, for instance, might use the word “plastic” where a chemist would refer to the same compound as a “polymer.”
In order to facilitate searches across these technological and taxonomic boundaries, a search system must be able to detect and utilize synonymy and polysemy to expand the original search concept into equivalent areas in other industries. Some researchers have employed simple tree or M×N arrays of classification cross-referencing, but these are either too restrictive or too wasteful of space for an efficient implementation.
Unusual synonymic values must be captured from the search parameters themselves. If the keyword “plastic,” for example, is entered, it should produce an avalanche of possible matches. To constrain the search, the keyword “clear” is added. This establishes a binding connection between clear and plastic. The search might be further constrained by adding “red.” Over time, different searches might reinforce the binding of “clear” and “plastic,” but random color searches would tend to degrade the binding of “red.” Once the relationships are established, “clear plastic”, “optical polymer” and “red lens” could all produce hits to the same database entry.
If the search is further constrained by specifying all hits that did NOT contain a given word, negative binding would be associated with the two keywords, adding additional context to the relationship. For example, “Hot peppers” could initially find a synonym with “Heat transfer.” Specifying not to include “peppers” in the search would weaken the binding link in future searches where “hot” and “heat” appeared to be synonyms.
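The reinforcement, decay, and negative binding described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class name `BindingTable`, the numeric parameters, the rule that every stored binding decays once per search, and the choice to apply the negative binding between each included keyword and each excluded keyword. The text does not specify how bindings are stored or updated.

```python
from collections import defaultdict
from itertools import combinations


class BindingTable:
    """Hypothetical store of pairwise keyword binding strengths."""

    def __init__(self, reinforce=1.0, penalty=1.0, decay=0.9):
        self.reinforce = reinforce  # added when two keywords appear in one query
        self.penalty = penalty      # subtracted for a keyword excluded with NOT
        self.decay = decay          # multiplier applied to every binding per search
        self.strength = defaultdict(float)

    @staticmethod
    def _key(a, b):
        # Order-independent, case-insensitive key for a keyword pair.
        return tuple(sorted((a.lower(), b.lower())))

    def record_search(self, include, exclude=()):
        # Age all existing bindings so that rarely repeated pairs fade.
        for key in self.strength:
            self.strength[key] *= self.decay
        # Reinforce every pair of keywords used together.
        for a, b in combinations(include, 2):
            self.strength[self._key(a, b)] += self.reinforce
        # Negative binding (assumed form): weaken included/excluded pairs.
        for a in include:
            for b in exclude:
                self.strength[self._key(a, b)] -= self.penalty

    def binding(self, a, b):
        return self.strength.get(self._key(a, b), 0.0)


table = BindingTable()
table.record_search(["clear", "plastic"])
table.record_search(["clear", "plastic", "red"])
table.record_search(["red", "paint"])               # unrelated color search
table.record_search(["hot"], exclude=["peppers"])   # "hot" NOT "peppers"

# The repeated pair ends up more strongly bound than the one-off color pair,
# and the excluded pair carries a negative binding.
print(table.binding("clear", "plastic") > table.binding("plastic", "red"))
print(table.binding("hot", "peppers") < 0)
```

A sparse dictionary keyed by keyword pair is used here because it stores only the pairs that have actually been seen, unlike a full M×N array.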
Because jargon and keywords are customarily used as the search parameters, parsing and classification of standard English linguistic expressions is not required, which increases search and expansion speed. Jargon also makes standard thesaurus synonym classifications insufficient, so the system must be self-organizing.
1. Advantages of the Invention
One of the advantages of the present invention is that its output is unique to each user.
Another advantage of the present invention is that it ranks search results based on feedback provided by the user.
Yet another advantage of the present invention is that it requires less processing than other search methods.
These and other advantages of the present invention may be realized by reference to the remaining portions of the specification, claims, and abstract.
2. Brief Description of the Invention
In the preferred embodiment, a keyword or words, phrase, document, or technical jargon or acronym is entered into the search routine. The input data is subjected to Latent Semantic Analysis (LSA) in order to detect the underlying conceptual framework of the input. LSA can work on any input to define a concept, although it is more effective with more data. The result of the LSA is an n-dimensional matrix representing the input parameters within a defined conceptual space.
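For orientation, LSA is commonly formulated as a truncated singular value decomposition of a term-by-document matrix. The block below states that textbook formulation; the symbols X, U_k, Σ_k, V_k, q, and the rank k are notation introduced here and are not taken from this specification, which may use a different construction.

```latex
% Rank-k approximation of the t x d term-by-document matrix X
X \approx X_k = U_k \Sigma_k V_k^{\mathsf{T}},
\qquad U_k \in \mathbb{R}^{t \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{d \times k}

% Folding a new input (term-count vector q of length t) into the concept space
\hat{q} = \Sigma_k^{-1} U_k^{\mathsf{T}} q
```

Under this formulation each document corresponds to a row of V_k, and a new input is mapped into the same k-dimensional space so that the two can be compared directly.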
All documents to be searched have also been subjected to analysis, and have an associated conceptual matrix. A cosine or dot-product of the two matrices results in a conceptual distance calculation. If a document falls within a given distance, it is included as relevant to the search even if it did not contain the exact keywords specified.
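The distance test can be sketched as follows. The sketch assumes that each conceptual matrix has been flattened into a plain vector of equal length, and the function names, the example vectors, and the 0.8 threshold are all hypothetical. Note that a higher cosine value means a smaller conceptual distance.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length concept vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector carries no concept information
    return dot / (norm_u * norm_v)


def relevant_documents(query_vec, doc_vecs, min_similarity=0.8):
    """Return (doc_id, similarity) pairs at or above the threshold, best first."""
    hits = []
    for doc_id, vec in doc_vecs.items():
        sim = cosine_similarity(query_vec, vec)
        if sim >= min_similarity:
            hits.append((doc_id, sim))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Made-up three-dimensional concept vectors for illustration only.
query = [1.0, 0.9, 0.0]
documents = {
    "optical_polymer": [0.9, 1.0, 0.1],
    "hot_peppers": [0.0, 0.1, 1.0],
}
hits = relevant_documents(query, documents)
print([doc_id for doc_id, _ in hits])
```

Because the comparison is made in concept space, a document can pass the threshold without sharing any literal keyword with the query.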
The relevant documents are then processed by a Hierarchical Mixture of Experts (HME) neural network configuration in order to improve the relevance of the information to the individual user. Each user on the system has a unique HME organization. Since a given user may have several different subjects that are being researched at any given time, the HME first classifies the result into conceptual groupings to determine the correct expert configuration to use. The HME then gives a relevance grade to each document depending on factors learned from previous use. The entire list, sorted by significance, is then passed to the user.
The user is able to peruse the summaries of all documents, and selects which ones to view. The data presented is layered such that more detailed information is presented at each layer. The HME is given feedback as to which documents are viewed or not viewed, and the level of detail requested by the user. This feedback is used by the HME to learn the differences between the relevant and irrelevant documents, as defined by the user himself, and do a better job of ranking the documents in the next search.
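The feedback loop can be illustrated with a deliberately simplified model: a single-level mixture of logistic experts with a softmax gate, trained by gradient ascent on viewed or not-viewed labels. This is a sketch, not the architecture of the preferred embodiment. A full HME nests several gating levels, and the class name, feature layout, learning rate, and training data below are all assumptions.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def _softmax(zs):
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def _dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


class MixtureOfExperts:
    """One-level mixture of logistic experts with a softmax gate."""

    def __init__(self, n_features, n_experts=2, lr=0.5):
        self.lr = lr
        # Small distinct gate weights so the experts can specialise.
        self.gate = [[0.01 * (k + 1) * (j + 1) for j in range(n_features)]
                     for k in range(n_experts)]
        self.experts = [[0.0] * n_features for _ in range(n_experts)]

    def relevance(self, x):
        g = _softmax([_dot(w, x) for w in self.gate])
        p = [_sigmoid(_dot(w, x)) for w in self.experts]
        return sum(gk * pk for gk, pk in zip(g, p))

    def feedback(self, x, viewed):
        y = 1.0 if viewed else 0.0
        g = _softmax([_dot(w, x) for w in self.gate])
        p = [_sigmoid(_dot(w, x)) for w in self.experts]
        # Likelihood of the observed label under each expert.
        like = [pk if viewed else 1.0 - pk for pk in p]
        total = sum(gk * lk for gk, lk in zip(g, like))
        # Posterior responsibility of each expert for this example.
        h = [gk * lk / total for gk, lk in zip(g, like)]
        for k in range(len(self.experts)):
            for j, xj in enumerate(x):
                self.experts[k][j] += self.lr * h[k] * (y - p[k]) * xj
                self.gate[k][j] += self.lr * (h[k] - g[k]) * xj


# Features: [bias, weight on concept A, weight on concept B] (hypothetical).
model = MixtureOfExperts(n_features=3)
for _ in range(50):
    model.feedback([1.0, 1.0, 0.0], viewed=True)    # user opens concept-A documents
    model.feedback([1.0, 0.0, 1.0], viewed=False)   # user skips concept-B documents

docs = {"doc_a": [1.0, 1.0, 0.0], "doc_b": [1.0, 0.0, 1.0]}
ranked = sorted(docs, key=lambda d: model.relevance(docs[d]), reverse=True)
print(ranked)
```

Keeping one such model per user is what would make the ranking user-specific while the concept vectors stay shared.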
Because the HME is unique to each user, different users will get different significant documents even if they use the same keywords in the search. The LSA, however, is common to all users. The system can also be used in reverse. Given a document, the HME and LSA analysis can be used to find the user who would be most interested in that particular document, even if several had used the same keywords.
The number of nodes and connections in the data structure is enormous, in the tens to hundreds of thousands. LSA automatically extracts the significant keywords from the entire collection. In addition, more than two keywords can be associated in strong bindings. This prohibits the use of simple M×N matrices for binding, and requires strict limitations on the topological distance between concepts used in the process.
Because of the unique implementation of the HME in conjunction with the LSA matrix, two factors help to minimize the processing required. First, the HME is grown with new conceptual association nodes and experts only as the user begins searches of new concepts. The HME is also pruned over time to remove old search concepts. Of the hundreds or even thousands of significant terms in the LSA matrix, only a few of them are actually relevant to a given conceptual group of documents. The HME expert nodes also prune out consideration of insignificant term pairs in the matrix so that only the significant terms are expressed in the architecture.
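The grow-and-prune behavior can be sketched with a small bookkeeping structure. The class name `ExpertPool`, the idea of measuring age in number of searches, and the idle limit are assumptions chosen for illustration; the text does not state how old search concepts are identified.

```python
class ExpertPool:
    """Hypothetical registry of per-concept experts for one user."""

    def __init__(self, max_idle_searches=3):
        self.max_idle = max_idle_searches
        self.search_count = 0
        self.last_used = {}  # concept group -> search number at last use

    def use(self, concept_group):
        """Record a search in a concept group; return the groups pruned as stale."""
        self.search_count += 1
        # Grow: a group is added the first time the user searches it.
        self.last_used[concept_group] = self.search_count
        # Prune: drop groups that have been idle for too many searches.
        stale = [g for g, n in self.last_used.items()
                 if self.search_count - n > self.max_idle]
        for g in stale:
            del self.last_used[g]
        return stale


pool = ExpertPool(max_idle_searches=2)
pool.use("polymers")
pool.use("optics")
pool.use("optics")
pruned = pool.use("optics")  # "polymers" has now been idle for three searches
print(pruned, sorted(pool.last_used))
```

Bounding the pool this way keeps the per-user structure proportional to the concepts currently being researched rather than to the size of the whole term collection.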
The above description sets forth, rather broadly, the more important features of the present invention so that the detailed description of the preferred embodiment that follows may be better understood and contributions of the present invention to the art may be better appreciated. There are, of course, additional features of the invention that will be described below and will form the subject matter of claims. In this respect, before explaining at least one preferred embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of the construction and to the arrangement of the components set forth in the following description or as illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.