This invention relates to a computer implemented innovation process. More particularly, it relates to the automated retrieval of analogies useful in constructing innovative solutions to problems, and in identifying novel potential applications for existing products. The method of the invention identifies analogous structures, situations and relationships in semantically distant knowledge domains by creating abstract representations of content (vectors) which are characteristic of a given domain of knowledge (source domain) and searching for similar representations in semantically distant (target) domains. The process of this invention does not depend on symbol (i.e., key word) matching.
The ability to recognize novel but relevant analogies from other knowledge domains as clues to potential solutions of problems in a given domain has been a valuable skill in industry. It has long been regarded as a central feature of innovative thinking. Likewise, the ability to recognize new and practical applications for existing products has long been regarded as a valuable talent in marketing. People with such skills are expensive to train, and can generally operate only in a limited range of knowledge domains (i.e., those specialist domains in which they were trained, or had experience). Another serious limitation is in the ability to process large volumes of information, in order to identify relevant analogous situations which have not previously been recognized. Raw information must be analyzed in considerable detail, but there are limits to how much information human beings can process effectively. To make matters worse, most of the raw information used by people in technical and marketing fields is in the form of text. Reading text is an especially slow and tedious means of acquiring new information. This has resulted in the problem of information overload.
Computer implemented processes for information storage, transmission, and retrieval have accelerated the pace of technological change and increased the intensity of competition in business. Such processes have made information of all kinds much more widely available, and greatly increased the speed at which information can be transmitted. However, most present day methods for retrieving electronically stored information rely on matching of symbols (such as key words) and, to this extent, such systems have made the problem of information overload worse.
Different knowledge domains use different symbols, and even those symbols which are common to most knowledge domains (such as the commonly used words in human languages, which account for most of the words in specialist text) can have different meanings in different areas of knowledge. Variations in the meanings of common symbols (such as most words in human languages) from one area of knowledge to another may be radical, or quite subtle. To make matters worse, these relationships are dynamic. As such, it is extremely difficult to pin down, at any particular time, what a given symbol means across a range of situations. Such domain specific variations in meaning, combined with a proliferation of new specialist terms, have forced people to specialize more and more narrowly over time. This increasing specialization has, in turn, accelerated the speciation of new meanings for existing symbols (i.e., words), and new specialist terms.
The trade-off between precision and recall in conventional "key-word" search technology is well known. These types of systems only retrieve records or documents containing exact symbol or word matches, but take no account of context. Unless one searches very narrowly (i.e., for a few domain specific terms, or other specific groups of words in close proximity), one obtains mostly non-relevant material. Because ideas can be expressed in many different ways using different words, a narrow keyword search can miss most of the relevant records. Searching more broadly may require domain specific knowledge on the part of the user (i.e., as to the relevant synonyms for words used in the query, and different ways of expressing related ideas in different domains). Broader searching, however, brings in additional irrelevant material. Searching broadly in several different semantically variant knowledge domains can easily bring in so much irrelevant material that the user is overloaded, and the information retrieved is therefore useless. Another significant disadvantage of systems which retrieve information by symbol matching is that they tend to retrieve only information that is already known. Unless the user is truly an expert in a range of different areas of knowledge, it is almost impossible to use this kind of technology to make connections that are both relevant and novel (i.e., innovative connections). For example, if one searches a large technical database in order to find applications of polyurethanes in telecommunications, one may enter the Boolean expression "polyurethanes AND telecommunications" as a rather broad search strategy. This will retrieve a handful of very well known applications of polyurethanes in the telecommunications field, plus a larger number of irrelevant records (where the two terms co-occur, but in unrelated ways).
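The failure modes described above can be sketched in a few lines of code. The records and query terms below are invented for illustration; the point is only that pure symbol matching returns a false drop and misses a relevant record:

```python
# Boolean "key word" retrieval: a record is returned only if it contains
# every query term as a literal symbol, with no account taken of context.
records = [
    "polyurethane foam used to cushion telecommunications switchgear",  # known application
    "elastomeric cable jacketing compounds for fiber optic networks",   # relevant, but missed
    "telecommunications firm acquires polyurethane raw material maker", # false drop
]

def boolean_and(query_terms, text):
    """True if every query term occurs literally in the text."""
    return all(term in text for term in query_terms)

hits = [r for r in records if boolean_and(["polyurethane", "telecommunications"], r)]
# hits contains records 0 and 2 (record 2 merely co-occurs the two terms
# in an unrelated way), while the relevant record 1 is missed entirely.
```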
One might also use index based search techniques such as subject codes, or art groups, but these tend to focus the results even more tightly on the known applications. It is not possible, by these methods, to find records which, although highly relevant, do not contain both of the terms in the search.
The domains of telecommunications and polyurethanes are semantically distant, although there exists a small percentage of records in the telecommunications domain that do mention polyurethanes (or widely recognized synonyms thereof). It would be a simple matter to edit out these references to known applications (i.e., by employing the Boolean NOT operator, to create a limited set which excludes records in the telecommunications domain that mention polyurethanes). This would result in a very large and, from the viewpoint of a polyurethanes specialist, intractable mass of records dealing with telecommunications. A polyurethanes specialist who wishes to find novel and relevant applications in the telecommunications field would be faced with the choice of:
1) Becoming an expert in telecommunications,
2) Acquiring the services of a telecommunications domain expert (and bringing them up to speed on polyurethanes technology), or
3) Reading (or scanning) thousands of documents on telecommunications, with the hope of finding something relevant.
In practice, the polyurethanes domain expert would be able to take some shortcuts. He could talk to other people in the polyurethanes field who have had more experience in the telecommunications field (i.e., customers of the known applications). This is a variation of Option-2, above. The success of such an approach assumes the existence of contacts who are willing to share their knowledge. The polyurethanes expert could also limit his search of the extensive telecommunications literature by focusing on certain broad categories of applications (i.e., "foams") which are well known for polyurethanes. This is a variation of Option-3 above, which reduces (but does not eliminate) the chances of finding truly novel and relevant applications. This latter approach may still result in an intractably large body of records which must be read.
The above example is an illustration of the difficulties in finding semantically distant analogies which are both useful and novel, as a means for solving problems and developing new end use applications. In its present form, the process of innovation by analogy is highly dependent on chance associations (i.e., contacts between the right people with relevant expertise; coming upon relevant records "by accident"; seeing a "related" material, or procedure, or apparatus, in a different area of technology, etc.). These chance associations are difficult to control. Hence, innovation is difficult to control. Even the most ardent and well supported efforts at innovative problem solving are, at best, extremely high risk propositions. Although the value of innovation, and cross-domain ("interdisciplinary") collaboration is well known, recent trends have been towards sharper focus on narrower knowledge domains (i.e., focusing on "core competencies", and extremely short time horizons). Innovation is fundamentally a process of making and filtering connections.
There has been a proliferation of advanced systems for text retrieval during the 1990s. In spite of this trend, most of the larger commercial sources of on-line (electronic) text based information will only support the conventional "key-word" based retrieval technology. Given that most of the available content is only available in this form, advanced systems for text-processing must often be used to post-process intermediate sets generated from broad key word searches. These intermediate sets may be quite large, and their creation provides a convenient (albeit approximate) means for differentiating between different knowledge domains. Any general method for extraction of semantically distant analogies (from textual data) would need to be capable of operating in this two-step mode, if it is to be implemented in the foreseeable future.
Advanced text search tools are particularly helpful in post processing large intermediate results sets (i.e., by reducing the amount of material which must be read by the user). Likewise, the limited sets can sometimes help to further focus the output of an advanced search engine.
Unfortunately, most of the currently available "advanced" text search tools are limited in their ability to handle highly specialized subject matter (i.e., technical material). Most of the current systems contain fixed machine readable thesauri, which they use in order to identify words of related meaning (i.e., synonyms) to expand terms in user queries. Although these fixed, manually constructed thesauri sometimes contain different meanings (i.e., different definitions, or examples of differing usage) for the same words (which meanings can be selected by users, in constructing searches), the range of available meanings is predetermined and quite limited. Commercial systems of this type rarely contain specialized technical terms such as "polyisocyanurate", or "polyisocyanate", or "magnetohydrodynamic", or "elastomer". Thesaurus based search engines can be expanded; however, this requires considerable manual effort and often the services of experts from all the domains in which the system is expected to perform. Moreover, the user may be forced to make an inconveniently large number of word-meaning choices in constructing searches with such an "expanded" system. The user may not, however, always have the knowledge to choose correctly, since word meanings vary considerably between different specialist knowledge domains. Text search tools of this class are particularly well suited to non-specialist text domains such as databases of newspaper articles. Existing search tools give some consideration to word context, a major advance over simple "key symbol" (key word) matching.
A significant disadvantage of fixed thesaurus-based text search systems is their inherent bias (as to what words they consider to be "closely related"). As an example, in a popular search engine of this type the word "spider" is shown to be synonymous with the word "arachnid" although, strictly speaking, this is a generic relationship. The word "web", however, which most people immediately associate with the word "spider", is not listed as related. Most commercial thesaurus-based text search tools, although a significant advance over conventional "key word" based retrieval technology, operate on the principle of symbol matching.
Another class of advanced computer based text retrieval systems uses abstract mathematical representations of symbols in text (symbols such as words, word stems, and phrases) in order to capture and quantitatively compare the context specific "meaning" of said symbols. Systems of this type reduce the individual symbols to vectors in a high dimensional space. The relationships between the vectors in this space ("semantic space") can capture information about the co-occurrence patterns between symbols that convey what those symbols mean in a given context (latent semantics). Once this co-occurrence information has been captured, it becomes possible to compare the "meaning" of individual symbols and/or higher order structures (such as groups of words or groups of word stems, sentences, paragraphs, whole documents, and even large bodies of documents) quantitatively, by using the known operations of vector mathematics. These operations include, for example, the summing and normalization of vectors (to represent higher order structures in text), the calculation of dot products between vectors, or the calculation of angles between vectors. Words and higher order structures in text are in effect replaced by a geometric representation, in which individual symbols and higher order structures become points in the high dimensional semantic space. The determination of the relative proximity of "meaning" of different terms, documents, etc., is thereby reduced to measuring the proximity of the corresponding points in this space. Information retrieval becomes a mathematical operation, which does not depend on matching of symbols. Relevant documents can be retrieved from a database, even if they do not contain any of the "key words" in the query statement.
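As a minimal sketch of this geometric representation, assuming invented 3-dimensional word vectors (a real system would learn several hundred dimensions automatically from a training corpus):

```python
import numpy as np

# Invented toy word vectors; a real system would learn these automatically.
word_vectors = {
    "spider":  np.array([0.9, 0.1, 0.0]),
    "web":     np.array([0.8, 0.2, 0.1]),
    "foam":    np.array([0.0, 0.9, 0.3]),
    "polymer": np.array([0.1, 0.8, 0.4]),
}

def text_vector(words):
    """Represent a higher order structure (phrase, document, etc.) as the
    normalized sum of its word vectors."""
    v = sum(word_vectors[w] for w in words)
    return v / np.linalg.norm(v)

doc_a = text_vector(["spider", "web"])
doc_b = text_vector(["foam", "polymer"])
query = text_vector(["web"])

# Proximity of "meaning" reduces to a dot product between points in the space.
sim_a = float(np.dot(query, doc_a))
sim_b = float(np.dot(query, doc_b))
# sim_a exceeds sim_b: the spider/web document is "closer" to the query,
# with no symbol matching involved.
```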
Vector based information retrieval systems can be used not only for comparing the similarity of meaning of records (such as pairs of documents), but also provide a convenient means, in principle, for the quantitative determination of the semantic distance between different domains of knowledge (i.e., wherein each of said domains is represented by a domain specific body of records, in a single "semantic space").
A number of different variations of the vector principle are known. Some systems do not depend on any fixed (manually assembled, manually updated) sets of definitions for symbols (i.e., definitions for words in a body of text). They require only an example of text (preferably a large example) from which word relationships are automatically extracted, and represented in vector form for use in subsequent information retrieval. The text example, from which word relationships are extracted in these systems, is commonly referred to as a training corpus.
Vector based information retrieval systems also provide convenient quantitative methods for representing "semantic distance" (i.e., between items of text such as different documents, or groups of documents). Vector representation methods are not the only tools in existence for representing the "semantic distance" between records (such as documents), but they have an advantage in that the measure of semantic similarity (or distance) is independent of exactly how the documents (or document segments, or document clusters) are related. There need not be any domain specific word matches between the documents. There need not be any specific "thesaurus mediated" word matches of the type described previously. The process of assessing semantic interrelatedness, or distance, is driven by the natural and characteristic relationships between the words in the training corpus. Most of the words (and/or word stems) in the training corpus will be interrelated in some way, usually more than one way. Hence, there are no arbitrary boundaries to word relationships, such as those imposed by a thesaurus. If two words co-occur frequently in the training corpus [such as "spider" and "web" for example], then they will show up as related (having a similarity of meaning which is specific to the training corpus) in the outcomes of searches. Consequently, it would be possible to enter the term "web" in a search query and retrieve documents (or relevant segments thereof) dealing with spiders and arachnids, even if the term "web" does not appear at all in those specific records.
The relationships between symbols, in a vector based retrieval system, can be much more comprehensive because even the most arcane and specialized terms in the training corpus can be represented, automatically, in the high dimensionality vector space. The range of possible relationships is unrestricted (since most terms in the training corpus are interrelated), and natural (i.e., specific to the training corpus, or to a particular domain of knowledge which the corpus may represent). The possibilities for making connections in such an environment are extremely rich. Moreover, the paths through this network of connections are prioritized (some more likely to be traversed than others) because domain specific (or, at least, corpus specific) "rules" are encoded, automatically, in the network of word relationships captured from the training corpus.
Several different variations of the context vector principle are known. One variation is described in U.S. Pat. No. 5,619,709, and the related case, U.S. Pat. No. 5,794,178. These two references are incorporated herein by reference, in their entirety. In the preferred embodiments of the inventions described in these two references, symbols (i.e., words or word stems) from a large and domain specific or user specific example of text (a "training corpus") are automatically reduced to vectors, said vectors representing the relationships between said symbols, which relationships are characteristic of the training corpus. Thereby, if the training corpus is a sufficiently large representative body of text from a given domain of knowledge, the relationships between the symbols in this training corpus will constitute a reliable representation of the symbol (word or word stem) relationships characteristic of that particular knowledge domain (i.e., polyurethanes). This representation will be a "snapshot" in time, but readily capable of being updated by repeating the training process at a later time, with an updated training corpus. In the preferred embodiments according to these references, the training (vector setting, or "learning") process is conducted using a neural network algorithm. In this vector setting process, an initial set of vectors is assigned randomly to the symbols in the training corpus. The vector values are then optimized by an iterative process, described in detail in the references, whereby the final values of the vectors come to accurately represent the relationships between the symbols in the training corpus. In the event that the training corpus is a body of text then the "symbols" are the words and/or word stems within that body of text.
In the preferred embodiments described in the above cited references, the number of dimensions in the vector space employed is smaller than the number of words (and/or word stems) in the training corpus. This has a number of practical advantages, including the most efficient utilization of the computational resources of the computer hardware on which the system is running. Typically, a vector space of between about 200 and about 1000 dimensions is used.
Vectors are fundamentally numbers, having "components" in each of the dimensions of the vector space used. Reducing the "meaning" of words in a body of text to vectors in a space of limited (but sufficiently large) number of dimensions has a number of unique advantages. Among these, the relative similarity of word meanings can be represented quantitatively in Context Vector Technology, CVT, by the degree to which their vectors overlap [i.e., the "dot product" of the individual word vectors]. If the words have similar meaning within the context of the training corpus, the dot product of their vectors will be relatively high. If they have no similarity, the dot product of their vectors will be relatively low (zero, or very close to zero). Subtle gradations of meaning between different words in the training corpus and, hence, the knowledge domain it represents can thereby be captured. The "meaning" of words in the training corpus is encapsulated in the way they associate (i.e., the relationships between the words). If the training corpus for a given knowledge domain is sufficiently large, then the pattern of word relationships will be stable (i.e., will not vary significantly with sample size). This stable pattern may be captured quantitatively in the vectorization process as an accurate representation of the knowledge domain from which the training corpus was assembled. Context vector technology focuses on reducing the relationships between symbols (such as words) to mathematics (geometry). It is fundamentally different from other methods of text retrieval which are based, directly or indirectly, on symbol matching. Given that vectors can be added, it is possible to reduce the meaning of groups of words to vectors which represent the "meaning" of sentences, paragraphs, documents, groups of documents, etc.
As with the individual words (and/or word stems), it is possible to quantitatively compare the domain specific "meaning" of such word groupings by calculating the dot products of their corresponding vectors. Likewise, queries on databases of documents can be reduced to vectors and said "query vectors" compared to the vectors of the individual documents (and/or document segments) in the database, by computation of vector dot products. The documents [or document segments] that are most similar in "meaning" (i.e., having the highest dot products with the query vector) are retrieved and displayed in ranked order. In addition to simple relevance ranking, the semantic (or "meaning") relationships between documents or document segments can be represented by relative positioning on a two or three dimensional graph (visualization). Documents (or segments) of similar meaning will be clustered together on the graph, whereas those of less similar meaning will be farther apart. The distances between documents, document segments, or even clusters of documents on the space of the visual graph will be a quantitative measure of the degree to which their content is similar. This method of visualization is one good way, although not the only way, of visualizing the "semantic distance" (i.e., between individual documents, document clusters, or whole knowledge domains).
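A sketch of this query-and-rank step, using invented unit document vectors in place of vectors produced by a trained system:

```python
import numpy as np

# Invented "document vectors", normalized to unit length (stand-ins for
# vectors a trained context vector system would produce).
docs = np.array([
    [0.9, 0.4, 0.2],
    [0.1, 0.2, 0.97],
    [0.5, 0.85, 0.2],
])
docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)

# The query is reduced to a vector in the same semantic space.
query = np.array([1.0, 0.3, 0.1])
query = query / np.linalg.norm(query)

scores = docs @ query               # dot product with every document vector
ranking = np.argsort(scores)[::-1]  # indices in ranked order of relevance
# ranking is [0, 2, 1]: document 0 is most similar in "meaning" to the query.
```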
Some additional references which are highly relevant to the "context vector" principle for information retrieval include U.S. Pat. No. 5,675,819; U.S. Pat. No. 5,325,298; and U.S. Pat. No. 5,317,507. These patents are incorporated herein fully by reference.
As with CVT, latent semantic indexing, LSI, involves the automatic representation of terms (words, stems, and/or phrases) and documents from a large body of text as vectors in a high dimensional semantic space. The meaning (closeness in the semantic space) of documents and/or terms can be compared by measuring the cosines between the corresponding vectors. Items (terms or documents) which have similar "meaning" will be represented by vectors pointing in similar directions within the high dimensionality semantic space (as measured by the cosine values). As with CVT, LSI uses an automatic process to capture the implicit higher order structure in the association of symbols in a body of text, and uses this associational (co-occurrence) structure to facilitate retrieval without depending on symbol matching. The process and its application is described in greater detail in U.S. Pat. No. 4,839,853 and in J. Am. Soc. Info. Sci., Vol. 41(6), 391-407 (1990), which are incorporated herein by reference.
The LSI process uses the technique of singular value decomposition, SVD, to decompose a large term by document matrix, as obtained from a training corpus, into a set of orthogonal factors (i.e., on the order of 100 factors) which can be used to approximate the original matrix by linear combination. The optimum number of factors (dimensions) is determined empirically (i.e., the value of about 100 is said to give the best retrieval performance). In the LSI process the number of dimensions in the original term by document matrix is substantially reduced, and then approximated by smaller matrices. This is considered critical to the performance of the process. The number of factors (dimensions) must be large enough to model the "real structure" of the data (the implicit higher order semantics, encapsulated within the major associational structures in the matrix) without modeling noise or unimportant details (such as small variations in word usage). The optimum number of dimensions in the semantic space (in which terms and documents are represented by vectors) is therefore a compromise. This compromise value is similar in both LSI and CVT (from 100 to several hundred dimensions).
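The SVD step can be illustrated on a toy term-by-document matrix (4 terms, 4 documents, 2 retained factors; real corpora involve thousands of terms and on the order of 100 factors). The matrix values are invented:

```python
import numpy as np

# Toy term-by-document matrix: rows are terms, columns are documents.
# Documents 0-1 use "spider"/"web"; documents 2-3 use "foam"/"polymer".
A = np.array([
    [2., 1., 0., 0.],   # spider
    [1., 2., 0., 0.],   # web
    [0., 0., 3., 1.],   # foam
    [0., 0., 1., 3.],   # polymer
])

# Singular value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Retain only the k largest factors; their linear combination approximates
# the original matrix, modeling the major associational structure while
# discarding noise and small variations in word usage.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Documents become vectors (rows) in the k-dimensional factor space.
doc_space = (np.diag(s[:k]) @ Vt[:k, :]).T

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

close = cosine(doc_space[0], doc_space[1])  # spider/web documents: near 1.0
far = cosine(doc_space[0], doc_space[2])    # spider vs. foam documents: near 0.0
```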
LSI, like CVT, can in principle be used as a method for representing the "semantic distance" between bodies of documents. Such a representation of distance could, in principle, also be approximated in a two or three dimensional space and displayed to the user as a cluster diagram.
Queries in LSI are handled in a manner similar to that described for CVT. A vector representation of the query is calculated from the symbols (terms) in the query, and the position of this "query vector" is located in the semantic space obtained from the original SVD operation. The query thus becomes a "pseudo document" in the vector space. The query vector is compared to the vectors of other documents in the space, and those documents which are "closest" to the query, in the semantic space, are retrieved. As in CVT, the retrieved documents may then be displayed, in ranked order of relevance, to the user. In LSI the measurement of "closeness", between query and document vectors, is performed by comparing the cosines between vectors. The precise methodology by which queries (and other pseudo documents) are placed in the high dimensionality semantic space, as obtained from the SVD operation on the original term by document matrix (from the training corpus), is described in greater detail in U.S. Pat. No. 4,839,853 and J. Am. Soc. Info. Sci., Vol. 41(6), 391-407 (1990). In simple terms, the method involves placing the pseudo document at the vector sum of its corresponding term points.
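The fold-in of a query as a pseudo document can be sketched as follows, on an invented toy matrix. The scaling q @ U_k @ inv(S_k) is the standard LSI fold-in formulation:

```python
import numpy as np

# Toy term-by-document matrix (terms: spider, web, foam, polymer).
A = np.array([[2., 1., 0., 0.],    # spider
              [0., 2., 0., 0.],    # web
              [0., 0., 3., 1.],    # foam
              [0., 0., 1., 3.]])   # polymer
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_space = (np.diag(s[:k]) @ Vt[:k, :]).T   # documents in the factor space

# Fold the query in as a "pseudo document": place it at the scaled vector
# sum of its term points, q_k = q @ U_k @ inv(S_k).
q = np.array([0., 1., 0., 0.])               # query consisting of the term "web"
q_k = q @ U[:, :k] @ np.linalg.inv(np.diag(s[:k]))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores = [cosine(q_k, d) for d in doc_space]
# Document 0 contains only the term "spider", yet scores near 1.0 on the
# query "web"; documents 2 and 3 (foam/polymer) score near 0.0.
```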
Vector based retrieval technology is not without its disadvantages. For example, a CVT system which has been trained on non-specialist text, such as a newspaper database (serving as the training corpus), may not perform as well as a "thesaurus" based text retrieval tool when applied to searches on technically specialized bodies of text. Likewise, it is known that LSI has difficulty handling words with multiple, domain specific, meanings (polysemy). This problem becomes especially noticeable if the LSI training corpus (the body of documents used in developing the term by document matrix) covers several disparate knowledge domains. It is due to the fact that individual terms are represented as single points in space. Therefore a word that has multiple meanings in the training corpus will be assigned a position in the semantic space which is a "weighted average" of all its different meanings, and may not be appropriate for any one of them. Similar types of problems may be encountered when using CVT. Use of a domain focused training corpus (in which most terms have stable patterns of association) is one way of minimizing such problems.
Although an extremely large number of different concepts may be encoded (learned) in the high dimensionality vector space, it is logistically impossible to train the system on "every concept". The technical consequences of training such a system on a very large number of diverse knowledge domains simultaneously are not, however, fully understood.
Another disadvantage of vector based retrieval systems is a tendency to over-generalize. This is especially problematic with queries containing multiple terms (words). It can result, in some circumstances, in the retrieval and inappropriately high relevance ranking of more non-relevant records (documents) than certain thesaurus-based text retrieval systems. This tendency for over-generalization may be due to the large number of connections (word associations) open to the system, for each term in the query. For example, a query containing the term "hydrofluorocarbon" may find documents which contain specific examples of hydrofluorocarbons but no matches on the exact term "hydrofluorocarbon". This kind of generalizing can be an extremely valuable feature of vector based retrieval systems. Unfortunately, the same query may retrieve documents on hydrochlorofluorocarbons and chlorofluorocarbons, said documents not mentioning anything about hydrofluorocarbons. This kind of generalization can be quite unwelcome if the user is interested specifically in hydrofluorocarbons.
The ability to generalize is extremely valuable in innovation. However, it is important to have some way of controlling the extent to which the system "generalizes." This control mechanism must be selective, allowing for more generalization on some query terms than on others. There are known methods for controlling the extent and direction of "generalization" in vector based retrieval systems. These methods involve user feedback as to the relevance of intermediate search results. The user may, for example, select certain records which most closely approximate his needs and employ these records, or selected portions thereof, as a subsequent search query (a "more like" query). This kind of user feedback is also well known in thesaurus based text retrieval technology. Other kinds of user feedback actually involve a re-adjustment of vectors in response to the user's selections (i.e., user feedback "tuning" of categories in CVT). This kind of feedback has a lasting effect on the system and is a learning process. This form of user feedback learning can be of particular value in forcing a vector based retrieval system to "generalize" in directions which are most appropriate to the user's needs, but without restricting the system to specific symbol (i.e., key word) matches. It can also be used to force the system to retrieve only documents which fit a plurality of pre-tuned categories (i.e., fitting beyond a desired threshold, for each category). Both types of user feedback are well known in the art.
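One classical way to realize such feedback is a Rocchio-style query update, sketched below with invented vectors. This is a generic relevance-feedback technique, not the specific category-tuning mechanism of the cited CVT patents:

```python
import numpy as np

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward records the user marked relevant and
    away from records marked non-relevant (Rocchio relevance feedback)."""
    q = alpha * query
    if len(relevant):
        q = q + beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        q = q - gamma * np.mean(nonrelevant, axis=0)
    return q / np.linalg.norm(q)

# Invented unit vectors for a query and two user-judged documents.
query   = np.array([1.0, 0.0, 0.0])
hfc_doc = np.array([0.8, 0.6, 0.0])    # marked relevant (hydrofluorocarbons)
cfc_doc = np.array([0.8, -0.6, 0.0])   # marked non-relevant (chlorofluorocarbons)

tuned = rocchio(query, [hfc_doc], [cfc_doc])
# The tuned query scores the relevant document higher, and the non-relevant
# document lower, than the original query did, steering "generalization"
# in the direction of the user's needs.
```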
It is important to recognize that the problem of "over generalization" which can occur in vector based retrieval systems is quite different from what happens in simple "symbol matching" (i.e., "key word" search) systems. In the latter, errors result when the system retrieves records that have the right symbols, but in the wrong context. These "false drops" seldom have any conceptual relationship with the query. In the former case, errors result when the system retrieves records that have the wrong symbols in the correct context. The erroneous records are usually quite closely related (conceptually) to the query, but at variance with the specific needs of the user. These kinds of errors are "fixable" through techniques such as user feedback optimization and, in fact, represent an overuse of a "strength".
The value of semantically distant (or cross domain) analogies in problem solving is well recognized. There have been past attempts at the development of computer based processes for retrieving such analogies in a systematic (problem specific) way. One such method is disclosed by M. Wolverton in "Retrieving Semantically Distant Analogies" [Doctoral Dissertation; Department of Computer Science; Stanford University, May, 1994], which is incorporated herein by reference. Wolverton describes a process for the searching of large multi-purpose, multi-domain knowledge bases for cross domain analogies to specific problem situations. Many of the concepts Wolverton demonstrates in his process may potentially be applicable in the context of the instant invention. Foremost among these are means for representing semantic distance, the use of spreading activation, and the application of knowledge gained in an initial search in order to re-direct subsequent searching. However, it is unclear how one would apply Wolverton's process in the searching of raw data sources, such as text-based databases. The information (text) in these data sources is highly heterogeneous. Converting this raw data into a knowledge base format appropriate to the Wolverton process (as described in the reference) would be an extremely labor intensive task. This would be particularly true for knowledge domains having many specialized terms and word meanings. It would be necessary to reconcile, in advance, all the different domain-specific meanings of all the terms in all the knowledge domains in which the system must operate. Clearly, this problem is closely analogous to the difficulties described above with using thesaurus-based text search systems on highly specialized bodies of text.
It would be of much greater practical value to have a method for finding semantically distant analogies that could be used on highly heterogeneous bodies of raw text, without the need for manual pre-preparation or the need for defining any terms (words).
The invention is directed to a universal computer-implemented method for finding analogies to specific terms (representing compositions, relationships, structures, functions, or applications) in a first preselected (and well defined) knowledge domain (or set of knowledge domains) by searching a second preselected knowledge domain (or set of knowledge domains) which is semantically distant from the first. It is a feature of the invention that the content of said second knowledge domain (or set of domains) is retrieved in isolation from the first.
The method of the invention comprises the automated generation of an abstract representation of terms from a first user selected knowledge domain (source domain), said representations encoding (capturing, in abstract mathematical form) the co-occurrence patterns of terms characteristic of the source domain, and application of said representations to the efficient (selective) discovery of analogous objects [terms, or groups of terms, of similar meaning] in one or more semantically distant target domains. The abstract representations are most preferably vectors in a high dimensionality space. A small subset of terms (or groups of terms, such as phrases) is chosen from the source domain, said terms in the subset being substantially absent from the target domains and having substantially no known equivalents (such as synonyms) in the target domains. These source domain specific terms (in this user defined subset) are those for which "analogous objects" are sought in the target domains. These analogous objects are terms or groups of terms from the target domains, which are in some way related to the chosen source domain terms (i.e., having a similar semantic role, or "meaning"). The method of the invention is capable of efficiently (selectively) retrieving analogous content and ranking by degree of similarity, without any a priori specification of the nature of the analogy.
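The process described above can be sketched in miniature as follows. This is an illustrative toy implementation under simplifying assumptions (a tiny corpus, raw co-occurrence counts over a shared context vocabulary, cosine similarity for ranking); the invention does not prescribe these particular choices, and all names and data here are hypothetical:

```python
from collections import Counter
from math import sqrt

def cooccurrence_vector(term, corpus, context_terms, window=5):
    """Build a vector for `term` by counting its co-occurrences with
    each shared context word within a +/- `window` token neighbourhood."""
    counts = Counter()
    for doc in corpus:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok == term:
                lo, hi = max(0, i - window), i + window + 1
                for neighbour in tokens[lo:i] + tokens[i + 1:hi]:
                    if neighbour in context_terms:
                        counts[neighbour] += 1
    return [counts[c] for c in context_terms]

def cosine(u, v):
    """Degree of similarity between two co-occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical context vocabulary shared by source and target domains.
context = ["pumps", "fluid", "pressure", "flow"]
source_docs = ["the heart pumps fluid under pressure to maintain flow"]
target_docs = ["the compressor pumps fluid raising pressure and flow",
               "the ledger records debits and credits"]

# A source-domain-specific term absent from the target domain.
source_vec = cooccurrence_vector("heart", source_docs, context)

# Rank candidate target-domain terms as analogous objects.
candidates = ["compressor", "ledger"]
ranked = sorted(
    candidates,
    key=lambda t: cosine(source_vec,
                         cooccurrence_vector(t, target_docs, context)),
    reverse=True)
```

Because the comparison is between co-occurrence patterns rather than between the terms themselves, "compressor" can be retrieved as analogous to "heart" even though the two symbols never co-occur, consistent with the invention's independence from key word matching.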