Due to the rapid expansion of electronic media storage capabilities, research analysts now have available massive databases of information that can be analyzed for various purposes. A major problem in practice is how to extract the essential information from these databases in a form readily amenable to analysis and interpretation. In other words, how does the analyst extract the collective wisdom contained in these large databases in a concise, readily understandable form?
The user may directly interact with the process of the present invention to select the often-repeated phrases of high user interest, referred to herein as pervasive theme areas (PTAs). Alternatively, at the start of the process, the user may provide criteria for selection of the PTAs, enabling the process of the present invention to define and select the PTAs from the sorted list content. Until recently, the co-word methodologies used key, index, or title words as input, rather than input taken directly from the text. This led to widespread criticism that the method's input was subject to sources of bias and error from the indexing process, and the validity of the final results was questioned. In addition, the text database tended to consist of published papers, which had the index and key words; this also limited the types of text available for analysis. Recently, some progress has been reported in computer-assisted indexing using only the words found in the published literature.
Modern quantitative techniques utilize computer technology extensively, usually supplemented by network analytic approaches, and attempt to integrate disparate fields of information. One class of techniques exploits the use of co-occurrence phenomena. In co-occurrence analysis, phenomena that occur together frequently in some domain are assumed to be related, and the strength of that relationship is assumed to be related to the co-occurrence frequency. Networks of these co-occurring phenomena are constructed, and then maps of evolving topic fields are generated using the link-node values of the networks. Using these maps of structure and evolution, the information analyst can develop a deeper understanding of the interrelationships among the different information fields and the impacts of external intervention, and can recommend new directions for more desirable information portfolios.
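The co-occurrence approach described above can be illustrated with a minimal sketch (the function names, the document-level co-occurrence domain, and the frequency threshold are illustrative assumptions, not a description of any particular prior-art system):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how often each pair of distinct terms appears together in the
    same document.  The co-occurrence domain here is a whole document; a
    sentence or phrase window could be substituted."""
    pair_counts = Counter()
    for doc in documents:
        terms = sorted(set(doc.lower().split()))
        for a, b in combinations(terms, 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def build_network(pair_counts, threshold=2):
    """Keep only pairs whose co-occurrence frequency meets the threshold;
    the surviving (term, term, count) triples serve as the link-node
    values of the co-occurrence network."""
    return [(a, b, n) for (a, b), n in pair_counts.items() if n >= threshold]

docs = [
    "laser optics research",
    "laser optics experiment",
    "optics laser theory",
]
counts = cooccurrence_counts(docs)
links = build_network(counts, threshold=3)  # only the strongest pair survives
```

In this toy corpus the pair ("laser", "optics") co-occurs in all three documents, so it is the only link retained at the chosen threshold; mapping and interpretation of such link-node values is the analyst's subsequent step.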
One approach to co-occurrence analysis is co-word analysis. The origins of co-word phenomena can be traced back to the pioneering work in: 1) lexicography, to account for co-occurrence knowledge, and 2) linguistics, to describe how the affinity of two language units correlates with their appearance in the language.
In early co-word studies, words were classified on the basis of their co-occurrence with other words as well as their meanings. It was, however, observed that the reasons for two words co-occurring in the same context are not always relevant to a general linguistic description of a given language. The sensitivity of the well-formedness of sentences to their lexical levels, i.e., how sensitive the meaning of a sentence is to substitution for one member of a co-occurrence pair, has been studied. A recent study included collocations as part of a linguistic model whose goal was to relate any given meaning to all the texts that express it. Information retrieval research has focused on designing more efficient indexing tools using pairwise lexical affinities instead of keywords. Methods have been developed for locating interesting collocational expressions in a large body of text. These methods were based principally on the distribution of types and tokens in the body of text and on the analysis of the statistical patterns of neighboring words.
In the mid-1970s, a study was performed to examine relationships among themes in a novel using co-occurrence phenomena. An important term in the book was chosen, and a dictionary was constructed of all words in the book occurring in the same sentences as that word. A co-occurrence matrix which contained the co-occurrences among these related terms was constructed, and analyzed to eventually show the relations among all the associated terms in the mini-dictionary as they occurred in the original text. While the dictionary was restricted to single words, and the co-occurrence domain was restricted to sentences, the methodology did represent a major step forward in extracting word relations from text by their co-occurrences.
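The sentence-level procedure described above, building a mini-dictionary around one important term and then counting co-occurrences among the dictionary words, can be sketched as follows (the function and variable names are illustrative assumptions; the actual mid-1970s study is not reproduced here):

```python
import re
from itertools import combinations

def sentence_cooccurrence(text, seed):
    """Collect a mini-dictionary of every word sharing a sentence with the
    seed term, then count co-occurrences among those dictionary words
    across all sentences of the text."""
    sentences = [s.lower().split() for s in re.split(r"[.!?]", text) if s.strip()]
    # Mini-dictionary: words appearing in the same sentence as the seed term.
    dictionary = set()
    for words in sentences:
        if seed in words:
            dictionary.update(words)
    dictionary.discard(seed)
    # Sparse co-occurrence matrix over the dictionary, keyed by word pairs.
    matrix = {}
    for words in sentences:
        present = sorted(dictionary.intersection(words))
        for a, b in combinations(present, 2):
            matrix[(a, b)] = matrix.get((a, b), 0) + 1
    return dictionary, matrix
```

Analyzing the resulting matrix then reveals the relations among the associated terms as they occurred in the original text, as the study describes.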
A recent update of this method employed frequency of co-occurrence to extract relatedness information from text. The study examined co-occurrence using the sense definition as the textual unit (the entire definition of one sense of a word). The database used was the Longman Dictionary of Contemporary English (LDOCE) rather than free text. The method used single-word frequencies only, and resulted in the construction of networks of related words. It was concluded that co-occurrences of words from the LDOCE controlled vocabulary within LDOCE definitions appeared to provide some useful information about the meanings of those words. Co-occurrence frequency correlated significantly with human judgements of relatedness, and relatedness functions computed on the co-occurrences yielded even higher correlations.
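One simple relatedness function on co-occurrence counts, of the general kind alluded to above, normalizes the pair frequency by the individual word frequencies. This particular Jaccard-style coefficient is an illustrative assumption; the study's actual functions are not reproduced here:

```python
def relatedness(pair_count, count_a, count_b):
    """Jaccard-style relatedness: number of shared contexts divided by the
    number of contexts in which either word appears.  Ranges from 0 (never
    together) to 1 (always together)."""
    denom = count_a + count_b - pair_count
    return pair_count / denom if denom else 0.0
```

For example, two words appearing in 4 and 5 contexts respectively, with 3 shared, score 3 / (4 + 5 - 3) = 0.5. Such normalization is one way a relatedness function can correlate with human judgements more strongly than raw co-occurrence frequency.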
While the methods described above were useful for showing how relations among words and terms could be quantified and extracted from text, none were applied to extract relationships among topics of interest from large text databases.
It has been concluded in a recent study that co-word analysis may be satisfactorily performed on a set of documents using either title words or keywords, and that the main difference between the results obtained is that keywords provided a much more detailed account of the subject studied for the case in point.
While there has been some progress in overcoming the dependency of co-word analysis on key or index words, limitations remain. From the above it is clear that the prior art has taken many varied approaches to searching out related topic information from individual or multiple target databases. In light of the various problems and limitations of the foregoing approaches, a need exists for a simpler methodology less dependent on user bias and influence.