1. Field of the Invention
The present invention relates generally to literature searching and more specifically to the extraction of useful information from large text databases.
2. Description of the Background Art
Data mining is the extraction of useful information from any type of data. In a modern context, it is the employment of sophisticated computer algorithms to extract useful information from large quantities of data. Text mining is an analogous procedure applied to large volumes of free unstructured text. S&T (Science and Technology) text mining is the application of text mining to highly detailed technical material. It is the primary technique for extracting useful information from the global technology literature.
The added complexity of text mining relative to data mining stems from the multiple meanings and interpretations of language, and their intrinsic dependence on context. The further complexity of S&T text mining relative to text mining of non-technical material arises from the need to generate a lexicon for each technical area mined, and the need to have technical experts participate in the analysis of the technical material.
There are three major components of S&T text mining:
1) Information Retrieval
2) Information Processing
3) Information Integration
Information retrieval is the selection of relevant documents or text segments from source text databases for further processing. Information processing is the application of bibliometrics, computational linguistics, and clustering techniques to the retrieved text, typically to provide ordering, classification, and quantification of the formerly unstructured material. Information integration combines the computer output with human cognitive processes to produce a greater understanding of the technical areas of interest.
Underlying these three text mining components are five conditions required for high quality text mining. The quality of a text mining study cannot exceed the quality of any of these conditions.
1) A large fraction of the S&T conducted globally must be documented (INFORMATION COMPREHENSIVENESS).
2) The documentation describing each S&T project must have sufficient information content to satisfy the analysis requirements (INFORMATION QUALITY).
3) A large fraction of these documents must be retrieved for analysis (INFORMATION RETRIEVAL).
4) Techniques and protocols must be available for extracting useful information from the retrieved documents (INFORMATION EXTRACTION).
5) Technical domain and information technology experts must be closely involved with every step of the information retrieval and extraction processes (TECHNICAL EXPERTISE).
The approaches presently used by the majority of the technical community to address all five of these requirements have serious deficiencies.
1) Information Comprehensiveness is limited because there are many more disincentives than incentives for publishing S&T results. Apart from academic researchers working on unclassified and non-proprietary projects, S&T performers have little motivation for documenting their output.
a) For truly breakthrough research, from which the performer would be able to profit substantially, the incentives are to conceal rather than reveal.
b) For research that aims to uncover product problems, there is little motivation (from the vendor, sponsor, or developer) to advertise or amplify the mistakes made or the shortcuts taken.
c) For highly focused S&T, the objective is to transition to a saleable product as quickly as possible; no rewards are forthcoming for documentation, and the time required for documentation reduces the time available for development.
Therefore, only a very modest fraction of S&T performed ever gets documented. Of the performed S&T that is documented, only a very modest fraction is included in the major databases. The contents of these knowledge repositories are determined by the database developers, not the S&T sponsors or the potential database users.
Of the documented S&T in the major databases, only a very modest fraction is realistically accessible by the users. The databases are expensive to access, not very many people know of their existence, the interface formats are not standardized, and many of the search engines are not user-friendly.
Insufficient documentation is not merely an academic issue; in a variety of ways, it retards the progress of future S&T and results in duplication of effort.
2) Information Quality is limited because uniform guidelines do not exist for contents of the major text fields in database records (Abstracts, Titles, Keywords, Descriptors), and because of logic, clarity, and stylistic writing differences. The medical community has some advantage over the non-medical technical community in this area, since many medical journals require the use of Abstracts that contain a threshold number of canonical categories (Structured Abstracts), while almost all non-medical technical journals do not.
Compatibility among the contents of all record text fields (field consonance) is not yet a requirement. As our studies have shown, incompatibility among fields can lead to different perspectives of a technical topic, depending on which record field is analyzed. The field consonance condition is frequently violated, because the Keyword, Title and Abstract fields are used by their creators for different purposes. This violation can lead to confusion and inconsistency among readers.
3) Information Retrieval is limited because time, cost, technical expertise, and substantial detailed technical analyses are required to retrieve the full scope of related records in a process that is both comprehensive and yields a high fraction of relevant records. Of all the roadblocks addressed in this section, this is the one that probably attracts the most attention from the Information Technology (IT) community. Because much of the IT community's focus is on selling search engine software and automating the information retrieval process, it bypasses the ‘elbow grease’ component required to get comprehensive and high signal-to-noise retrieval.
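The two retrieval goals named above can be made concrete with a short sketch. It assumes that "comprehensive" corresponds to what is usually called recall (the fraction of all relevant records that were retrieved) and that "high signal-to-noise" corresponds to precision (the fraction of retrieved records that are relevant). The function name and the record identifiers are hypothetical and serve only as illustration.

```python
def retrieval_quality(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved record IDs.

    recall    = relevant records retrieved / all relevant records
    precision = relevant records retrieved / all retrieved records
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical record IDs: "r*" are relevant, "n*" are noise.
relevant = {"r1", "r2", "r3", "r4"}
narrow_query = {"r1", "r2"}                                      # clean but incomplete
broad_query = {"r1", "r2", "r3", "r4", "n1", "n2", "n3", "n4"}   # complete but noisy

narrow_scores = retrieval_quality(narrow_query, relevant)
broad_scores = retrieval_quality(broad_query, relevant)
```

In this toy case the narrow query retrieves only relevant records but misses half of them, while the broad query finds every relevant record at the cost of an equal amount of noise. Achieving both goals at once is the part that typically requires iterative query refinement by technical experts.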
4) Information Extraction is limited because the automated phrase extraction algorithms, required to convert the free text to phrases and frequencies of occurrence as a necessary first step in the text mining process, leave much to be desired. This is especially true for S&T free text, which the computer views as essentially a foreign language due to the extensive use of technical jargon. Both a lexicon and technical experts from many diverse disciplines are required for credible information extraction.
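The conversion of free text to phrases and frequencies of occurrence can be sketched as follows. This is a minimal illustration under stated assumptions, not the phrase extraction algorithm of any particular tool: it counts word n-grams within each sentence and discards any n-gram that begins or ends with a stopword. The stopword list, the function name, and the sample text are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real S&T lexicon
# would be built per technical area with expert input.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "for", "on", "by", "with"}


def extract_phrases(text, max_len=3):
    """Count word n-grams (1..max_len words) that neither start nor end with a stopword."""
    counts = Counter()
    # Split on sentence punctuation so phrases never span two sentences.
    for sentence in re.split(r"[.;:!?]+", text.lower()):
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence)
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tokens[i:i + n]
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return counts


sample = ("Text mining of the technical literature requires phrase extraction. "
          "Phrase extraction converts free text to phrases.")
phrase_counts = extract_phrases(sample)
```

Even this small sketch shows why generic extractors struggle with technical text: without a domain lexicon, the counter cannot tell a meaningful technical phrase from an incidental word sequence, and its tokenizer would split jargon containing symbols it does not expect.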
Poor performance by the automated phrase extraction algorithms can result in:
a) lost candidate query terms for semi-automated information retrieval;
b) lost new concepts for literature-based discovery;
c) generation of incomplete taxonomies for classifying the technical discipline of interest; and
d) incorrect concept clustering.
For clustering in particular, the non-retrieval of critical technical phrases by the phrase extractor will result in artificial cluster fragmentation. Conversely, the retention of non-technical phrases by the phrase extractor will result in the generation of artificial mega-clusters.
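Both clustering failure modes can be reproduced on a toy example. The sketch below is an assumption-laden stand-in for a real clustering technique: it simply links any two documents that share at least one extracted phrase and reports the resulting connected components. The document IDs, the phrases, and the function names are hypothetical.

```python
def cluster_by_shared_phrases(doc_phrases):
    """Group documents into connected components, linking two documents
    whenever their phrase sets overlap (a simplified stand-in for clustering)."""
    ids = list(doc_phrases)
    parent = {d: d for d in ids}

    def find(d):
        # Follow parent links to the component root, compressing the path.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            if doc_phrases[a] & doc_phrases[b]:
                parent[find(a)] = find(b)

    clusters = {}
    for d in ids:
        clusters.setdefault(find(d), []).append(d)
    return sorted(sorted(c) for c in clusters.values())


def drop_phrases(doc_phrases, dropped):
    """Simulate a phrase extractor that fails to return the given phrases."""
    return {d: phrases - dropped for d, phrases in doc_phrases.items()}


# "present study" is a non-technical phrase that appears in every document.
docs = {
    "d1": {"carbon nanotube", "tensile strength", "present study"},
    "d2": {"carbon nanotube", "field emission", "present study"},
    "d3": {"gene expression", "protein folding", "present study"},
    "d4": {"gene expression", "cell culture", "present study"},
}

# Non-technical phrase removed, technical phrases kept: two topical clusters.
good = cluster_by_shared_phrases(drop_phrases(docs, {"present study"}))
# Non-technical phrase retained: every document links to every other.
mega = cluster_by_shared_phrases(docs)
# Critical technical phrase also lost: the nanotube cluster falls apart.
fragmented = cluster_by_shared_phrases(
    drop_phrases(docs, {"present study", "carbon nanotube"})
)
```

With the generic phrase filtered out, the four documents separate into a nanotube pair and a gene expression pair. Retaining the generic phrase merges all four into one mega-cluster, and losing "carbon nanotube" splits the first pair into two single-document fragments.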