This invention relates generally to methods and systems for analyzing data records and, more particularly, to methods and systems for mining text information.
Large volumes of data have become more accessible, particularly with the increased data access capabilities of the Internet. The traditional method of manually reading large volumes of documents to analyze their content and relevance is often too time consuming for most users.
A number of methods have been developed for evaluating text documents. These methods typically require a document to be in electronic form to enable a computerized search and comparison of terms. At a rudimentary level, a user simply inputs query words or phrases into a system, and documents matching the query are returned. However, the user must manually read the returned documents to determine the content or relatedness of the documents. At a more sophisticated level, methods or algorithms have been developed that evaluate the association of words in documents and/or the content of documents, from which comparisons are made and reported to a user.
These methods, however, have been limited to analyzing text from a particular type of document. Furthermore, these text processing techniques do not enable a specific segment of the data to be analyzed either individually or in combination with other segments of the document. For example, patents contain several distinct sections, as do documents that are in table form, for which individual section comparisons may be desired. However, prior text-based analysis methods do not distinguish segments of such documents for comparison.
Text processing for text analysis includes two basic steps. First, the words used are indexed for rapid interactions and retrieval. Second, a vector, or high-dimensional mathematical signature, is created for each record. This vector is used for subsequent clustering and other analyses.
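The two steps above can be illustrated with a minimal Python sketch. It assumes whitespace tokenization, lowercase folding, and raw word counts as the vector values; the function names `build_index` and `build_vectors` are hypothetical and are not taken from any of the systems discussed here.

```python
from collections import Counter, defaultdict


def build_index(records):
    """Step 1: map each word to the set of record ids that contain it."""
    index = defaultdict(set)
    for rec_id, text in enumerate(records):
        for word in text.lower().split():
            index[word].add(rec_id)
    return index


def build_vectors(records, index):
    """Step 2: build one count vector per record, one dimension per indexed word."""
    vocabulary = sorted(index)
    vectors = []
    for text in records:
        counts = Counter(text.lower().split())
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


records = ["text mining of patents", "mining data records"]
index = build_index(records)
vocabulary, vectors = build_vectors(records, index)
```

The resulting vectors all have the same length, so they can be passed directly to a clustering or distance routine.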
Many methods for text analysis rely on some method for feature extraction, that is, determination of which words are best for use in comparing one document against another or for cluster analysis. In one word-based approach, U.S. Pat. No. 5,325,298 discusses the derivation of context vectors (how each word associates, either positively or negatively, with other words in the overall vocabulary) and the subsequent derivation of summary vectors that describe each text document. All the context vectors are the same length and the summary vectors are also constrained to this same length. This method can also use predefined word lists. Common words (stopwords) are predefined and are eliminated as a means for keeping the context vectors small.
Similarly, Larsen and Aone, Fast and effective text mining using linear-time document clustering, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 16-22, 1999, also use word-based methods to define features. They extract a list of unique terms from each document, assign a weight to each of those terms, and represent the documents using the highest-weighted terms. After removing stopwords, they assign the term weights by term frequency and inverse document frequency.
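As a rough illustration of this kind of weighting, the sketch below removes stopwords, weights each remaining term by term frequency times inverse document frequency, and keeps the highest-weighted terms per document. The formula tf * log(N / df), the stopword list, and the name `top_terms` are assumptions chosen for illustration; the cited paper's exact weighting is not reproduced here.

```python
import math
from collections import Counter

# Illustrative stopword list (an assumption, not the list used in the cited work).
STOPWORDS = {"the", "a", "of", "and", "in"}


def top_terms(documents, k=2):
    """Return the k highest-weighted terms of each document under tf * log(N / df)."""
    tokenized = [
        [w for w in doc.lower().split() if w not in STOPWORDS] for doc in documents
    ]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
        # Sort by descending weight; break ties alphabetically for determinism.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        result.append(ranked[:k])
    return result


docs = [
    "the clustering of text documents",
    "the mining of text records",
    "clustering clustering algorithms",
]
features = top_terms(docs, k=2)
```

Terms that occur in every document receive a weight of zero under this formula, which is why widely shared words rarely survive as features.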
In a variation of this approach, the words identified during feature extraction were used in one dimension of a two-dimensional matrix with the other dimension being all words that discriminate document content. Each value in the resulting matrix is the conditional probability that a document will contain both words represented at that position. (See, for example, Wise et al., Visualizing the Non-Visual: Spatial Analysis and Interaction with Information From Text Documents, Proc IEEE Visualization 95, N. Gerson, S. Eick (Eds.), IEEE Computer Society Press, Los Alamitos, Calif., pp. 51-58, 1995; Wise, The Ecological Approach to Text Visualization, JASIS 50:1224-1233.)
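One possible reading of such a matrix is sketched below. It assumes that the entry at row r and column c is the fraction of documents containing row word r that also contain column word c, estimated from simple document counts. This interpretation and the name `conditional_matrix` are assumptions; the cited works may define the probabilities differently.

```python
def conditional_matrix(documents, row_words, col_words):
    """Estimate P(col word in document | row word in document) from document counts."""
    doc_sets = [set(doc.lower().split()) for doc in documents]
    matrix = []
    for r in row_words:
        with_r = [s for s in doc_sets if r in s]
        row = []
        for c in col_words:
            if not with_r:
                row.append(0.0)  # row word never occurs, so no estimate is available
            else:
                row.append(sum(1 for s in with_r if c in s) / len(with_r))
        matrix.append(row)
    return matrix


docs = ["text mining patents", "text clustering", "patents claims"]
m = conditional_matrix(docs, ["text", "patents"], ["mining", "claims", "patents"])
```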
Instead of using the entire vocabulary as the basis for feature extraction, Conrad and Utt, A System For Discovering Relationships By Feature Extraction From Text Database, Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 260-270, 1994, have described a concept-oriented approach. In this case, the method is focused on general features that can be recognized by relatively simple methods, such as people and locations.
Combining these word- and concept-based approaches, Dorre et al., Text Mining: Finding Nuggets in Mountains of Textual Data, Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 398-401, 1999, describe a feature extraction method that is employed in IBM's Intelligent Miner for Text. This approach recognizes and classifies significant vocabulary items using methods for recognition of proper names and pattern matching. The methods automatically classify names, organizations, and places, multiword terms, abbreviations, numerical forms of numbers, and several other features. This information is then used for clustering the text documents as a means for grouping like documents together.
U.S. Pat. No. 5,963,965 to Vogel discusses a method that generates a list of words or phrases, together comprising the lexicon, with particular emphasis on two-word phrases. Each document is compared to the lexicon to generate an index, which can be optionally shortened through the use of a filter word list. The words and phrases are then grouped together to generate clusters based on a predetermined relationship between the phrases.
The above methods are generally single-pass analyses. U.S. Pat. Nos. 5,687,364 and 5,659,766 to Saund et al. have also represented relationships between words and phrases as word clusters and association strength values. However, in this case, a training set of documents is used to iteratively test the correlation between current association values and topical content, with modification of the association strengths as needed.
Word-based methods can also be combined with other approaches. For example, U.S. Pat. No. 6,038,561 to Snyder et al. describes combining multiple methods of analysis, where different representations of each document are used for comparison.
While prior text processing and analysis systems enable the analysis and comparison of text documents, these systems are directed to analyzing and comparing documents as a whole and of a particular type. The systems analyze the text of the document as a whole without analyzing the text as it pertains to distinct columns, cells, sections, or divisions. Thus, there is a need in the art for a text analysis and processing system that enables various divisions or sections of a document to be separately analyzed.
Generally described, methods and systems consistent with the present invention provide a text analysis and processing system that enables various divisions or sections of data records to be separately catalogued, indexed, or vectorized for analysis in a text processing system.
More particularly, a text processing method or system consistent with the present invention receives a plurality of data records, where each data record has a plurality of attribute fields associated with the records. The attribute fields containing textual information are identified. The specific textual content of each attribute field is identified. An index is generated that associates the textual content contained in each attribute field with the attribute field containing the textual content. The index is operable for use in text processing.
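A minimal sketch of such a field-aware index follows. It assumes each data record is a Python dictionary, that an attribute field counts as textual when its value is a string, and that terms are split on whitespace and lowercased. The names `text_fields` and `build_field_index` are hypothetical and are used only for illustration.

```python
from collections import defaultdict


def text_fields(records):
    """Identify the attribute fields that hold text in at least one record."""
    return sorted(
        {name for rec in records for name, value in rec.items() if isinstance(value, str)}
    )


def build_field_index(records):
    """Associate each term with the attribute field that contains it.

    Keys are (field, term) pairs; values are the ids of the records in which
    that term appears in that field.
    """
    fields = text_fields(records)
    index = defaultdict(set)
    for rec_id, rec in enumerate(records):
        for field in fields:
            value = rec.get(field)
            if isinstance(value, str):
                for term in value.lower().split():
                    index[(field, term)].add(rec_id)
    return index


records = [
    {"id": 1, "title": "Text mining system", "abstract": "Indexing of data records"},
    {"id": 2, "title": "Data table viewer", "abstract": "Mining text in cells"},
]
field_index = build_field_index(records)
```

Because the field name is part of each key, a term such as "mining" in a title stays distinct from the same term in an abstract, so the two fields can be searched or compared separately.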
The plurality of data records may be located in a data table and the textual information may be contained within cells of the data table. The textual information is indexed in a manner that enables the textual information contained within different attribute fields to be compared. A vector may be generated that differentiates the content of data records based on textual content contained in the attribute fields. If desired, only a selected number of the attribute fields containing textual information are used to generate the vector. A user-selectable command may be received for generating the index with the textual information indexed either in a case-sensitive manner or in a case-insensitive manner.
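The following sketch illustrates how a vector might be generated from only selected attribute fields, with a flag standing in for the case-sensitivity command. It assumes table rows are dictionaries whose selected fields hold strings, and it uses raw term counts as vector values. The function `field_vectors` and its parameters are hypothetical.

```python
from collections import Counter


def field_vectors(records, selected_fields, case_sensitive=False):
    """Build one count vector per record using only the selected text fields."""

    def terms(rec):
        out = []
        for field in selected_fields:
            text = rec.get(field, "")
            if not case_sensitive:
                text = text.lower()  # fold case so "Apple" and "apple" match
            out.extend((field, term) for term in text.split())
        return out

    per_record = [Counter(terms(rec)) for rec in records]
    # One dimension per distinct (field, term) pair across all records.
    vocabulary = sorted(set().union(*per_record))
    vectors = [[counts[key] for key in vocabulary] for counts in per_record]
    return vocabulary, vectors


rows = [
    {"title": "Apple pie", "notes": "apple orchard"},
    {"title": "apple Pie", "notes": "Orchard tour"},
]
vocab_ci, vectors_ci = field_vectors(rows, ["title"])
vocab_cs, vectors_cs = field_vectors(rows, ["title"], case_sensitive=True)
```

With case folding the two titles produce identical vectors, while the case-sensitive setting separates them; in both cases the "notes" column is ignored because it was not selected.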
In another aspect consistent with the present invention, a plurality of data records is received, where at least some of the data records contain text terms. A first method is applied to weight text terms of the data records in a first manner to aid in distinguishing records from each other in response to selection of the first method. A second method is applied to weight text terms of the data records in a second manner to aid in distinguishing records from each other in response to selection of the second method. A vector is generated to distinguish each of the data records based on the text terms weighted by the first or second method that was selected.
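The selection between two weighting methods can be sketched as follows. The two methods shown, raw term frequency and term frequency times inverse document frequency, are stand-ins chosen for illustration and are not the topicality calculations of the invention. The names `weight_tf`, `weight_tfidf`, `WEIGHTING_METHODS`, and `weighted_vectors` are hypothetical.

```python
import math
from collections import Counter


def weight_tf(term, counts, doc_freq, n_records):
    """Stand-in first method: raw term frequency within the record."""
    return float(counts[term])


def weight_tfidf(term, counts, doc_freq, n_records):
    """Stand-in second method: term frequency times log(N / document frequency)."""
    return counts[term] * math.log(n_records / doc_freq[term])


WEIGHTING_METHODS = {"first": weight_tf, "second": weight_tfidf}


def weighted_vectors(records, method):
    """Generate one vector per record using whichever weighting method was selected."""
    weigh = WEIGHTING_METHODS[method]
    tokenized = [rec.lower().split() for rec in records]
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            [
                weigh(t, counts, doc_freq, len(records)) if t in counts else 0.0
                for t in vocabulary
            ]
        )
    return vocabulary, vectors


records = ["text text mining", "text records"]
vocab, v_first = weighted_vectors(records, "first")
_, v_second = weighted_vectors(records, "second")
```

Keeping the methods in a lookup table means a further weighting scheme could be added without changing the vector-generation code.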
The weighting may be based only on text terms corresponding to selected criteria. In the case of a data table, the selected criteria may be based on columns selected from the data table. The first and second methods may be topicality calculation methods.