The present invention relates to a data display method and a data display apparatus in which various data is acquired, from a data base of documents beforehand registered thereto, for a set of specified documents and the acquired data is displayed.
With the recent development of word processors, personal computers, and the like, the amount of electronic information generated by such word processors and personal computers is increasing. Moreover, the amount of electronic information available via the World Wide Web (WWW), e-mail, newswire, and the like is rapidly increasing. In firms and companies, it is quite important to analyze the contents of such electronic information for efficient use thereof.
In general, most electronic information is described in texts, that is, in a format of statements. The text information, for example, the contents of a questionnaire of free answer type, cannot be easily analyzed by computers or the like and hence has heretofore been analyzed by human power. However, the information analysis by human power is attended with problems as follows. (1) The person in charge of analysis must read all documents for the processing. Therefore, when the amount of documents is largely increased, this method is not practical. (2) The information analysis is carried out according to subjective judgement of the user. Therefore, the results of information analysis vary depending on the knowledge and skill of the user.
Therefore, an increasing need exists for a text mining technique as a technique to support the information analysis by human power. Agrawal et al., U.S. Pat. No. 6,006,223, entitled “Mapping Words, Phrases Using Sequential-Pattern To Find User Specific Trends In a Text Database” and issued on Dec. 21, 1999, concretely describes a processing procedure of text mining. This will be referred to as prior art 1 herebelow. In the text mining, a search or retrieval is made through text information registered beforehand to detect new knowledge according to, for example, the cooccurrence of words and phrases or a tendency of occurrence of words and phrases contained in the information to be processed. Specifically, for a set of processing objective documents, an analysis axis representing points of view for analysis is set to acquire words and phrases representing features or characteristics of a set of documents according to a correspondence to constituent components of the analysis axis. In this expression, “to acquire words and phrases according to a correspondence to constituent components of the analysis axis” means, for example, “to acquire words and phrases which cooccur in a predetermined range with constituent components of the analysis axis.” By referring to the words and phrases, the user can recognize a tendency of a set of documents. FIG. 2 shows an example of analysis in which a set of news items of “0157” in newspapers is analyzed using “the month of report or publication of the pertinent news item” as the analysis axis. That is, the analysis condition is expressed as “news item reported in ‘July’”, “news item reported in ‘August’”, and the like.
In the analysis using the publication month as the analysis axis, words “infection, patient, symptom, hospitalization, etc.” are acquired in association with “July” as a component of the analysis axis; words “damage, provision of meals, hospitalization, group infection, etc.” are acquired in association with “August” as a component of the analysis axis; words “sales amount, minus, foods, perishable, etc.” are acquired in association with “September” as a component of the analysis axis; and so on. By referring to the words, the user can obtain a tendency that the set of documents contains topics: “Patients infected with ‘0157 disease-causing bacteria’ are hospitalized” in “July”, “Group infection with ‘0157 bacteria’ through provision of meals” in “August”, and “Sales amount of perishable foods and the like lowered due to influence of 0157” in “September”.
FIG. 3 shows an example of a processing procedure of prior art 1 in a problem analysis diagram (PAD). In step 300, a set of documents is specified as an object of the text mining. In a case of a questionnaire in which a pertinent document database contains documents collected according to predetermined points of view, the database is directly specified as an objective document set. In a case of items of newspapers in which the database contains documents gathered according to various points of view such as politics, economy, sports, and the like, a full text search is conducted according to an analysis purpose of the user to specify a set of documents. “A full text search” is a technique in which all texts of the documents as the processing objects are inputted to a pertinent computer system to thereby generate a database in a registration stage. In a retrieval stage, in response to a character string specified by the user, all documents containing the character string are retrieved from the database. For example, Kato et al., U.S. Pat. No. 6,094,647, entitled “Presearch Type Document Search Method and Apparatus” and assigned to the present assignee, describes the full text search in detail. This technique will be referred to as prior art 2 herebelow. In step 301, characteristic words and phrases, namely, words and phrases which characterize the contents, are extracted from the set of documents specified in step 300. The characteristic words and phrases may be extracted by referring to a dictionary or by using statistical information. The characteristic words and phrases are not limited to single words. For example, when the dictionary contains a complex word including two or more words, for example, “disease-causing colon bacillus”, the characteristic words and phrases extracted in step 301 may include two or more words. Conversely, the characteristic words and phrases to be extracted may be limited to single words.
In step 302, an analysis axis is set as points of view for the analysis. In this example, “date”, “age”, “sex”, and the like assigned as bibliographical information items of a document are specified as the analysis axis, or words and phrases specified by the user are set as constituent components of the analysis axis. For example, when it is desired to acquire a difference of awareness or consciousness by age from a questionnaire, the age is set as the analysis axis. In this situation, values representing ages such as “20” and “30” are specified as components of the analysis axis. Finally, in step 303, processing of step 304 is repeatedly executed for the components of the analysis axis set in step 302. In step 304, a search is made through the characteristic words and phrases extracted in step 301 to extract words and phrases strongly related to the components of the analysis axis, for example, a cooccurrence word/phrase which cooccurs in a predetermined range. The predetermined range is specified as, for example, “within one document”, “within one paragraph”, “within one sentence” or “within m or n words (m and n are integers).” In prior art 1, words and phrases are obtained by establishing correspondence to the components of the analysis axis to thereby help the user recognize a tendency of the set of documents. As above, since the words and phrases characterizing the pertinent set of documents are automatically obtained by establishing correspondence to the components of the analysis axis in prior art 1, the load imposed on the user can be reduced and the difference in the analysis results between users can be minimized.
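The loop of steps 303 and 304 can be sketched roughly as follows. This is only an illustrative sketch and not the procedure of prior art 1 itself: the function name `count_cooccurrences`, the dictionary layout of a document, and the sample data are hypothetical, the predetermined range is assumed to be "within one document", and a term is assumed to occur when it appears as a substring of the text.

```python
from collections import Counter


def count_cooccurrences(documents, axis_field, characteristic_terms):
    """Count, per analysis-axis component, the documents containing each term.

    Assumption: the "predetermined range" is one whole document, and a term
    "occurs" when it appears as a substring of the document text.
    """
    counts = {}
    for doc in documents:
        component = doc[axis_field]  # e.g. the publication month
        per_component = counts.setdefault(component, Counter())
        for term in characteristic_terms:
            if term in doc["text"]:
                per_component[term] += 1
    return counts


# Hypothetical sample data, loosely modeled on the news-item example.
sample_docs = [
    {"month": "July", "text": "patient hospitalized after infection"},
    {"month": "July", "text": "infection with severe symptom"},
    {"month": "August", "text": "group infection through provision of meals"},
]
sample_terms = ["infection", "patient", "symptom", "group infection"]

cooccurrence_counts = count_cooccurrences(sample_docs, "month", sample_terms)
for month, counter in cooccurrence_counts.items():
    print(month, dict(counter))
```

Ranking each component's terms by these counts would give a per-component word list of the kind shown for "July" and "August"; a narrower range such as one sentence would require splitting the text before counting.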
According to prior art 1, the words and phrases characterizing the pertinent set of documents are automatically obtained by establishing correspondence to the components of the analysis axis. Therefore, it is possible to reduce the load imposed on the user described above, and the fluctuation or dispersion of the analysis results arising from the respective knowledge and skill of users can be minimized.
However, prior art 1 is attended with a problem as below. As can be seen from an analysis example of FIG. 4, when the words and phrases with a high frequency of cooccurrence with each component of the analysis axis are simply extracted from the set of documents, the same words and phrases italicized in FIG. 4 such as “disease-causing colon bacillus”, “food poisoning”, “infection” and “group” are extracted for every component. That is, cooccurrence words and phrases such as “patient” and “symptom” of “July” and “inspection” and “foods” of “August”, which rarely appear for other components of the analysis axis, are ignored. It is therefore not possible to appropriately present to the user the differences in meaning between the components of the analysis axis.
It is therefore an object of the present invention to provide a data display method and a data display apparatus in which the user can suitably analyze the contents of a plurality of documents.
According to one aspect of the present invention, a frequency of appearances of a plurality of words and phrases in a document satisfying each analysis condition is calculated and the words and phrases are displayed according to a result of the calculation.
Another object of the present invention is to provide a document processing system which supports a text mining function to clarify similar points and different points of words and phrases cooccurring, or occurring together, with each component of an analysis axis so that the user can appropriately analyze a tendency of a set of the documents.
To achieve the objects according to one aspect of the present invention, there is provided a text mining method including a characteristic words and phrases extraction step of collecting, from a set of documents beforehand registered, all of or part of the documents into a set of processing objective documents and of extracting therefrom words and phrases characteristically appearing therein, a mining scheme creation step of setting definition information or a mining scheme containing components specified, a cooccurrence words and phrases acquisition step of acquiring, from the words and phrases extracted by the characteristic words and phrases extraction step, cooccurrence words and phrases cooccurring in a predetermined range with each component contained in the mining scheme, and a multiple cooccurrence words and phrases extraction step of comparing cooccurrence words and phrases between the elements or components contained in the mining scheme, of acquiring, as multiple cooccurrence words and phrases, cooccurrence words and phrases related to many components contained in the mining scheme, and creating component-cooccurrence words and phrases by removing the multiple cooccurrence words and phrases from the cooccurrence words and phrases of the respective components.
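The multiple cooccurrence words and phrases extraction step can be illustrated with a minimal sketch. It is not the claimed implementation: the function name `split_multiple_cooccurrence` and the parameter `min_components` are hypothetical, and "related to many components" is assumed here to mean "appearing for at least `min_components` components", defaulting to all of them. The sample lists reuse the words of the FIG. 4 discussion.

```python
from collections import Counter


def split_multiple_cooccurrence(cooccurrence, min_components=None):
    """Separate shared cooccurrence terms from component-specific ones.

    cooccurrence maps each mining-scheme component to its list of
    cooccurrence words and phrases.
    Assumption: a term is "multiple" when it appears for at least
    min_components components (default: every component).
    """
    if min_components is None:
        min_components = len(cooccurrence)

    # Count in how many components each term appears.
    component_count = Counter()
    for terms in cooccurrence.values():
        component_count.update(set(terms))

    multiple = {t for t, n in component_count.items() if n >= min_components}

    # Remove the shared terms, keeping the original order of the rest.
    specific = {
        component: [t for t in terms if t not in multiple]
        for component, terms in cooccurrence.items()
    }
    return multiple, specific


# Hypothetical input modeled on the words discussed for FIG. 4.
shared = ["disease-causing colon bacillus", "food poisoning", "infection", "group"]
sample = {
    "July": shared + ["patient", "symptom"],
    "August": shared + ["inspection", "foods"],
}

multiple_terms, component_terms = split_multiple_cooccurrence(sample)
print(sorted(multiple_terms))
print(component_terms)
```

With this input, the four shared terms are reported once as multiple cooccurrence words and phrases, while "patient" and "symptom" remain for "July" and "inspection" and "foods" remain for "August". Lowering `min_components` would treat terms shared by only some components as multiple as well.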