The present invention is intended for a data base of registered documents, and relates to a document processing technique for acquiring various kinds of information concerning a specified document set.
With the spread of word processors and personal computers in recent years, the volume of computerized information generated by them is increasing. Furthermore, computerized information available from the WWW (World Wide Web), electronic mail, electronic news, and so on is also rapidly increasing. Therefore, analyzing the contents of this computerized information and making effective use of it has become an important problem for enterprises.
In general, a large quantity of computerized information is written as text, i.e., as free-form prose. Text information such as free-answer questionnaires is difficult to analyze mechanically, and has consequently been analyzed by hand. This manual analysis has the following problems.
(1) It is necessary to read all documents to be processed. As the number of documents increases, human analysis becomes impractical.
(2) Since an analysis is made on the basis of subjective judgment, the result differs depending on the knowledge of the analyst and the degree of skill.
As a technique for supporting such human analysis, the need for text mining is growing. The processing procedure of text mining is described concretely in "Text mining - Knowledge finding by automatic analysis of massive document data -", Nasugawa et al., Journal of Information Processing Society of Japan, Vol. 40, No. 4, April 1999, pp. 358-364, and "Text mining based on keyword association", Watanabe et al., Information Processing Society of Japan, Meeting of Information Learning Foundation 55-8, Jul. 16, 1999, pp. 57-64. Hereafter, this is referred to as related art 1. Text mining is intended for text information registered beforehand, and finds new knowledge on the basis of coincidence (co-occurrence) relations and emergence tendencies of words and/or phrases included in the information to be processed. Concretely, for a set of documents to be processed, an axis serving as a viewpoint for the analysis is set, and words and/or phrases representing a feature of the document set are acquired in association with components of the axis. Here, "words and/or phrases are acquired in association with components of the axis" means "words and/or phrases coincident with components of the axis in a predetermined range are acquired". By referring to the words and/or phrases, the user can grasp the tendency of the document set.
FIG. 2 shows an example in which a set of newspaper articles concerning "pathogenic colon bacilli O157" is analyzed by taking the publication month as the axis. With the publication month as the axis, the words and/or phrases "infection, patient, symptoms, hospitalization, ..." are acquired in association with "July", which is a component of the axis. The words and/or phrases "shock, school lunch, hospitalization, mass infection, ..." are acquired in association with "August". The words and/or phrases "sales, minus, foodstuffs, perishables, ..." are acquired in association with "September". By referring to these words and/or phrases, the user can grasp that the topic "patients infected with O157 are hospitalized" exists in the document set in "July", that the topic "mass infection with O157 is caused by school lunch" exists in "August", and that the topic "the sales of perishables have fallen under the influence of O157" exists in "September".
The PAD (Problem Analysis Diagram) of FIG. 3 shows the processing procedure of the related art 1. First, at step 300, a document set which becomes the processing subject of text mining is defined.
In the case of a database of documents collected beforehand from a single viewpoint, such as questionnaires, the database is used as is as the document set to be processed. In the case of a database of documents covering diverse viewpoints such as politics, economy, and sports, as with newspaper articles, a full text search is conducted according to the user's analysis objective, and the search result defines the document set. Full text search is a technique in which the full text of the documents to be processed is input to a computer system at registration time to form a database, and at retrieval time the database is searched for all documents including a character string specified by the user. Full text search is described in detail in "Present situation and future of index processing fast full text search technique which holds the key", Majima, Nikkei byte, October 1996, pp. 158-167. Hereafter, this is referred to as related art 2. Subsequently, at step 301, words and/or phrases distinctive of the contents (hereafter referred to as distinctive words and/or phrases) are extracted from the document set defined at step 300. The distinctive words and/or phrases may be extracted by referring to a dictionary, or may be extracted by using statistical information. At step 302, an axis serving as a viewpoint for the analysis is set. Here, date, age, sex, or the like provided as bibliographic information of the documents is set as an analysis axis, and specified words and/or phrases are set as components of the analysis axis. For example, when it is desired to learn from questionnaires how attitudes differ with age, the age is set as the analysis axis. In this case, numerical values representing the age, such as "20" and "30", become components of the analysis axis.
Finally, at step 303, words and/or phrases coincident with a component of the axis in a predetermined range are acquired. As the predetermined range, the same document, the same paragraph, the same sentence, m words, n characters (where m and n are integers), or the like can be used. As described above, the related art 1 supports the user in grasping the tendency of the document set by acquiring words and/or phrases in association with the components of the analysis axis.
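Steps 300 to 303 can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of the cited references: the toy collection `DOCS`, the function names, whitespace tokenization, the document-frequency criterion with the threshold `min_share`, and the choice of "the same document" as the predetermined range are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy collection; "month" stands in for bibliographic information.
DOCS = [
    {"month": "July", "text": "O157 infection patient hospitalization"},
    {"month": "July", "text": "O157 patient symptoms"},
    {"month": "August", "text": "O157 mass infection school lunch"},
    {"month": "September", "text": "O157 sales perishables minus"},
    {"month": "September", "text": "election sales tax"},
    {"month": "July", "text": "baseball game result"},
]

def full_text_search(docs, query):
    """Step 300: the document set is every document containing the query string."""
    return [d for d in docs if query in d["text"]]

def document_frequency(docs):
    """Count, for each word, the number of documents in which it appears."""
    df = Counter()
    for d in docs:
        df.update(set(d["text"].split()))
    return df

def extract_distinctive_words(target, all_docs, min_share=0.6):
    """Step 301 (assumed statistical criterion): keep a word when at least
    min_share of the documents containing it belong to the target set."""
    df_target = document_frequency(target)
    df_all = document_frequency(all_docs)
    return {w for w, n in df_target.items() if n / df_all[w] >= min_share}

def acquire_by_axis(target, distinctive, axis, components):
    """Steps 302-303: for each component of the axis, count the distinctive
    words found in the same document (the assumed predetermined range)."""
    result = defaultdict(Counter)
    for d in target:
        component = d[axis]
        if component in components:
            result[component].update(set(d["text"].split()) & distinctive)
    return result

target = full_text_search(DOCS, "O157")
distinctive = extract_distinctive_words(target, DOCS)
by_month = acquire_by_axis(target, distinctive, "month",
                           ["July", "August", "September"])
for month, words in by_month.items():
    print(month, sorted(words))
```

In this sketch the word "sales" is not treated as distinctive, because half of the documents containing it lie outside the searched set; a dictionary-based extraction, which the text also allows, would behave differently.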
In the related art 1 described above, words and/or phrases distinctive of the document set to be processed are automatically acquired in association with components of the analysis axis. It therefore becomes possible to lighten the burden on the analyst and to reduce differences in analysis results caused by differences in the analysts' knowledge and degree of skill.
However, the related art 1 has the problems described below. As shown in FIG. 3, in the related art 1 an analysis is made on the basis of only the coincidence relations with individual components of a single analysis axis. When it is desired to analyze coincidence relations from a plurality of different viewpoints, i.e., combinations of a plurality of analysis axes, text mining must be conducted for each analysis axis, and the user must combine the results and analyze them. Moreover, the user begins an analysis without knowing the contents of the document set, so it is difficult to settle on one viewpoint from the beginning. Because of these problems, the related art 1 cannot support an analysis over combinations of a wide variety of viewpoints.
An object of the present invention is to provide a document processing method and system, and a computer readable storage medium, which provide a text mining function allowing the user to analyze the contents of a document set from a plurality of viewpoints and which thereby facilitate analyzing the tendency of the document set.
In order to solve the above described problems, the present invention provides the following processing steps.
A text mining method includes: a distinctive word and/or phrase extraction step of extracting words and/or phrases characteristically emerging in a processing subject document set obtained by taking out the whole or a part of a set of documents registered beforehand; a definition information setting step of setting definition information (such as information defining components of an analysis axis) including a specified word or phrase or specified bibliographic information; and a coincident word and/or phrase acquisition step of acquiring, from among the words and/or phrases extracted at the distinctive word and/or phrase extraction step, coincident words and/or phrases that are coincident in a predetermined range with a word or phrase or bibliographic information included in the definition information. A plurality of pieces of definition information are included. Furthermore, the coincident word and/or phrase acquisition step includes an analysis history storage step of storing, as analysis history, the coincident words and/or phrases together with the word or phrase or bibliographic information included in the definition information with which they are coincident in the predetermined range. Furthermore, the text mining method includes a multiplex coincident word and/or phrase acquisition step of acquiring coincident words and/or phrases that are coincident in a predetermined range with an individual word or phrase or bibliographic information acquired from each of a plurality of different pieces of definition information. The multiplex coincident word and/or phrase acquisition step stores, as the analysis history, the coincident words and/or phrases together with the individual word or phrase or bibliographic information, acquired from each of the plurality of pieces of definition information, with which they are coincident in the predetermined range. In addition, a definition information addition step and/or a definition information alteration step is included.
The definition information addition step adds definition information including a specified word or phrase or specified bibliographic information. In addition, the definition information addition step extracts, from the analysis history, the coincident words and/or phrases that were obtained, before the addition of the definition information, as coincident with an individual word or phrase or bibliographic information acquired from each of a plurality of different pieces of definition information, and puts forward the extracted coincident words and/or phrases as candidates for the coincident words and/or phrases of the multiplex coincident word and/or phrase acquisition step. The definition information alteration step alters a word or phrase or bibliographic information included in specified definition information to a specified word or phrase or specified bibliographic information. In addition, the definition information alteration step extracts, from the analysis history, the coincident words and/or phrases that were obtained, before the addition of the altered definition information, as coincident with an individual word or phrase or bibliographic information acquired from each of a plurality of different pieces of definition information, and puts forward the extracted coincident words and/or phrases as candidates for the coincident words and/or phrases of the multiplex coincident word and/or phrase acquisition step.
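One possible reading of the multiplex acquisition step, the analysis history, and the definition information addition step is sketched below. It is an illustration under assumptions, not the claimed implementation: the data layout, the names `components_in` and `multiplex_acquire`, the use of "the same document" as the predetermined range, and the rule that the history entry for a combination minus its last definition supplies the candidates are all hypothetical.

```python
from itertools import product

# Hypothetical documents: bibliographic information plus whitespace-separated text.
DOCS2 = [
    {"biblio": {"month": "July"}, "text": "O157 infection patient hospitalization"},
    {"biblio": {"month": "July"}, "text": "O157 patient symptoms"},
    {"biblio": {"month": "August"}, "text": "O157 mass infection school lunch"},
    {"biblio": {"month": "August"}, "text": "O157 school lunch shock"},
    {"biblio": {"month": "September"}, "text": "O157 sales perishables minus"},
]
DISTINCTIVE = {"infection", "patient", "hospitalization", "symptoms", "mass",
               "school", "lunch", "shock", "sales", "perishables", "minus"}

# Two pieces of definition information (analysis axes), both hypothetical.
MONTH = {"name": "month", "kind": "biblio", "field": "month",
         "values": ["July", "August"]}
TOPIC = {"name": "topic", "kind": "word", "values": ["infection", "school"]}

def components_in(doc, definition):
    """Components of one definition that occur in the document."""
    if definition["kind"] == "biblio":
        value = doc["biblio"].get(definition["field"])
        return {value} if value in definition["values"] else set()
    return set(doc["text"].split()) & set(definition["values"])

def multiplex_acquire(docs, distinctive, definitions, history):
    """Acquire words coincident with one component from every definition,
    and store each result in the analysis history."""
    axes = [[(d["name"], v) for v in d["values"]] for d in definitions]
    result = {}
    for combo in product(*axes):
        # Assumed candidate rule: reuse the words already stored for the
        # same combination without its last definition, if present.
        prior = history.get(frozenset(combo[:-1]))
        candidates = distinctive if prior is None else prior
        found = set()
        for doc in docs:
            if all(v in components_in(doc, d)
                   for d, (_, v) in zip(definitions, combo)):
                found |= candidates & set(doc["text"].split())
        found -= {v for _, v in combo}  # drop the axis components themselves
        history[frozenset(combo)] = found
        result[frozenset(combo)] = found
    return result

history = {}
single = multiplex_acquire(DOCS2, DISTINCTIVE, [MONTH], history)
# Definition information addition: add TOPIC and reuse the stored history.
double = multiplex_acquire(DOCS2, DISTINCTIVE, [MONTH, TOPIC], history)
print(sorted(double[frozenset({("month", "August"), ("topic", "school")})]))
```

Narrowing the candidates this way is safe under the same-document assumption, because any document containing all components of the larger combination also contains those of the smaller one. If an altered definition is not the last one, no matching history entry exists and the sketch falls back to the full set of distinctive words.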
In order to achieve the above described object, programs implementing the above described functions, or a recording medium storing such programs, may also be used.