Against the backdrop of falling prices and the widespread adoption of information processing systems, the spread of document creation tools such as word processors, and the recent progress of network environments including the Internet, a huge amount of electronic data is being accumulated. For example, all kinds of information, such as in-house documents including sales reports and records of conversations with customers in call centers, are being accumulated as electronic data in information processing systems.
In general, such information is accumulated in order to extract useful knowledge that can be used in corporate activities, business operations, and the like. Examples of such knowledge include product sales trends, customer trends, complaints and requests about quality or the like, and the early detection of faults. To obtain such useful knowledge from raw information, the raw information needs to be analyzed from some viewpoint. If the raw information has been labeled in advance with classifications or the like, analysis is relatively easy. However, the knowledge that can be obtained from documents classified by item according to a preconceived viewpoint does not go much beyond that viewpoint. That is, new knowledge that could not be conceived in advance is often extracted from unclassified free-form descriptions. Therefore, what is needed is a method for analyzing raw information recorded as free-form documents from an open-ended viewpoint, for example, what the topic of a document is, how the topic trends over time, or the like.
One such method is text mining, in which a large amount of text data is processed and analyzed. For example, an analysis tool that can take as its analysis object the wide variety of contents described in a large amount of document data, and that uses a text mining technique to extract and present the correlations and appearance trends of those contents, is described in Tetsuya Nasukawa, Masayuki Morohashi, and Tohru Nagano, “Text Mining—Discovering Knowledge in Vast Amounts of Textual Data—,” Magazine of Information Processing Society of Japan, Vol. 40, No. 4, pp. 358-364 (1999). By automatically analyzing a huge number of raw documents, such a method (tool) makes it possible for a person to discover useful knowledge without reading all of the raw documents.
Text mining focuses on what meaning (positive or negative, or a question or a request) is given to a concept (topic) described in a document, or to a given topic (concept). Accordingly, it is necessary to extract not a word as written in the document but the appropriate concept, and to perform analysis for each concept. That is, it is necessary not merely to process the words written in the document automatically but also to grasp appropriately the concept meant by each word.
When such a concept is extracted from a written word, the handling of synonyms and homographs of the word becomes a problem. Specifically, where the concept meant by a word written in one notation is also written in other notations, the group of words meaning the same concept must be handled as synonyms. If synonymous words written in different notations are regarded as different words, the frequency of the concept they mean is not counted correctly, and the document may not be analyzed properly. Moreover, there are cases where words written in the same notation mean different concepts depending on the field or situation in which they are used. For example, the word “driver” means software for running a device in a computer-related context, but means a person who drives in an automobile-related context. When words written in the same notation mean different concepts, unless they are accurately distinguished, the correct frequencies of the concepts are not counted, as in the case above, which makes correct analysis of the document difficult.
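The effect of differing notations on concept frequencies can be illustrated with a minimal sketch. The synonym table, the token list, and the function name `count_concepts` below are hypothetical and chosen only for illustration; they are not taken from any existing dictionary or tool.

```python
from collections import Counter

# Hypothetical synonym table: surface notation -> canonical concept label.
SYNONYMS = {
    "pc": "computer",
    "personal computer": "computer",
    "computer": "computer",
}


def count_concepts(tokens, synonym_table=None):
    """Count tokens, optionally mapping each notation to its canonical concept."""
    table = synonym_table or {}
    return Counter(table.get(tok.lower(), tok.lower()) for tok in tokens)


# Invented token list in which one concept appears under three notations.
tokens = ["PC", "computer", "personal computer", "printer", "pc"]

raw_counts = count_concepts(tokens)                 # notations counted separately
unified_counts = count_concepts(tokens, SYNONYMS)   # notations unified per concept
```

Without the table, the concept is split across three separate entries; with it, all four mentions are attributed to the single concept "computer".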
Accordingly, for the problem of synonyms, words have conventionally been unified into the same representation using a synonym table or an existing thesaurus such as the EDR dictionary. The EDR dictionary is a dictionary of 200 thousand words for each of Japanese and English, which contains a word dictionary, a cooccurrence dictionary, and a concept dictionary, and is described at, for example, http://www.iijnet.or.jp/edr/J_index.html. On the other hand, the problem of homographs can be solved by adding the differences in meaning to words as comments. However, this method is very costly when a large number of documents are processed and therefore has low feasibility. Accordingly, where documents in a fixed field are analyzed, this problem is solved by assigning to each homograph the meaning appropriate for the field and treating the homograph as synonymous with the word having that meaning. To achieve this, creating a dictionary for each field is essential.
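The idea of a per-field dictionary for homographs can be sketched as follows. The field names, the dictionary contents, and the function `resolve_concept` are hypothetical; the sketch only assumes that the field of the documents under analysis is fixed and known in advance.

```python
# Hypothetical per-field dictionaries: notation -> concept label within that field.
FIELD_DICTIONARIES = {
    "computer": {"driver": "device driver software"},
    "automobile": {"driver": "driving person"},
}


def resolve_concept(word, field):
    """Return the concept assigned to `word` in `field`, or the word itself if unlisted."""
    return FIELD_DICTIONARIES.get(field, {}).get(word.lower(), word.lower())


computer_sense = resolve_concept("driver", "computer")
automobile_sense = resolve_concept("Driver", "automobile")
unlisted = resolve_concept("printer", "computer")
```

Because the field is chosen once for the whole document set, no per-occurrence annotation is needed, but each field requires its own dictionary.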
Incidentally, the following studies are known with regard to methods for extracting synonyms from a corpus (a large amount of document data). For example, a study that finds degrees of relatedness between nouns using cooccurrence data of verbs and nouns, such as subjects and objects, is described in Donald Hindle, “Noun Classification From Predicate-Argument Structures,” Proc. 28th Annual Meeting of ACL, pp. 268-275 (1990). This study can be applied to a method for extracting, as synonyms, nouns having high degrees of relatedness with a target noun. Moreover, a study that finds degrees of relatedness between nouns using not cooccurrence relations but dependency relations with verbs and adjectives, in order to check the relative abstraction levels of the nouns, is described in Tomek Strzalkowski and Barbara Vauthey, “Information Retrieval Using Robust Natural Language Processing,” Proc. 30th Annual Meeting of ACL, pp. 104-111 (1992). Furthermore, a study that extracts changeable relations between words using grammar information in a corpus is described in Naohiko Uramoto, “Improving the ratio of applying cases using changeable relations to the elimination of ambiguities of sentences,” Journal of the Japanese Society for Artificial Intelligence, Vol. 10, No. 2, pp. 242-249 (1995). These studies can also be used for checking degrees of relatedness between nouns.
For synonyms and homographs, which become problems in adopting a text mining technique, the solutions described above are available, at least provisionally. However, the present inventors recognize that there is another problem, namely, differences in notation due to abbreviations, misspellings, and the like.
In general, most text data used in text mining, for example, in-house documents and records of questions received in a call center, is created by a plurality of persons. In such documents, word notations are not unified, and abbreviations and the like tend to be used frequently because the documents are relatively informal. For example, in a call center, the word “customer” is frequently used. Depending on the person recording it, this word is sometimes written as “cus” or “cust.” Abbreviations can hardly be expected to be included in dictionaries. Accordingly, if synonyms are generated using an existing dictionary, all such abbreviations are handled as unknown words. If abbreviations are handled as unknown words, they are treated not as words carrying their intrinsic meanings but as other, unrelated words. The abbreviations are also not counted in the frequencies of the original words, and are discarded as noise because each abbreviation occurs only a small number of times. Moreover, in such in-house documents and the like, words are often misspelled when they are entered into computers. In particular, in records in a call center or the like, typos occur often because documents need to be created within a limited time. Words containing such misspellings are also handled as meaningless noise, in the same way as the abbreviations above.
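How abbreviations and misspellings are lost when only an existing dictionary is used can be sketched as follows. The dictionary contents, the noise threshold `MIN_FREQUENCY`, and the token counts are invented for illustration and do not describe any particular text mining tool.

```python
from collections import Counter

# Hypothetical dictionary and noise threshold, chosen only for illustration.
DICTIONARY = {"customer", "refund"}
MIN_FREQUENCY = 3


def dictionary_based_counts(tokens):
    """Keep known words; drop unknown words seen fewer than MIN_FREQUENCY times."""
    counts = Counter(tok.lower() for tok in tokens)
    return {w: c for w, c in counts.items() if w in DICTIONARY or c >= MIN_FREQUENCY}


# Nine mentions of the same concept under four notations, plus two of "refund".
tokens = ["customer"] * 5 + ["cust"] * 2 + ["cus"] + ["custmer"] + ["refund"] * 2

kept = dictionary_based_counts(tokens)
```

In this toy input the concept is mentioned nine times, but only the five occurrences written as "customer" survive; the abbreviations "cust" and "cus" and the misspelling "custmer" fall below the threshold and are discarded as noise.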
However, the more frequently a word is used, the higher the possibility that it is abbreviated. At the same time, concepts related to such words are often important precisely because the words appear frequently. Moreover, in general, documents created in departments that deal directly with customers are highly likely to contain misspellings because the time available for creating them is limited, as in the example of the call center. Yet such documents are also highly likely to record useful customer information and to contain knowledge important to companies. That is, there is great significance in handling words not listed in dictionaries, such as abbreviated and misspelled words, as meaningful data. Note that the case where a double-byte character of Japanese, Chinese, Korean, or the like is misconverted by a front-end processor (FEP) is handled in the same way as a misspelling.
Accordingly, it is necessary to create a dictionary that takes abbreviations, misspellings (including misconversions), and the like into consideration. Since an existing dictionary does not cover all abbreviations and misspellings, the dictionary necessary for text mining must be created by humans. This task is very costly and is the part that users are most concerned about in the actual operation of text mining. Therefore, a support system for creating a dictionary, which automatically generates synonyms for creating a thesaurus, is needed.
As a method for automatically generating synonyms, the aforementioned studies of Literatures 2 to 4 can be utilized. Specifically, degrees of relatedness between nouns are found by the methods of those studies, and nouns whose degrees of relatedness are high and fall within a predetermined range are set as synonyms. However, these methods have the problem that antonyms are acquired in addition to synonyms. That is, if a conventional method is adopted as it is, a large amount of noise, including antonyms and other words, is acquired, and removing the noise by hand becomes laborious.
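Why relatedness-based extraction also picks up antonyms can be illustrated with a minimal sketch. Cosine similarity over verb cooccurrence counts is used here only as a simple stand-in for a degree of relatedness; it is not claimed to be the measure used in the cited studies, and the counts, the threshold, and the function names are invented for illustration.

```python
import math

# Toy cooccurrence counts (noun -> {verb: count}); invented for illustration.
COOCCURRENCE = {
    "profit":  {"report": 4, "forecast": 3, "post": 2},
    "gain":    {"report": 3, "forecast": 2, "post": 2},
    "loss":    {"report": 4, "forecast": 2, "post": 3},
    "printer": {"install": 5, "repair": 3},
}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(count * b.get(verb, 0) for verb, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def synonym_candidates(target, threshold=0.9):
    """Return nouns whose relatedness to `target` meets the threshold, best first."""
    scored = [
        (noun, cosine(COOCCURRENCE[target], vector))
        for noun, vector in COOCCURRENCE.items()
        if noun != target
    ]
    return sorted(
        [(noun, score) for noun, score in scored if score >= threshold],
        key=lambda pair: pair[1],
        reverse=True,
    )


candidates = synonym_candidates("profit")
candidate_nouns = [noun for noun, _ in candidates]
```

In this toy data, "loss" occurs with the same verbs as "profit" and therefore passes the threshold together with "gain", although it is an antonym; a purely distributional measure cannot tell the two apart, so a person would have to remove it afterward.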
Further, in rapidly advancing fields such as the computer field, new words are coined one after another. These new words need to be handled promptly and appropriately in text mining.