The present invention relates to a computer system, a method, and a computer program for creating a terms dictionary with named entities or terminologies included in text data.
Named entity or terminology extraction is a natural language processing technique for extracting an expression or a term from a body of text data. The expression can belong to a specific word category (for example, a person's name, a company name, a disease name, a telephone number, or a chemical compound name). The term can belong to a specific specialized field. Named entity or terminology extraction is used in a wide variety of techniques, such as text mining and confidential information masking. One extraction method uses a list of expressions belonging to a vocabulary category or a terminology category as set data for an extractor of named entities or terminologies. The set data is generally referred to as a “dictionary”.
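By way of illustration only, dictionary-based extraction of this kind can be sketched as follows. The dictionary entries, the category labels, and the function name `extract_with_dictionary` are hypothetical and are not taken from any particular extractor; the sketch assumes a simple exact-substring lookup.

```python
# Hypothetical dictionary (set data): surface expression -> category label.
DICTIONARY = {
    "Acme Corporation": "company name",
    "aspirin": "chemical compound name",
    "influenza": "disease name",
}


def extract_with_dictionary(text, dictionary):
    """Return (start, expression, category) for every dictionary entry found in text."""
    hits = []
    for expression, category in dictionary.items():
        start = text.find(expression)
        while start != -1:
            hits.append((start, expression, category))
            start = text.find(expression, start + 1)
    # Sort by position so the hits follow the order of the text.
    return sorted(hits)


sample = "Acme Corporation tested aspirin on influenza patients."
matches = extract_with_dictionary(sample, DICTIONARY)

# An expression that is absent from the dictionary is simply not extracted.
unregistered = extract_with_dictionary("Zyxol is a new compound.", DICTIONARY)
```

In this sketch, an expression that is not registered in the dictionary produces no hit at all, so the coverage of the extractor is bounded by the coverage of the dictionary.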
When a named entity or terminology is not registered in the dictionary used in morphological analysis or the like, the named entity or terminology is treated as an unknown word. In this context, an unknown word is a word to which no word class is assigned in the morphological analysis. During extraction, unknown words can cause analysis errors. It is therefore necessary to create terms dictionaries covering a variety of named entities or terminologies. Many bodies of text for which extraction is used (such as newspaper articles) include a large number of named entities or terminologies, and this quantity can make it difficult to create terms dictionaries manually.
Some attempts have been made to acquire named entities or terminologies automatically by using machine learning algorithms. A typical example takes a morphological analysis result or a syntactic analysis result as input and learns a set of features. These features can be determined from only the word to be classified and a word adjacent to it, and each feature is associated with a conditional probability that the word is classified as a named entity. Such a method can determine, for example, that a word to be classified is a katakana noun and that the subsequent word represents incorporation. This type of machine learning algorithm readily achieves high accuracy at low cost. The machine learning algorithm, however, cannot ensure reliable classification, and therefore cannot be used in cases where omission of extraction is not permitted.
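The probabilistic character of such a method can be illustrated with a minimal sketch. The training examples, the choice of 株式会社 as the word representing incorporation, and the names `features` and `train` are assumptions made for illustration; the sketch estimates the conditional probability by simple counting and is not a specific published algorithm.

```python
from collections import Counter

# Hypothetical labelled examples: (word to classify, subsequent word, is named entity).
TRAINING = [
    ("ソニー", "株式会社", True),
    ("トヨタ", "株式会社", True),
    ("カメラ", "株式会社", True),
    ("パソコン", "株式会社", False),
    ("テレビ", "を", False),
    ("ラジオ", "が", False),
]


def is_katakana(word):
    """True if every character lies in the Unicode katakana block."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in word)


def features(word, next_word):
    """Feature built only from the word itself and the adjacent word."""
    return (is_katakana(word), next_word)


def train(examples):
    """Estimate P(named entity | feature) by relative frequency."""
    totals, positives = Counter(), Counter()
    for word, next_word, label in examples:
        feature = features(word, next_word)
        totals[feature] += 1
        if label:
            positives[feature] += 1
    return {feature: positives[feature] / totals[feature] for feature in totals}


model = train(TRAINING)
probability = model[(True, "株式会社")]
```

With this hypothetical data, the feature "katakana noun followed by 株式会社" is associated with a probability below 1, so a classifier built on it either admits false positives or omits some entities, depending on the threshold chosen.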
Moreover, there is a widely used method of automatically determining a word to be classified by pattern matching with regular expressions. Pattern matching, however, distinguishes words by their surface form but not by their meaning. A human must therefore recheck each word in order to distinguish its meaning. In cases where the words are rechecked by a human, however, it is inadvisable to rely on words that were cut out using surface information alone.
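The limitation of surface-only matching can be shown with a short sketch using Python's standard `re` module. The pattern and the two sample sentences are hypothetical and chosen only to make the point.

```python
import re

# Hypothetical surface pattern intended to capture short telephone numbers.
PHONE_LIKE = re.compile(r"\b\d{3}-\d{4}\b")

sentences = [
    "Call the office at 555-0199.",   # here the string is a telephone number
    "Replace part number 555-0199.",  # here the same string is a part number
]

# The pattern matches the identical surface string in both sentences.
surface_hits = [PHONE_LIKE.findall(sentence) for sentence in sentences]
```

Both sentences yield the same match, although only the first refers to a telephone number; deciding which is which requires information beyond the surface form.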
Another method is to perform pattern matching on a token sequence obtained as a result of morphological analysis. In this method, however, whether a pattern matches a token sequence depends in practice on the information surrounding the extraction target, and thus the method yields only a probabilistic result, in the same manner as machine learning.
Still another method is to obtain a vocabulary automatically by determining word classes for a combination of an unknown word and a conjunctional word of the unknown word, based on a morphological analysis result of Japanese text including kana, kanji, and alphanumeric characters.
Further, there is still another method that includes a manual editing process in which a person determines whether words around an unknown word are to be included as newly registered words.