The present invention relates to a computer system, a method, and a computer program for extracting terms from document data that includes a text segment.
Nowadays, there is a tremendous number of technical documents, e.g., requirement documents and specification documents. Thus, techniques for promptly understanding the content of technical documents are required. Extracting and presenting the terms which appear in a technical document is a useful solution for prompt understanding. Many methods of extracting terms from a text have been proposed. However, simply extracting terms results in a mere enumeration of many terms. Since general methods of extracting terms are not specialized for technical documents, a user needs to manually classify the terms by type after the terms have been extracted. Thus, application of such methods to technical documents is impractical.
Meanwhile, there is a known technique called named entity (NE) extraction, i.e., a technique for automatically extracting terms of a specific type such as a personal name, place name, or organization name. NE extraction requires development of both a dictionary for extracting the terms and an extraction rule. In order to create such a dictionary, a user must scrutinize the content of a technical document, and then determine which words are to be extracted as terms. Consequently, this technique is tremendously costly.
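By way of illustration only, the dictionary and extraction rule that NE extraction relies on might look like the following Python sketch. The dictionary entries, the title-based rule, and the function name extract_named_entities are all hypothetical and chosen for this example; they are not taken from any particular NE extraction system.

```python
import re

# Hypothetical hand-built dictionary: surface form -> entity type.
# In practice a user would have to compile this by reading the documents.
NE_DICTIONARY = {"Tokyo": "PLACE", "IBM": "ORGANIZATION"}

# Hypothetical extraction rule: a title followed by a capitalized word
# is treated as a personal name.
PERSON_RULE = re.compile(r"\b(?:Mr|Ms|Dr)\.\s+([A-Z][a-z]+)")


def extract_named_entities(text):
    """Return (term, type) pairs found by dictionary lookup and by the rule."""
    found = []
    # Dictionary lookup: match each known surface form as a whole word.
    for surface, ne_type in NE_DICTIONARY.items():
        if re.search(r"\b" + re.escape(surface) + r"\b", text):
            found.append((surface, ne_type))
    # Rule application: capture the word that follows a title.
    for match in PERSON_RULE.finditer(text):
        found.append((match.group(1), "PERSON"))
    return found
```

Even in this small sketch, every entry and every rule has to be written by hand, which is the source of the cost noted above.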
Japanese Patent Application Publication No. Hei 10-177575 describes calculating a temporary importance and calculating a formal importance based on the temporary importance. Specifically, a predetermined phrase is extracted from text data and a temporary importance is calculated based on information on at least one of words, parts of speech, and segments included in the extracted phrase. Then, a formal importance is calculated from the temporary importance in accordance with an appearance state of the phrase in the text data.
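The two-stage calculation described above can be sketched roughly as follows in Python. The part-of-speech weights, the logarithmic scaling by occurrence count used as a stand-in for the "appearance state", and all function names are assumptions made for this illustration; the description above does not specify the actual formulas of the cited publication.

```python
import math

# Hypothetical part-of-speech weights (assumed for illustration).
POS_WEIGHTS = {"NOUN": 1.0, "ADJ": 0.5, "VERB": 0.3}


def temporary_importance(phrase):
    """Sum the assumed weights of the parts of speech in the phrase.

    phrase is a list of (word, part_of_speech) pairs.
    """
    return sum(POS_WEIGHTS.get(pos, 0.0) for _, pos in phrase)


def count_occurrences(text, phrase):
    """Count case-insensitive occurrences of the phrase's surface form."""
    surface = " ".join(word for word, _ in phrase)
    return text.lower().count(surface.lower())


def formal_importance(temp_importance, occurrences):
    """Scale the temporary importance by log(1 + occurrences).

    The occurrence count is an assumed proxy for the appearance state.
    """
    return temp_importance * math.log(1 + occurrences)


# Usage: score one phrase against a short text.
text = "Term extraction is useful. Term extraction scales."
phrase = [("term", "NOUN"), ("extraction", "NOUN")]
temp = temporary_importance(phrase)
score = formal_importance(temp, count_occurrences(text, phrase))
```

In this sketch the formal importance grows with repeated appearances of the phrase, but it still yields only a ranked score and does not assign a type to the term.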