In recent years, various types of documents have been converted into electronic documents. It is important to make effective use of the various pieces of information written in these electronic documents, and natural language processing techniques have attracted attention as a means of doing so.
When the documents are processed semantically with natural language processing, cooccurrence word information is used in many cases.
For example, since words whose cooccurrence words are similar are considered to be semantically similar to each other, the semantic similarity of two words is calculated to be higher the more similar their cooccurrence words are. In addition, in hiragana-to-kanji conversion, a conversion candidate is determined to be more likely when it often cooccurs with a previously established word.
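The similarity calculation described above can be sketched as follows. This is a minimal illustration only: the text does not name a similarity measure, so cosine similarity is used here as one common choice, and the function name `cosine_similarity` and the cooccurrence counts are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two cooccurrence-count vectors.

    Each vector is a dict mapping a cooccurring word to its count.
    Cosine similarity is an assumed measure, not one named in the text.
    """
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical cooccurrence counts, for illustration only.
cooccurrences = {
    "curry": {"spicy": 2, "rice": 1},
    "stew": {"spicy": 2, "rice": 1},
    "car": {"road": 3},
}

sim_curry_stew = cosine_similarity(cooccurrences["curry"], cooccurrences["stew"])
sim_curry_car = cosine_similarity(cooccurrences["curry"], cooccurrences["car"])
```

Under these made-up counts, "curry" and "stew" share the same cooccurrence words and score close to 1, while "curry" and "car" share none and score 0.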
An example of a conventional cooccurrence dictionary creating system is disclosed in Patent Document 1.
The cooccurrence dictionary creating system of Patent Document 1 includes: a document analyzing section which analyzes a given document group; a word extracting section which extracts words existing in the given document group and causes a storage unit to store the extracted words; a word-chain extracting section which extracts word chains existing in the given document group and causes the storage unit to store the extracted word chains; a number-of-cooccurrences detecting section which detects the number of cooccurrences between each word and each word chain and causes the storage unit to store the number of cooccurrences; a concept information quantifying section which detects a cooccurrence degree in accordance with the number of cooccurrences, quantifies word concept information based on the detected cooccurrence degree, and causes the storage unit to store the quantified concept information; and a concept information database creating section which creates a database of the word concept information obtained by the concept information quantifying section.
The above-mentioned “word chain” is a chain of n consecutive words in a document, where n is two or more.
According to Patent Document 1, each sentence in a document group is first subjected to morphological analysis. Then, all words and word chains (chains constituted by two or more words) are extracted from the result of the morphological analysis and are stored in the storage unit. Subsequently, the number-of-cooccurrences detecting section extracts, from among the extracted independent words (nouns, pronouns, verbs, adjectives, adverbs) and word chains, those that cooccur, and counts the number of their appearances. The number-of-cooccurrences detecting section sends the counting result to the concept information quantifying section. Here, an appearance is counted when words or word chains cooccur within a predetermined document range. The “predetermined document range” is a document, a paragraph, or a sentence. Then, based on the counting result from the number-of-cooccurrences detecting section, the concept information quantifying section calculates the cooccurrence degree of each extracted word or word chain with each of the other words or word chains. Here, the cooccurrence degree is a value obtained by dividing the number of cooccurrences by the number of appearances of one of the words constituting the cooccurrence information, and normalizing the result.
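The counting and division steps described above can be sketched as follows. This is an illustrative sketch, not the implementation of Patent Document 1: it assumes the morphological analysis has already produced token lists, treats each list as one predetermined document range, counts a word at most once per range, and omits the final normalization because the text does not specify it. The function names and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(ranges):
    """Count appearances and pairwise cooccurrences.

    `ranges` is a list of token lists, one per predetermined document
    range (for example, one list per sentence). Assumption: a word is
    counted at most once per range.
    """
    appearances = Counter()
    pair_counts = Counter()
    for tokens in ranges:
        unique = sorted(set(tokens))
        appearances.update(unique)
        # Every unordered pair in the same range counts as one cooccurrence.
        pair_counts.update(combinations(unique, 2))
    return appearances, pair_counts


def cooccurrence_degree(word, other, appearances, pair_counts):
    """Number of cooccurrences divided by the appearances of `word`.

    The normalization mentioned in the text is unspecified and omitted.
    """
    if appearances[word] == 0:
        return 0.0
    pair = tuple(sorted((word, other)))
    return pair_counts[pair] / appearances[word]


# Hypothetical, already-tokenized ranges for illustration.
ranges = [
    ["curry", "spicy"],
    ["curry", "pickles"],
    ["pickles", "salty"],
    ["curry", "spicy", "rice"],
]
appearances, pair_counts = count_cooccurrences(ranges)
degree = cooccurrence_degree("curry", "spicy", appearances, pair_counts)
```

Because the divisor is the appearance count of only one of the two words, the value is asymmetric: here the degree of "spicy" with respect to "curry" differs from the degree of "curry" with respect to "spicy".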
The first problem of the conventional technique is that it is difficult to generate a high quality cooccurrence dictionary. This is because the cooccurrence dictionary creating system disclosed in Patent Document 1 collects all the cooccurrences within a predetermined range such as a document, a paragraph, or a sentence, so that the collected cooccurrences include ones that have no semantic relationship in practice. Consider, for example, a case of obtaining cooccurrence information from the sentence “Curry Wa Karai Ga, Fukujinzuke Wa Shoppai.” (these Japanese words mean ‘the curry is spicy, and the pickles are salty’; “Karai” means ‘spicy’, “Fukujinzuke” means ‘pickles’, and “Shoppai” means ‘salty’). According to the technique of Patent Document 1, “Curry, Karai”, “Curry, Fukujinzuke”, “Fukujinzuke, Shoppai”, “Curry, Shoppai”, “Fukujinzuke, Karai”, and the like are obtained as the cooccurrences. Here, the three cooccurrences “Curry, Karai”, “Curry, Fukujinzuke”, and “Fukujinzuke, Shoppai” are semantically appropriate. However, although “Curry, Shoppai” and “Fukujinzuke, Karai” are grammatically appropriate, these expressions are not usually used. As described above, the cooccurrence dictionary creating system disclosed in Patent Document 1 collects a great number of cooccurrences with low semantic relationships. This tendency becomes more marked as the range for obtaining the cooccurrences widens from a sentence to a paragraph to a document.
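The over-collection in the example sentence above can be reproduced with a short sketch. It assumes the four content words have already been extracted from the sentence and that every unordered pair within the sentence is collected; the `appropriate` set simply encodes the three pairs the text calls semantically appropriate, and all variable names are hypothetical.

```python
from itertools import combinations

# Content words of the example sentence, assumed already extracted.
content_words = ["Curry", "Karai", "Fukujinzuke", "Shoppai"]

# Collecting every pair within the sentence range, as the text describes.
all_pairs = list(combinations(content_words, 2))

# The three pairs the text identifies as semantically appropriate.
appropriate = {
    ("Curry", "Karai"),
    ("Curry", "Fukujinzuke"),
    ("Fukujinzuke", "Shoppai"),
}

kept = [pair for pair in all_pairs if pair in appropriate]
others = [pair for pair in all_pairs if pair not in appropriate]

for pair in others:
    print("collected but not listed as appropriate:", pair)
```

Half of the collected pairs fall outside the appropriate set even for this single short sentence, which illustrates why widening the range to a paragraph or a document makes the problem worse.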
The second problem of the conventional technique is that a large storage region is required for storing the cooccurrence information, so that the cooccurrence dictionary itself becomes large. This is because, in the cooccurrence dictionary creating system disclosed in Patent Document 1, the number of types of word chains for expressions constituted by multiple words (referred to as complex expressions) increases as the number of words in the document group or the chain length n increases. In order to store the cooccurrence degrees of the complex expressions, it is necessary, in the worst case, to provide a region for storing as many numerical values as the square of the number of types of word chains. For example, assume that 1000 words are used in the document group and n is 3. Then, the number of types of complex expressions becomes approximately one billion (=1000×1000×1000) in the worst case. That is, the cooccurrence dictionary creating system disclosed in Patent Document 1, which is configured to store all the cooccurrence degrees, requires in its cooccurrence dictionary a region for storing as many numerical values as the square of one billion.
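The worst-case arithmetic above can be checked with a few lines. This sketch assumes that only chains of exactly n words are counted and that one cooccurrence degree is stored for every ordered pair of chain types; the function name `worst_case_storage` is hypothetical.

```python
def worst_case_storage(vocabulary_size, n):
    """Worst-case counts for chains of exactly n words.

    Returns the number of distinct chain types and the number of
    cooccurrence degrees needed if every pair of types is stored.
    """
    chain_types = vocabulary_size ** n
    degree_entries = chain_types ** 2
    return chain_types, degree_entries


chain_types, degree_entries = worst_case_storage(1000, 3)
print(f"chain types: {chain_types:,}")
print(f"cooccurrence degrees to store: {degree_entries:,}")
```

With 1000 words and n = 3, this gives one billion chain types and 10^18 cooccurrence degrees, far beyond what a practical dictionary could hold.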
Patent Document 1: Japanese Unexamined Patent Publication, First Publication No. 2006-215850
Non-Patent Document 1: Akiko Aizawa, “Kyoki Ni Motozuku Ruijisei Syakudo (Cooccurrence-based Similarity Criteria)”, Operations Research as a Management Science Research, Vol. 52, No. 11, pp. 706-712, 2007
Non-Patent Document 2: T. Hofmann, “Probabilistic Latent Semantic Indexing”, Proc. of SIGIR '99, pp. 50-57, 1999
Non-Patent Document 3: M. A. Hearst, “TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages”, Computational Linguistics, Vol. 23, No. 1, pp. 33-64, 1997