Text mining is processing that takes as input a set of texts classified into a plurality of categories according to a certain classification axis, discovers a characteristic inherent to text belonging to a specific category designated as an analysis target, and selects a set of texts having that characteristic.
This enables a user to know a characteristic that a designated category has and a specific example (text) relevant to the characteristic.
In text mining in general, text is analyzed and classified based on words extracted from the text. As a result, a condition consisting of a word or a combination of words that serves as a characteristic (hereinafter referred to as “characteristic condition”) is output, and a set of texts matching the condition (hereinafter referred to as “characteristic text set”) is further output.
A text mining device according to related art extracts words from each text and then extracts, as a characteristic of the category, a word or a combination of words whose degree of association with text belonging to the category to be analyzed is high.
Accordingly, by using the appearance of an extracted word or combination of words as a characteristic condition, the text mining device according to the related art can output text including that word or combination of words as a characteristic text set. The characteristic condition and the characteristic text set constitute the mining result.
One example of a text mining device of this kind is recited in Literature 1 (Japanese Patent Laying-Open No. 2003-141134).
The text mining device recited in Literature 1 has a characteristic word extraction processing unit for extracting a characteristic word appearing in text as a target of mining, an analysis axis setting processing unit for setting a classification axis as a reference for analysis and a relating word acquisition processing unit for extracting a word having a high degree of association with each category of the classification axis, thereby extracting a characteristic word in each category of the classification axis set by a user as a reference for analysis.
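The kind of characteristic word extraction described above can be illustrated by the following minimal sketch. It is not the method of Literature 1. The scoring rule (the difference between the fraction of texts containing a word inside the target category and the fraction outside it), the whitespace tokenization, and the function names characteristic_words and characteristic_text_set are all assumptions introduced only for illustration.

```python
from collections import Counter


def characteristic_words(texts, labels, target, top_n=3):
    """Rank words by an assumed association score with the target category.

    Score = (share of target-category texts containing the word)
          - (share of other texts containing the word).
    """
    in_docs = [set(t.lower().split()) for t, l in zip(texts, labels) if l == target]
    out_docs = [set(t.lower().split()) for t, l in zip(texts, labels) if l != target]
    if not in_docs:
        return []

    # Document frequency of each word inside and outside the target category.
    in_df = Counter(w for d in in_docs for w in d)
    out_df = Counter(w for d in out_docs for w in d)

    scores = {}
    for word, n in in_df.items():
        p_in = n / len(in_docs)
        p_out = out_df[word] / len(out_docs) if out_docs else 0.0
        scores[word] = p_in - p_out

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


def characteristic_text_set(texts, word):
    """Return the texts in which the characteristic word appears."""
    return [t for t in texts if word in t.lower().split()]


texts = [
    "battery drains fast",
    "battery will not charge",
    "screen is cracked",
    "screen flickers often",
]
labels = ["power", "power", "display", "display"]

top = characteristic_words(texts, labels, "power", top_n=1)
matching = characteristic_text_set(texts, top[0])
```

In this sketch the top-ranked word plays the role of the characteristic condition and the list of matching texts plays the role of the characteristic text set.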
Another related art text mining device extracts a word from each text to divide text belonging to a category to be analyzed into a plurality of clusters based on a word appearance tendency. In this text mining device, a condition defining a cluster will form a characteristic condition, text belonging to the cluster will form a characteristic text set, and the characteristic condition and the characteristic text set will be a mining result.
One example of a text mining device of this kind is recited in Literature 2 (Japanese Patent Laying-Open No. 2001-101194).
The text mining device recited in Literature 2 has a word cut-out unit for extracting a word from text as a mining target, and a cluster generation unit for evaluating association between extracted words to generate a cluster with text including a set of words whose degree of association is not less than a prescribed value as the same cluster, thereby dividing the text to be mined into a plurality of clusters.
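The clustering approach described above can be illustrated by the following minimal sketch. It is not the method of Literature 2. The association measure (Jaccard overlap of the sets of texts in which two words appear), the default threshold of 0.5, the rule that assigns each text to the word group covering most of its words, and the function name cluster_texts are assumptions introduced only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cluster_texts(texts, threshold=0.5):
    """Group texts through words whose assumed association meets a threshold."""
    docs = [set(t.lower().split()) for t in texts]

    # For each word, the indices of the texts that contain it.
    postings = defaultdict(set)
    for i, doc in enumerate(docs):
        for word in doc:
            postings[word].add(i)

    # Union-find over words: associated words end up in the same group.
    parent = {w: w for w in postings}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for a, b in combinations(sorted(postings), 2):
        inter = len(postings[a] & postings[b])
        union = len(postings[a] | postings[b])
        if inter / union >= threshold:
            parent[find(a)] = find(b)

    # Assign each text to the word group that covers most of its words.
    clusters = defaultdict(list)
    for i, doc in enumerate(docs):
        if not doc:
            continue  # an empty text has no words to vote with
        votes = defaultdict(int)
        for word in doc:
            votes[find(word)] += 1
        best = min(votes, key=lambda g: (-votes[g], g))
        clusters[best].append(i)
    return sorted(clusters.values())


texts = [
    "battery drains fast",
    "battery drains quickly",
    "screen cracked badly",
    "screen cracked again",
]
clusters = cluster_texts(texts)
```

Each returned list holds the indices of the texts in one cluster. The set of associated words defining a cluster corresponds to the characteristic condition, and the texts in the cluster correspond to the characteristic text set.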
The related art text mining enables a characteristic (characteristic condition) peculiar to text belonging to a specific category to be discovered and a set of texts having the characteristic (characteristic text set) to be obtained.
In general, however, numerous texts satisfy the same characteristic condition, which makes it difficult for a user to see all the texts in a characteristic text set.
One example of a device for generating an index that determines the order in which a user sees texts in a characteristic text set is recited in Literature 3 (Japanese Patent Laying-Open No. 2004-86351).
The text information analysis system recited in Literature 3 has a vector calculation unit for obtaining a vector indicative of each text belonging to a characteristic text set, a center of gravity calculating unit for calculating a center of gravity of each vector, and a degree of association calculating unit for obtaining a degree of association between text and a characteristic text set from a relationship between a vector and a center of gravity, thereby assigning a degree of association with a characteristic text set to each text in the characteristic text set.
This arrangement allows a user to, at the time of seeing texts in a characteristic text set, sequentially see the texts in a descending order of a degree of association with a characteristic text set.
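The ordering by degree of association described above can be illustrated by the following minimal sketch. It is not the method of Literature 3. The use of word-count vectors, the use of cosine similarity as the relationship between a vector and the center of gravity, and the function name rank_by_centroid are assumptions introduced only for illustration.

```python
import math
from collections import Counter


def rank_by_centroid(texts):
    """Return (score, text) pairs sorted by similarity to the set's centroid."""
    if not texts:
        return []

    # One word-count vector per text over a shared vocabulary.
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vectors.append([counts[w] for w in vocab])

    # Center of gravity: the component-wise mean of all vectors.
    n = len(vectors)
    centroid = [sum(v[j] for v in vectors) / n for j in range(len(vocab))]

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    scored = [(cosine(v, centroid), t) for v, t in zip(vectors, texts)]
    # Descending order of the assumed degree of association.
    return sorted(scored, key=lambda pair: -pair[0])


ranked = rank_by_centroid([
    "battery drains fast",
    "battery drains",
    "battery drains overnight quickly today",
])
```

A user reading the texts in the returned order sees first the text that is closest to the center of gravity of the characteristic text set.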
Literature 1: Japanese Patent Laying-Open No. 2003-141134.
Literature 2: Japanese Patent Laying-Open No. 2001-101194.
Literature 3: Japanese Patent Laying-Open No. 2004-86351.
When voice data is formed into text by voice recognition, not all the spoken words are correctly recognized and there is a case where a spoken word is erroneously recognized as a different word. This is also the case with forming image data into text by character recognition, in which not all the written words are correctly recognized and there is a case where a written word is erroneously recognized as a different word.
Text generated by voice recognition or character recognition might include an erroneous word.
The related art text mining devices, however, are premised on input of texts electronically generated in advance, and give no consideration to input of text which might include an erroneously recognized word (recognition error), such as a voice recognition result of voice data or a character recognition result of image data.
There might therefore occur a case where even when a user, in order to know a representative example of text satisfying a certain characteristic condition, reads text in a characteristic text set corresponding to the characteristic condition (or original voice data or image data of the text), the user fails to understand the contents due to inclusion of a recognition error in the text.
Also, a user cannot know in advance which text in a characteristic text set has a reduced number of recognition errors and has contents that are easy to understand.
In addition, when the text mining device receives input of text including a recognition error, its mining result might include an error. In other words, a characteristic condition and a characteristic text set obtained as a result of mining are not always correct.
Moreover, whether an output characteristic condition is really characteristic in text belonging to a designated category or whether text in an output characteristic text set really satisfies a characteristic condition can be determined only by actual reference by a user to text in the characteristic text set (or original voice data or image data of the text).
Therefore, in order to obtain a representative example of text satisfying a certain characteristic condition, even when a user refers to text in a characteristic text set corresponding to the characteristic condition (or original voice data or image data of the text), the text might fail to actually satisfy the characteristic condition because of inclusion of a recognition error in the text (in other words, the text might not be an appropriate representative example of text satisfying the characteristic condition).
In addition, the user cannot know in advance which text in a characteristic text set will be an appropriate representative example.
Moreover, there might be a case where most of the texts in a characteristic text set corresponding to a certain characteristic condition include a recognition error, so that hardly any of the texts actually satisfy the characteristic condition.
In this case, it is highly possible that the characteristic condition is not characteristic in practice in text belonging to a designated category.
In a characteristic text set corresponding to the same characteristic condition, however, there in general exist text really satisfying the characteristic condition and text failing to satisfy the characteristic condition in practice, so that it is difficult to determine whether the characteristic condition is appropriate as a mining result or not only by reference to a part of the texts.
Even when text that has been referred to fails to satisfy a characteristic condition in practice, for example, there is a possibility that such text was referred to by accident, so that no determination can be made from that fact alone as to whether the characteristic condition is appropriate as a mining result.
As described in the foregoing, the first problem of the related art text mining device is that when text mining is executed with respect to text including a recognition error, it is impossible for a user to select a representative example of text which has a reduced number of recognition errors therein, whose contents are easy to understand and which satisfies a certain characteristic condition.
The reason is that a user is provided with no information indicating approximately how many recognition errors each text in a characteristic text set includes.
The second problem is that when text mining is executed with respect to text including a recognition error, it is impossible to prevent a user from selecting text erroneously considered to satisfy a characteristic condition due to a recognition error in the text.
The reason is, similarly to the reason of the first problem, that a user is provided with no information that indicates approximately how many recognition errors each text in a characteristic text set includes.
The third problem is that when text mining is executed with respect to text including a recognition error, it is difficult for a user to determine whether a characteristic condition is appropriate or not by referring to a part of texts in a characteristic text set obtained as a result of the mining.
The reason is that a user is provided with no information that indicates approximately how many texts which have a possibility of actually satisfying the characteristic condition exist in the characteristic text set.
An exemplary object of the present invention is to provide a text mining device capable of presenting text having little possibility of including a recognition error therein as a representative example of text satisfying a certain characteristic.
Another exemplary object of the present invention is to provide a text mining device capable of presenting, as a representative example of text satisfying a certain characteristic, text having little possibility of being erroneously considered to satisfy the characteristic condition due to a recognition error in the text.
A further exemplary object of the present invention is to provide a text mining device enabling a user to determine whether a mining result is appropriate or not by referring to a part of texts in a text set having a common characteristic.