Recently, as the volume of electronic text information has increased, text information analysis techniques that classify or arrange a plurality of texts according to the content of each text have attracted users' attention. Text information analysis techniques include a text classification technique (categorization/classification) and a text clustering technique. In the text classification technique, N categories are determined in advance and each text is classified into at least one of the N categories. In the text clustering technique, categories are not determined in advance; a similarity degree between texts is determined, and the plurality of texts are classified into an arbitrary number of groups according to the similarity degree.
In the text classification technique, the suitability between each text and each of the N categories is decided. Accordingly, the processing of each text comprises N steps and is executed at relatively high speed. However, if the content of a text is not similar to the feature of any category, the text is not classified. In particular, when new text content occurs daily, i.e., when the tendency of text content changes daily, classification using predetermined categories is often impossible. In this case, a new category must be set automatically or by hand.
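The N-step decision described above can be illustrated with a minimal sketch. It assumes a bag-of-words representation, cosine similarity as the suitability measure, and a fixed threshold; none of these choices is specified in the text, and the names `vectorize`, `cosine`, `classify`, and `CATEGORIES` are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (whitespace tokenization, lowercased)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, category_features, threshold=0.2):
    """Return the categories whose feature vector is similar enough to the text.

    One similarity computation is made per category, so each text costs
    N steps. An empty result means the text stays unclassified.
    The threshold value is an assumption chosen for illustration.
    """
    vec = vectorize(text)
    return [name for name, feature in category_features.items()
            if cosine(vec, feature) >= threshold]


# Hypothetical categories, each described by a handful of feature words.
CATEGORIES = {
    "sports": vectorize("game team score player match"),
    "finance": vectorize("stock market price bank interest"),
}
```

Under these assumptions, `classify("the team won the match", CATEGORIES)` matches only the sports category, while a text on an unseen topic such as `"new vaccine trial results announced"` returns an empty list, which corresponds to the case where a new category would have to be set.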
On the other hand, in the text clustering technique, the content of the texts drives the analysis. Accordingly, this technique is effective for texts with unknown content. However, in general, the computational cost is enormous. To cluster m texts, a similarity degree must be calculated between each pair of the m texts, so the number of processing steps is on the order of the square of m.
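The quadratic cost can be made concrete with a small sketch that counts the similarity computations. It assumes Jaccard similarity on word sets and groups texts by single-link thresholding with a union-find structure; these are illustrative choices rather than the method of any particular system, and the names `jaccard` and `cluster` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between two sets of words."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster(texts, threshold=0.3):
    """Group texts whose pairwise similarity reaches the threshold.

    Returns (clusters, comparisons), where clusters is a list of index
    lists and comparisons is the number of similarity computations made.
    The threshold value is an assumption chosen for illustration.
    """
    word_sets = [set(t.lower().split()) for t in texts]
    parent = list(range(len(texts)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    comparisons = 0
    # Every unordered pair of texts is compared exactly once.
    for i, j in combinations(range(len(texts)), 2):
        comparisons += 1
        if jaccard(word_sets[i], word_sets[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values()), comparisons
```

For m texts the loop performs m(m-1)/2 comparisons, which grows with the square of m: 4 texts need 6 comparisons, while 10,000 texts need about 50 million.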
In a conventional text information analysis system, only one of the text classification technique and the text clustering technique is used. However, both techniques have defects. Meanwhile, a large number of unknown texts appear daily, and these unknown texts cannot always be classified into a predetermined category. Accordingly, it is difficult to satisfy the actual need to classify and arrange unknown texts quickly.