The Hearst method is a well-known document segmentation method that detects topic boundaries (see M. A. Hearst, “Multi-paragraph segmentation of expository text,” Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, pp. 9-16, 1994). This method first defines windows of a certain size before and after a topic boundary candidate point and obtains a similarity based on which terms appear in each window. If the obtained similarity is high, the topic relatedness between the windows is high and the candidate point is judged not to be a boundary point. If the obtained similarity is low, the cohesion between the windows is regarded as low and the candidate point may be a boundary point. More specifically, the topic boundary candidate point is moved at regular intervals from the beginning to the end of the document, a similarity value is obtained for each point, and a point at which the obtained similarity value is minimal can be selected as a topic boundary.
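The window comparison described above can be illustrated with the following minimal sketch. It is given only as an aid to understanding and is not the implementation of the cited paper. It assumes that the document has already been split into a list of terms, that the similarity is the cosine of the term-frequency vectors of the two windows, and that the names cosine_similarity, boundary_scores, select_boundary, window_size and step are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(left_terms, right_terms):
    """Cosine of the term-frequency vectors of two windows (assumed measure)."""
    left, right = Counter(left_terms), Counter(right_terms)
    # Only terms present in both windows contribute to the dot product.
    dot = sum(left[t] * right[t] for t in left.keys() & right.keys())
    norm = math.sqrt(sum(c * c for c in left.values())) * math.sqrt(
        sum(c * c for c in right.values())
    )
    return dot / norm if norm else 0.0


def boundary_scores(terms, window_size, step=1):
    """Move the candidate point at regular intervals and score each position."""
    scores = []
    for point in range(window_size, len(terms) - window_size + 1, step):
        before = terms[point - window_size:point]
        after = terms[point:point + window_size]
        scores.append((point, cosine_similarity(before, after)))
    return scores


def select_boundary(terms, window_size, step=1):
    """Return the candidate point with the minimal similarity, or None."""
    scores = boundary_scores(terms, window_size, step)
    if not scores:
        return None
    return min(scores, key=lambda item: item[1])[0]


# Toy document: four terms on one topic followed by four on another.
terms = ["ball", "bat", "ball", "bat", "stock", "bond", "stock", "bond"]
boundary = select_boundary(terms, window_size=2)
```

In this toy input the two windows share no terms only at the point between the fourth and fifth terms, so that point receives the minimal similarity and is selected.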
The Hearst method compares the terms occurring in the windows defined before and after the candidate point in order to detect topic discontinuity between the windows. This method, however, has the following shortcomings. The first is related to the window size. Because the window size is selected arbitrarily, the distance between topic boundaries, that is, the length of the topics, tends to be longer when the selected window size is large and shorter when the selected window size is small. It is therefore difficult to properly divide a document containing many topics of different lengths. The second shortcoming is related to the method for detecting topic similarity between the windows before and after the candidate point. The conventional method cannot obtain a similarity unless the same terms occur in both windows, because it determines the similarity based on the commonality of terms between the windows. In reality, however, when one of a pair of terms that are relevant to each other in the document appears in the first window and the other appears in the second window, topic similarity can be considered to exist between the windows. For example, if a sentence in an article about baseball contains both “Giants” and “Matsui,” these can be regarded as relevant terms. Therefore, if “Giants” appears in the first window and “Matsui” appears in the second window, topic similarity can be regarded as existing between these windows even if the windows contain no common terms. The conventional method cannot detect such similarity because it focuses only on the commonality of terms. Thus the conventional method has a problem with the accuracy of the topic similarity.
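The second shortcoming can be illustrated with the following minimal sketch, which is not taken from any particular system. It assumes that sentences and windows are given as lists of terms, and the names common_term_overlap, related_pairs and has_related_pair_across are hypothetical. A measure based only on common terms finds nothing shared between the two windows, whereas a check based on co-occurrence within a sentence finds the related pair “Giants” and “Matsui” across them.

```python
from itertools import combinations


def common_term_overlap(first_window, second_window):
    """Number of distinct terms shared by both windows (conventional view)."""
    return len(set(first_window) & set(second_window))


def related_pairs(sentences):
    """Term pairs that co-occur in at least one sentence (assumed relevance)."""
    pairs = set()
    for sentence in sentences:
        # Sort so that each unordered pair has a single canonical form.
        for a, b in combinations(sorted(set(sentence)), 2):
            pairs.add((a, b))
    return pairs


def has_related_pair_across(first_window, second_window, pairs):
    """True if a related pair has one term in each window."""
    return any(
        tuple(sorted((a, b))) in pairs
        for a in set(first_window)
        for b in set(second_window)
        if a != b
    )


# Hypothetical data: one sentence from a baseball article and two windows.
sentences = [["Matsui", "joined", "Giants"]]
first_window = ["Giants", "won"]
second_window = ["Matsui", "homered"]

pairs = related_pairs(sentences)
overlap = common_term_overlap(first_window, second_window)
related = has_related_pair_across(first_window, second_window, pairs)
```

Here overlap is zero because the windows share no term, while related is true because the two terms appear together in one sentence of the document.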