Text segmentation, that is, the technique of segmenting a text composed of strings of words or characters into units of semantically coherent topics, is among the critical fundamental techniques in natural language processing. By segmenting a text by topic, it becomes possible to
classify a huge amount of text into each topic,
extract the overall structure of the text, and
prepare summaries of the respective topics.
On the other hand, video contents are being distributed in ever larger quantities. Text segmentation may be applied to a transcript of the speech contained in a video, or to a text representing the results of speech recognition, so as to improve the ease of viewing or retrieving the video contents. Thus, the importance of the text segmentation technique is increasing.
Techniques for text segmentation may roughly be classified into two types. These two types are now described in detail with reference to the drawings.
The first technique detects, as a boundary between topics, a change point of the word distribution in the input text targeted for segmentation. This technique postulates that the same word distribution persists within an interval belonging to a single topic in the input text. A representative example of the first technique is the Hearst method described in Non-Patent Document 1 (first related technique).
FIG. 10 schematically shows the operation of the Hearst method. Referring to FIG. 10, the Hearst method sets a window of a constant width at each position of the input text and finds the word distribution within each window. The word distribution in one window is compared with that in the neighboring window to detect a point of abrupt change in the word distribution. This point is taken to be a topic boundary. As the word distribution, a unigram obtained by counting the frequency of occurrence of each word in the window is often used. Alternatively, the frequency of occurrence of pairs, triples or the like of neighboring words may be used. To detect a point of abrupt change, it suffices to compute the degree of similarity between the word distributions of neighboring windows by, for example, cosine similarity, and to find a local minimum of the resulting sequence of similarity values that is less than or equal to a threshold value. If, in FIG. 10, th2 is set as the threshold value of the degree of similarity, segmentation points H1 to H7 are obtained. If th3 is set as the threshold value, segmentation points H2 and H6 are obtained.
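For illustration, the change-point detection described above may be sketched as follows. This is a minimal sketch, not the Hearst method as published: the function names, the fixed window width and threshold, and the omission of the smoothing step are simplifying assumptions.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two word-frequency vectors (unigrams)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hearst_segment(words, window=20, threshold=0.1):
    """Return candidate topic boundaries: positions where the similarity
    between the word distributions of the left and right windows is a
    local minimum at or below the threshold."""
    positions = list(range(window, len(words) - window + 1))
    sims = []
    for i in positions:
        left = Counter(words[i - window:i])    # window ending at i
        right = Counter(words[i:i + window])   # window starting at i
        sims.append(cosine(left, right))
    boundaries = []
    for k in range(1, len(sims) - 1):
        # a local minimum of the similarity sequence, at or below threshold
        if sims[k] <= sims[k - 1] and sims[k] <= sims[k + 1] and sims[k] <= threshold:
            boundaries.append(positions[k])
    return boundaries
```

A text whose vocabulary changes sharply at one position yields that position as the single boundary, while a homogeneous text yields no boundary at all, illustrating the dependence on the threshold parameter discussed below.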
It is seen from the above that the Hearst method always outputs some segmentation result, regardless of the input text presented.
However, in the Hearst method, there are a variety of parameters that control the results of segmentation, such as
window width,
threshold value of the degree of similarity or
the number of smoothing operations applied to the sequence of similarity values. Depending on the values of these parameters, the granularity of the topic units into which the input text is segmented changes.
The second technique for text segmentation possesses knowledge concerning a variety of topics and utilizes that knowledge to segment the input text into the respective topics. An example of this second technique is shown in Non-Patent Document 2.
FIG. 11 schematically shows the operation of the technique disclosed in Non-Patent Document 2 (second related technique). Referring to FIG. 11, this technique first learns statistical models, that is, topic models, for a variety of topics such as ‘baseball’ or ‘exchange’, using a text corpus segmented on a per-topic basis, such as newspaper articles. As the topic model, a unigram model that has learned the frequency of occurrence of each word appearing in a topic may, for example, be used. If the probability of a transition between topics is set appropriately, a sequence of topic models optimally matched to the input text may then be found, along with the positions of the change points of the topics. That is, the input text may be segmented into topic units. The input text may be aligned with the topic models by a computational method such as a frame-synchronous search, in a manner similar to the technique frequently used in speech recognition. Making the correspondence between the input text and the topic models is similar to making the correspondence between an input speech waveform and phoneme models, which is in widespread use in speech recognition.
In this manner, an interval of the input text relating to topics for which topic models are provided in advance may be segmented into topic units using these models. Referring to FIG. 11, the topic models of ‘baseball’, ‘exchange’, ‘soccer’ and ‘general election’, provided in advance, are matched to the corresponding intervals of the input text. The input text may thus be segmented into these respective topics to give segmentation points M1 to M3 and M5 to M7.
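The matching of a topic-model sequence to the input text described above can be sketched, for illustration, as a Viterbi search over per-word topic labels. This is a simplified stand-in for the frame-synchronous search of the second related technique: the function name, the fixed topic-switch penalty, and the floor probability for unseen words are illustrative assumptions.

```python
from math import log

def viterbi_topic_segment(words, topic_models, switch_logprob=-5.0):
    """Find the most likely topic label for each word under per-topic
    unigram models, with a fixed log-probability penalty for switching
    topics (standing in for the topic-transition probability).
    topic_models: {topic: {word: probability}}."""
    topics = list(topic_models)
    floor = 1e-6  # smoothing probability for unseen words (arbitrary choice)

    def emit(t, w):
        return log(topic_models[t].get(w, floor))

    # Initialize scores with the first word.
    score = {t: emit(t, words[0]) for t in topics}
    back = []
    for w in words[1:]:
        new_score, new_back = {}, {}
        for t in topics:
            # Best predecessor: stay in t for free, or switch with a penalty.
            cands = {tp: score[tp] + (0.0 if tp == t else switch_logprob)
                     for tp in topics}
            best_prev = max(cands, key=cands.get)
            new_score[t] = cands[best_prev] + emit(t, w)
            new_back[t] = best_prev
        back.append(new_back)
        score = new_score
    # Trace back the best topic sequence; changes of label mark boundaries.
    best = max(score, key=score.get)
    path = [best]
    for b in reversed(back):
        best = b[best]
        path.append(best)
    path.reverse()
    return path
```

Positions where the returned label sequence changes correspond to the topic change points; intervals whose vocabulary matches no provided model are still forced onto the closest model, which motivates the combined approach described next.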
Patent Document 1 discloses a technique that combines the feature of the first technique, namely detecting a change point of the word distribution in the input text, with the feature of the second technique, namely utilizing knowledge concerning topics, in order to segment the input text on a per-topic basis. The invention disclosed in Patent Document 1 will now be described in detail as the third related technique.
In the invention disclosed in Patent Document 1, the time series of text obtained from captions or speech in a video is segmented on a per-topic basis, with a view to segmenting the video on a per-topic basis. It is postulated that some text information regarding each topic is obtained beforehand, as the knowledge regarding each topic desired to be obtained as a result of segmentation. This text information regarding each topic is referred to below as script data.
The operation of the invention disclosed in Patent Document 1 is now briefly described. Initially, the time series of text extracted from the video is segmented in accordance with the first technique. It is then verified whether or not the text of each interval resulting from the segmentation is similar to the text information regarding each topic obtained from the script data. Any interval not similar to any of the topics in the script data is repeatedly subjected to finer segmentation by the first technique.
Taking the case of segmenting a news program into individual news items, the operation of the invention disclosed in Patent Document 1 will now be described in detail with reference to the drawings.
FIG. 12 is a diagram showing the configuration of FIG. 2 of Patent Document 1. It should be noted that the reference numerals used in FIG. 12 differ from those of FIG. 2 of Patent Document 1. Referring to FIG. 12, a news program, which is the target for segmentation, is stored in a video data memory 602. The title text of each news item is stored in a script data memory 601 as the text information regarding each news item, that is, the topic unit desired to be obtained as a result of segmentation.
Initially, a script text interval acquisition means 603 refers to the script data memory 601 to acquire a title text of each news item.
A video text interval generation means 604 then segments the time series of text obtained from the captions or the speech in the news program by the first technique, that is, by the technique of detecting change points of the word distribution, using a suitable parameter. The text of each interval resulting from the segmentation is output as a video text interval.
A text similarity degree computing means 605 then computes the degree of similarity between the text of each video text interval, resulting from segmentation by the video text interval generating means 604, and the title text of each news item as obtained by the script text interval acquisition means 603.
A text associating means 606 associates with each video text interval the news item whose title text is most similar to the text of the interval in question, provided that the degree of similarity is higher than a preset threshold value.
A recursive processing control means 607 changes the parameter for any video text interval not associated with a news item by the text associating means 606. The parameter is changed so as to allow for more fine-grained segmentation by the video text interval generation means 604. The recursive processing control means 607 then causes the processing by the video text interval generation means 604, the text similarity degree computing means 605 and the text associating means 606 to be performed repeatedly.
When news items have been associated with all of the video text intervals, or the parameter has reached a preset limit value, the iterative processing is brought to an end.
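For illustration, the iterative association procedure of the third related technique can be sketched as follows. This is a minimal sketch under stated assumptions: a fixed chunk size stands in for the change-point segmentation of the first technique, halving the chunk size stands in for the parameter change, and the function names and threshold are hypothetical.

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def recursive_associate(words, titles, chunk=8, min_chunk=2, threshold=0.2):
    """Segment `words` into intervals, associate each interval with the
    most similar title text, and re-segment any unassociated interval
    more finely until it is associated or the parameter limit is hit.
    Neighboring intervals with the same item are integrated."""
    # Segmentation step (a fixed chunk size stands in for change-point detection).
    intervals = [words[i:i + chunk] for i in range(0, len(words), chunk)]
    labeled = []
    for iv in intervals:
        sims = {t: cosine(Counter(iv), Counter(t.split())) for t in titles}
        best = max(sims, key=sims.get)
        if sims[best] >= threshold:
            labeled.append((best, iv))            # association succeeded
        elif chunk // 2 >= min_chunk:
            # Recursive step: finer-grained re-segmentation of this interval.
            labeled.extend(recursive_associate(iv, titles, chunk // 2,
                                               min_chunk, threshold))
        else:
            labeled.append((None, iv))            # unassociated at the limit
    # Integration step: merge neighboring intervals sharing the same item.
    merged = []
    for label, iv in labeled:
        if merged and merged[-1][0] == label:
            merged[-1] = (label, merged[-1][1] + iv)
        else:
            merged.append((label, iv))
    return merged
```

The three stages of the loop mirror the means of FIG. 12: segmentation (604), similarity computation and association (605, 606), recursive refinement (607), and integration of neighboring intervals (608).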
In case the same news item corresponds to neighboring video text intervals, a video text interval integrating means 608 integrates these intervals into one and outputs the integrated intervals as the final result of segmentation.
Non-Patent Document 1: Marti A. Hearst, "Multi-paragraph segmentation of expository text," 32nd Annual Meeting of the Association for Computational Linguistics, pp. 9-16, 1994
Non-Patent Document 2: J. P. Yamron, I. Carp, L. Gillick, S. Lowe and P. van Mulbregt, "A hidden Markov model approach to text segmentation and event tracking," IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 333-336, 1998
Non-Patent Document 3: Takafumi Koshinaka, Akitoshi Okumura and Ryosuke Isotani, "An HMM-Based Text Segmentation Method Using Variational Bayes Inference and Its Application to Visual Indexing," Transactions of the Institute of Electronics, Information and Communication Engineers (IEICE), vol. J89-D, No. 9, pp. 2113-2122, 2006
Patent Document 1: JP Patent Kokai JP-A-2005-167452 (FIG. 2)