1. Field of the Invention
The present invention generally relates to real time topic detection, and more particularly to the use of likelihood based methods for segmenting textual data and identifying segment topics.
2. Background Description
The problem of automatically dividing a text stream into topically homogeneous blocks in real time arises in many fields that include a topic detection task: command recognition, speech recognition, machine translation, event detection, language modeling, etc. In each of these fields there exist applications that require real time segmentation of text. For example, closed captioning via automatic speech recognition would be improved significantly by real time topic identification.
Several methods exist for dealing with the problem of text segmentation. In general, the approaches fall into two classes: 1) content-based methods, which look at topical information such as n-grams or IR similarity measures; and 2) structure- or discourse-based methods, which attempt to find features that characterize story openings and closings.
Some of these methods are based on semantic word networks (see H. Kozima, "Text Segmentation Based on Similarity between Words", in Proceedings of the ACL, 1993), vector space techniques from information retrieval (see M. A. Hearst, "Multi-paragraph Segmentation of Expository Texts", in Proceedings of the ACL, 1994. Proc Eurospeech '93, pp. 2203-2206, Berlin), and decision tree induction algorithms (see D. J. Litman and R. J. Passonneau "Combining Multiple Knowledge", in Proceedings of the ACL, 1995). Several approaches to segmentation are described in Jonathan P. Yamron, et al., "Event Tracking and Text Segmentation via Hidden Markov Models", in 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings (IEEE, New York: 1997), pp. 519-26.
Several approaches to topic detection are described in DARPA, Broadcast News Transcription and Understanding Workshop, Feb. 8-11, 1998. Some of them (e.g. Sadaoki Furui, Koh'ichi Takagi, Atsushi Iwasaki, Katsutoshi Ohtsuki and Shoichi Matsunaga, "Japanese Broadcast News Transcription and Topic Detection") require all words in an article to be presented in order to identify the topic of the article. A typical approach to topic identification is to define a set of key words for each topic and to count the frequencies of those key words in the text.
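The key word counting approach mentioned above can be illustrated with a minimal sketch. It is not taken from any of the cited works; the topic names, the key word lists, and the function name `identify_topic` are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical key word lists; the cited works do not specify these.
TOPIC_KEYWORDS = {
    "finance": {"bank", "stock", "market", "loan"},
    "medicine": {"doctor", "patient", "hospital", "dose"},
}


def identify_topic(words, topic_keywords=TOPIC_KEYWORDS):
    """Pick the topic whose key words occur most often in `words`.

    Returns (topic, scores). `topic` is None when no key word of any
    topic has been seen.
    """
    # Count every word once, case-insensitively.
    counts = Counter(w.lower() for w in words)
    # A topic's score is the total frequency of its key words.
    scores = {
        topic: sum(counts[k] for k in keys)
        for topic, keys in topic_keywords.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores


if __name__ == "__main__":
    text = "The patient saw the doctor at the bank".split()
    print(identify_topic(text))
```

Under these assumptions, a text whose opening words contain no key word (for example "good evening and welcome") yields no topic at all, and words outside the key word lists never influence the score.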
These methods are not very successful at detecting the topical changes present in the data. For example, model-based segmentation and metric-based segmentation rely on setting measurement thresholds, which lack stability and robustness. In addition, model-based segmentation does not generalize to unseen textual features. Furthermore, the problem with textual segmentation via hierarchical clustering is that it is often difficult to determine the number of clusters. All these methods lead to a relatively high segmentation error rate and are not effective for real time applications. Therefore, new complementary segmentation methods are needed.
Concerning known topic identification methods, one of their deficiencies is that they are not suitable for real time tasks, since they require all of the data to be presented. Another deficiency is their reliance on a small set of key words for topic detection. This makes real time topic detection difficult, since key words are not necessarily present at the onset of a topic. Another problem with key words is that a change of topic affects not only the frequencies of key words but also the frequencies of other (non-key) words. Exclusive use of key words does not allow one to measure the contribution of these other words to topic detection.