1. Field of the Invention
The present invention generally relates to off-time (e.g., off-line) topic detection, and more particularly to the off-line detection of topical changes and topic identification in texts, for use in such practical applications as improving automatic speech recognition and machine translation.
2. Description of the Related Art
Conventional methods exist for the above-mentioned topic identification problem. However, prior to the present invention, no suitable method has been employed, even though such a method can be useful in various off-line tasks such as textual data mining, automatic speech recognition, and machine translation. This problem requires the solution of a segmentation task that is of independent interest.
There exist some methods for dealing with the problem of text segmentation. In general, the conventional approaches fall into two classes:
1) Content-based methods, which look at topical information such as n-grams or IR similarity measures; and
2) Structure or discourse-based methods, which attempt to find features that characterize story openings and closings.
Several approaches to segmentation are described in Jonathan P. Yamron, "Topic Detection and Tracking Segmentation Task", Proceedings of the Topic Detection and Tracking Workshop, University of Maryland, October 1997. This paper describes a content-based approach that exploits the analogy to speech recognition, allowing segmentation to be treated as a Hidden Markov Model (HMM) process.
More precisely, in this approach the following concepts are used: 1) stories are interpreted as instances of hidden underlying topics; 2) the text stream is modeled as a sequence of these topics, in the same way that an acoustic stream is modeled as a sequence of words or phonemes; and 3) topics are modeled as simple unigram distributions.
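The third concept above, topics modeled as simple unigram distributions, can be illustrated with a minimal sketch. The function names, the add-one smoothing constant `alpha`, and the toy vocabulary are assumptions made for illustration only and are not taken from the cited paper.

```python
import math
from collections import Counter

def train_unigram(texts, vocab, alpha=1.0):
    """Estimate a smoothed unigram distribution over `vocab` (alpha is an assumed smoothing constant)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def log_likelihood(words, model):
    """Sum of log-probabilities of the in-vocabulary words under one topic model."""
    return sum(math.log(model[w]) for w in words if w in model)

def most_likely_topic(words, models):
    """Return the name of the topic whose unigram model scores the words highest."""
    return max(models, key=lambda name: log_likelihood(words, models[name]))

# Hypothetical toy data: two topics over a four-word vocabulary.
vocab = ["goal", "team", "stock", "bank"]
models = {
    "sports": train_unigram(["goal team goal", "team goal"], vocab),
    "finance": train_unigram(["stock bank stock", "bank stock"], vocab),
}
print(most_likely_topic(["stock", "bank", "team"], models))
```

In an HMM-style segmenter these per-topic scores would play the role of emission probabilities, with the topic sequence as the hidden state sequence.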
There exist several approaches for topic detection that have been described in a workshop (e.g., see DARPA, Broadcast News Translation and Understanding Workshop, Feb. 8-11, 1998). Some of them (e.g., "Japanese Broadcast News Transcription and Topic Detection", Furui, et al., in DARPA, Broadcast News Translation and Understanding Workshop, Feb. 8-11, 1998) require all words in an article to be presented in order to identify a topic of the article. A typical approach for topic identification is to use key words for a topic and to count frequencies of key words to identify a topic (see, for example, Furui, et al., cited above).
Recently, a method for realtime topic detection that is based on a likelihood ratio was described in "Real time detection of textual topical changes and topic identification via likelihood based methods", Kanevsky, et al., commonly-assigned U.S. patent application Ser. No. 09/124,075, filed on Jul. 29, 1998, incorporated herein by reference.
However, the above-mentioned methods have not been very successful in detection of the topical changes present in the data.
For example, model-based segmentation and metric-based segmentation rely on thresholding of measurements, which lacks stability and robustness. In addition, model-based segmentation does not generalize to unseen textual features. Concerning textual segmentation via hierarchical clustering, this approach is problematic in that it is often difficult to determine the number of clusters of words to be used in the initial phase.
All of these methods lead to a relatively high segmentation error rate and, as a consequence, lead to confusing or confusable topic labeling. There are no descriptions of how confusability in topic identification could be resolved when topic labeling is needed for such application tasks as text mining, or for improving a language model in off-line automatic speech recognition decoding or machine translation.
Concerning known topical identification methods, one of their deficiencies is that they are not suitable for realtime tasks since they require all data to be presented.
Another deficiency is their reliance on several key words for topic detection. This makes realtime topic detection difficult since key words are not necessarily present at the onset of the topic. Thus, the sample must be processed to near its conclusion before a topic detection is made possible.
Yet another problem with "key words" is that a different topic affects not only the frequencies of key words but also the frequencies of other (non-key) words. Exclusive use of key words does not allow one to measure the contribution of other words in topic detection.
Concerning the "cumulative sum" (CUSUM)-based methods that are described in the above-mentioned U.S. patent application Ser. No. 09/124,075, since these methods are realtime-based, they use relatively short segments to produce probability scores to establish changes in a likelihood ratio. These methods also must use various stopping criteria in order to abandon a current line of segmentation and identification. This can lead to detecting topic changes too late or too early.
Another problem with existing methods is that they tend to be extremely computing-intensive, resulting in an extremely high burden on the supporting hardware.
In view of the foregoing and other problems, disadvantages, and drawbacks of the conventional methods, an object of this invention is to provide an off-line segmentation of textual data that uses change-point methods.
Another object of the present invention is to perform off-line topic identification of textual data.
Yet another object of the present invention is to provide improved language modeling for off-line automatic speech decoding and machine translation.
In a first aspect of the invention, a system (and method) for off-line detection of textual topical changes includes at least one central processing unit (CPU), at least one memory coupled to the at least one CPU, a network connectable to the at least one CPU, and a database, stored on the at least one memory, containing a plurality of textual data sets of topics. The CPU executes first and second processes in forward and reverse directions, respectively, for extracting a segment having a predetermined size from a text, computing likelihood scores of the text in the segment for each topic, computing likelihood ratios, comparing them to a threshold, and determining whether to declare a change point at the current last word in the window.
In a second aspect, a method of detecting topical changes in a textual segment includes evaluating text probabilities under each topic of a plurality of topics, and selecting a new topic when one of the text probabilities becomes larger than others of the text probabilities, wherein the topic detection is performed off-line.
In a third aspect, a storage medium is provided storing the inventive method.
The present invention solves the problem of detecting topical changes via application of "cumulative sum" (CUSUM)-based methods. The basic idea of the topic identification procedure is to evaluate text probabilities under each topic, and then to select a new topic when one of those probabilities becomes significantly larger than the others.
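As an illustration of the general CUSUM idea only, the following minimal sketch accumulates a word-by-word log-likelihood ratio between a current topic and one alternative topic and raises an alarm when the sum exceeds a threshold. The function name, the `threshold` and `floor` values, and the two toy unigram models are assumptions for illustration and are not specified in this description.

```python
import math

def cusum_change_point(words, p_current, p_alt, threshold=3.0, floor=1e-3):
    """One-sided CUSUM on the log-likelihood ratio of p_alt versus p_current.

    Returns (alarm_index, onset_index), or None if no change is declared.
    `threshold` and `floor` (probability for unseen words) are assumed values.
    """
    s = 0.0
    onset = 0  # index just after the last position where the statistic was zero
    for i, w in enumerate(words):
        llr = math.log(p_alt.get(w, floor) / p_current.get(w, floor))
        s = max(0.0, s + llr)  # the statistic resets to zero instead of going negative
        if s == 0.0:
            onset = i + 1
        elif s > threshold:
            return i, onset
    return None

# Hypothetical unigram models for two topics.
sports = {"goal": 0.5, "team": 0.4, "stock": 0.05, "bank": 0.05}
finance = {"goal": 0.05, "team": 0.05, "stock": 0.5, "bank": 0.4}
words = ["goal", "team", "goal", "stock", "bank", "stock"]
print(cusum_change_point(words, sports, finance))  # (4, 3)
```

Here the alarm is raised at index 4, one word after the estimated onset at index 3, which illustrates the detection delay inherent in thresholded sequential tests.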
Since the topic detection is performed off-line, the inventive method can be enhanced by producing several different topic labels using different labeling strategies. One such labeling strategy is to mark topics while moving from the end of a text to its beginning. The special topic labeling is then chosen by evaluating the evidence and factors that lead to the different topic labelings. This special topic labeling can be applied to produce new topic scores that are needed for improving a language model in off-line automatic speech recognition decoding or machine translation.
Thus, the present invention can perform off-time topic detection by performing several steps. That is, the steps include segmentation of textual data into "homogeneous" segments, topic (event) identification of a current segment using different time directions (e.g., moving from the beginning of a text to the end of it and vice versa), and estimating probability scores for topics using marks that were obtained via these different labeling procedures.
More specifically, the invention uses CUSUM-based methods for detecting change-points in textual data and estimating probabilities of sample distribution for topic identification.
Hence, the basic approach of the invention is to apply change-point detection methods for detection of "homogeneous" segments of textual data while moving in two different "time" directions: from the beginning of a text to its end and vice versa. This enables identifying "hidden" regularities of textual components that are obtained for each "time" direction. If these regularities coincide for both directions, then they are used for topic labeling. Otherwise, they are used to build a mathematical model that reflects confusability in these regularities.
Hereinbelow is described how labeling is performed for each of a plurality (e.g., two) of directions. Then, it is described how resolution of confusable decisions that were obtained for different "time" directions is performed.
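The two-direction idea can be illustrated with a deliberately simplified sketch. It labels each word position by the most likely topic of a short window read in each direction, keeps the label where both directions agree, and marks the position as confusable (`None`) where they disagree. The windowed labeling, the window `size`, the `floor` probability, and all names are assumptions for illustration; the confusability model itself is not sketched here.

```python
import math

def best_topic(window, topics, floor=1e-3):
    """Most likely topic for a window of words under unigram models (floor is assumed)."""
    def score(model):
        return sum(math.log(model.get(w, floor)) for w in window)
    return max(topics, key=lambda name: score(topics[name]))

def directional_labels(words, topics, size=2):
    """Label position i from the window of up to `size` words ending at i."""
    return [best_topic(words[max(0, i - size + 1): i + 1], topics)
            for i in range(len(words))]

def reconcile(words, topics, size=2):
    """Keep labels on which the forward and backward passes agree; None marks confusable positions."""
    forward = directional_labels(words, topics, size)
    # Run the same pass on the reversed text, then map labels back to original order.
    backward = directional_labels(words[::-1], topics, size)[::-1]
    return [f if f == b else None for f, b in zip(forward, backward)]

# Hypothetical unigram models for two topics.
topics = {
    "sports": {"goal": 0.5, "team": 0.4, "stock": 0.05, "bank": 0.05},
    "finance": {"goal": 0.05, "team": 0.05, "stock": 0.5, "bank": 0.4},
}
print(reconcile(["goal", "team", "stock", "bank"], topics))
# ['sports', None, 'finance', 'finance']
```

In this toy run the word next to the topic boundary receives different labels from the two passes, which is the kind of disagreement that would be handed to a confusability model.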
Generally, a change-point strategy can be implemented using different statistical methods. In the present invention, a change-point method is realized preferably using a CUSUM technique.
With the unique and unobvious features of the present invention, off-line segmentation of textual data that uses change-point methods is employed and off-line topic identification of textual data can be performed, such that topic detection can be achieved rapidly and such that improved language modeling for off-line automatic speech decoding and machine translation results.
Another advantage is that the inventive method is capable of using multi-CPU machines efficiently (e.g., forward and backward processes can be performed in parallel).
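Because the forward and backward passes do not depend on each other, they can be submitted as two independent tasks. The following sketch assumes a simple two-topic CUSUM scan and uses Python's standard `concurrent.futures` module; the function names, threshold, and toy models are hypothetical. A thread pool is used here for portability; on a multi-CPU machine a `ProcessPoolExecutor` could be passed as `executor_cls` instead to run the two scans on separate processors.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def scan(words, p_current, p_alt, threshold=3.0, floor=1e-3):
    """One-sided CUSUM scan; returns the alarm index or None (threshold and floor are assumed)."""
    s = 0.0
    for i, w in enumerate(words):
        s = max(0.0, s + math.log(p_alt.get(w, floor) / p_current.get(w, floor)))
        if s > threshold:
            return i
    return None

def scan_both_directions(words, p_a, p_b, executor_cls=ThreadPoolExecutor):
    """Run the forward (A to B) and backward (B to A) scans as two concurrent tasks."""
    with executor_cls(max_workers=2) as pool:
        fwd = pool.submit(scan, words, p_a, p_b)
        bwd = pool.submit(scan, words[::-1], p_b, p_a)
        f, b = fwd.result(), bwd.result()
    # Convert the backward alarm index to a position in the original word order.
    b_orig = None if b is None else len(words) - 1 - b
    return f, b_orig

# Hypothetical unigram models for two topics.
sports = {"goal": 0.5, "team": 0.4, "stock": 0.05, "bank": 0.05}
finance = {"goal": 0.05, "team": 0.05, "stock": 0.5, "bank": 0.4}
words = ["goal", "team", "goal", "stock", "bank", "stock"]
print(scan_both_directions(words, sports, finance))  # (4, 1)
```

The forward scan alarms after the boundary and the backward scan alarms before it, so the two results bracket the true change between positions 2 and 3.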