Multimedia streams (with related audio), long text documents and long Web pages often cover several topics. For example, a radio program that lasts two hours usually contains fifteen or so separate stories. Often the only way to decide where in the document/audio or text stream a break point occurs is by human intervention. Human listeners/readers have to decide by hand where these break points occur and what they talk about. Automating this task can be beneficial especially for providing easy access to indexed multimedia documents. Several techniques exist that can perform this task but they all have shortcomings.
There are two well documented statistical methods for this task. The first one is based on exponential models. The second is based on Hidden Markov Models.
Exponential models (D. Beeferman, A. Berger and J. Lafferty, “Text segmentation using exponential models” in Proc. Empirical Methods in Natural Language Processing 2 (AAAI), 1997, Providence, R.I.) are built by combining weighted binary features. The features are binary because they provide a 1.0 score if they are present or a 0.0 score if not present. A learning procedure (typically a greedy search) finds how to weight each of these features to minimize the cross entropy between segmented training data and the exponential model. These features are typically cue-word features. Cue-word features detect the presence or absence of specific words that tend to be used near the segment boundaries. For example, in many broadcast programs, words or sentences like “and now the weather” or “reporting from” tend to indicate a transition to a next topic.
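The role of weighted binary cue-word features can be illustrated with a minimal Python sketch. It is not the cited authors' implementation: it assumes a two-class exponential (logistic) model, and the cue phrases, weights, bias and the name `boundary_probability` are hypothetical values chosen only for illustration. In practice the weights would be learned from segmented training data.

```python
import math

# Hypothetical cue phrases and weights (assumptions for illustration only).
CUE_WEIGHTS = {"and now the weather": 2.0, "reporting from": 1.5}
BIAS = -3.0  # assumed log-odds of a boundary when no cue fires


def boundary_probability(context: str) -> float:
    """Probability of a topic boundary under a two-class exponential model.

    Each feature is binary: it contributes its weight if the cue phrase is
    present in the context (score 1.0) and nothing otherwise (score 0.0).
    """
    text = context.lower()
    score = BIAS + sum(w for cue, w in CUE_WEIGHTS.items() if cue in text)
    return 1.0 / (1.0 + math.exp(-score))
```

A context containing a cue phrase receives a higher boundary probability than one without, which is all the sketch is meant to show.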
While this approach is very successful, it has several drawbacks. First it requires segmented data where boundaries among topics are clearly marked; second it is extremely computationally demanding; and finally among the features used are text formatting features such as paragraph indenting, etc., which may not be available on text produced by a speech recognition system. Furthermore, the cue-word features are not very tolerant of speech recognition processing since often ASR (automated speech recognition) systems make mistakes and cue-words might not be detected. A final drawback of exponential models is that they only provide segmentation information and no topic labels are assigned to segments.
The hidden Markov model (HMM) approach was pioneered by Dragon Systems in the late 1990's (P. van Mulbregt et al., “Text Segmentation and Topic Tracking on Broadcast News via a Hidden Markov Model Approach,” International Conference of Spoken Language Processing 2000, Sydney, Australia). It models the statistics of textual sources with a naive Bayes method (word order is not considered) which is probabilistically generated by a hidden state variable, the topic. The parameters of the HMM are the topic probability P(w|z) (w represents a word and z represents a topic) and the transition probability from one topic/state to another topic/state P(z|z′). P(w|z) is trained by building smoothed unigram language models from a marked corpus of documents. The transition probability matrix P(z|z′) is computed by counting transitions from one topic to another in the labeled training corpora. During testing, Viterbi decoding is used and, as in any HMM system, topic breaks occur when the value of the state changes from one topic to another.
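The decoding step described above can be sketched in Python as follows. This is a generic Viterbi decoder over topic states, not the Dragon Systems implementation. The function names `viterbi` and `topic_breaks`, the list-of-lists layout of the inputs, and the assumption of a non-empty sequence are all illustrative choices.

```python
def viterbi(log_emissions, log_trans, log_init):
    """Return the most likely topic sequence.

    log_emissions[t][z] : log P(document t | topic z)
    log_trans[p][z]     : log P(z | p), transition from topic p to topic z
    log_init[z]         : log prior of starting in topic z
    Assumes at least one document in the sequence.
    """
    n_topics = len(log_init)
    best = [log_init[z] + log_emissions[0][z] for z in range(n_topics)]
    back = []  # back[t-1][z] = best predecessor of topic z at time t
    for t in range(1, len(log_emissions)):
        new_best, pointers = [], []
        for z in range(n_topics):
            prev = max(range(n_topics), key=lambda p: best[p] + log_trans[p][z])
            pointers.append(prev)
            new_best.append(best[prev] + log_trans[prev][z] + log_emissions[t][z])
        best = new_best
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_topics), key=lambda z: best[z])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def topic_breaks(path):
    """Positions t where the decoded topic differs from the one at t - 1."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]
```

Working in log probabilities avoids numerical underflow when many word probabilities are multiplied together.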
Each document is converted into a histogram (i.e., the number of occurrences of each word in the document is counted) and scored against the topic-based unigram. This score is computed by assuming total independence among words, i.e., P(document|z) = Π_i P(w_i|z), where the product runs over the words w_i of the document. Naturally, many of the words are not present in a document. For example, if the dictionary of words one uses consists of 60,000 unique words and the document has only 2,000 words, there will be at least 58,000 zeros in the unigram histogram. This sparsity in the document feature vector generates several problems and some sort of smoothing of the histogram P(w|z) is always needed.
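As a minimal sketch of this scoring, the Python below builds a smoothed unigram model per topic and scores a document histogram against it in log space. Additive (add-alpha) smoothing is an assumption made here for simplicity; the source does not say which smoothing the cited system used. The names `train_unigram` and `log_score` are hypothetical.

```python
import math
from collections import Counter


def train_unigram(docs, vocab, alpha=1.0):
    """Smoothed unigram model P(w|z) from a topic's training documents.

    Add-alpha smoothing (an assumed choice) gives every vocabulary word a
    nonzero probability, even if it never occurs in the training data.
    """
    counts = Counter(w for doc in docs for w in doc if w in vocab)
    total = sum(counts.values())
    denom = total + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}


def log_score(document, model):
    """log P(document|z) = sum over words of count(w) * log P(w|z).

    Words outside the model's vocabulary are ignored in this sketch.
    """
    hist = Counter(w for w in document if w in model)
    return sum(n * math.log(model[w]) for w, n in hist.items())
```

Only words that actually occur in the document contribute to the sum, so the many zero entries of the histogram cost nothing at scoring time.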
One problem with this framework is that it requires a training corpus which is segmented and categorized. Segmented corpora are easy to build or acquire but categorized corpora are sparse and expensive. To address this, Dragon Systems cluster their training using a simple k-means algorithm and a vector model of documents. Then, they manually label each segment with its associated cluster and train their HMM as described above.
The HMM is a good framework for breaking up streams of text into segments that are self-similar, i.e., cover a single topic, but the Dragon Systems implementation is rather heuristic. In separate steps and with different algorithms, they cluster their training data, build and smooth unigram language models and tune the penalty for topic transitions. All these steps require manual tuning and an expert to decide what parameters are appropriate.
The present invention provides computer method and apparatus for segmenting text streams into “documents” or self-similar segments, i.e., segments which cover a respective single topic. As used herein the term “document” is a collection of words that covers a single topic. There are databases with streams of documents where the boundaries between documents are known. Notice that there is no rule against two consecutive documents belonging to the same topic. Often these individual documents are called “segments”. There are also streams of text where the boundaries between topics are not known. In these situations Applicants arbitrarily break the stream of text into windows or pseudo-documents of L words (typically L=30). Each of these pseudo-documents is assumed to cover a single topic. The goal of the invention is to decide where there is a change of topic in the stream of pseudo-documents.
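Breaking an unsegmented stream into pseudo-documents of L words can be sketched in a few lines of Python. The function name `pseudo_documents` is hypothetical, and letting the final window be shorter than L is an assumption, since the text does not say how a trailing partial window is handled.

```python
def pseudo_documents(words, window=30):
    """Split a word stream into consecutive windows of `window` words.

    The last window may hold fewer words if the stream length is not a
    multiple of the window size (an assumed convention).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [words[i:i + window] for i in range(0, len(words), window)]
```

For example, a stream of 65 words with L=30 yields windows of 30, 30 and 5 words.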
In the preferred embodiment, computer apparatus for segmenting text streams comprises a probability member and a processing module. The probability member provides working probabilities of a group of words being of a topic selected from a plurality of predetermined topics. In particular, the probability member provides the probability of observing each of the topics and the probability of a document being of one of the predetermined topics. The probability member accounts for relationships between words.
The processing module receives an input text stream formed of a series of words. Next, using the probability member, the processing module determines the probability of certain words in the text stream being of a same topic. To that end, the processing module segments the text stream into single topic groupings of words (i.e., documents), where each grouping is of a respective single topic.
In accordance with one aspect of the present invention, the working probabilities of the probability member are formed from a set of known documents. A portion of the set of known documents is used for computing initial probabilities and a remaining portion of the set of known documents is used to measure segmentation performance of the initial probabilities and make adjustments to the initial probabilities to form the working probabilities.
Further, the processing module operates on working sub-series, and in particular overlapping sub-series, of words in the input text stream. Preferably, the determined probability of one working sub-series of words in the text stream being of a subject topic is maintained while the probabilities for other working sub-series of words in the text stream are being determined. The probability member, for each predetermined topic, determines the probability of the other working sub-series in the text stream being of the subject predetermined topic.
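One way to form overlapping sub-series is a sliding window whose step is smaller than its length. The Python below is only a sketch under that assumption; the text does not specify the amount of overlap, so the `step` parameter, its default value and the name `overlapping_windows` are hypothetical.

```python
def overlapping_windows(words, window=30, step=10):
    """Return (start, sub-series) pairs for a sliding window over `words`.

    Consecutive sub-series share `window - step` words when step < window.
    Sub-series near the end of the stream may be shorter than `window`.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [(s, words[s:s + window]) for s in range(0, len(words), step)]
```

Keeping the start offset with each sub-series makes it possible to store a probability already determined for one sub-series while the others are still being scored.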
In accordance with another aspect of the present invention, the input text stream may be from a speech recognition system (output), an initial multimedia or video source output which has been converted into a text stream, a speech to text dictation system, and the like.
In accordance with another aspect of the present invention, the processing module further forms an index of the formed segments of the text stream as a function of the determined topic per segment. Such an index enables one to search for a segment in the text stream using topics as search criteria. That is, the formed index cross references documents (formed segments of the text stream) according to respective topics determined for the documents/segments.
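Such a topic index could be represented, for instance, as a mapping from each topic to the word ranges of the segments assigned to it. The Python sketch below assumes segments arrive as hypothetical `(topic, start, end)` tuples; neither that layout nor the name `build_topic_index` comes from the text.

```python
from collections import defaultdict


def build_topic_index(segments):
    """Map each topic to the (start, end) word ranges of its segments.

    `segments` is an iterable of (topic, start, end) tuples, an assumed
    representation of the segmenter's output.
    """
    index = defaultdict(list)
    for topic, start, end in segments:
        index[topic].append((start, end))
    return dict(index)
```

Looking up a topic in the resulting dictionary returns every segment of the stream labeled with that topic, in the order the segments were produced.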