The invention relates to segmenting topics in a stream of text.
Segmenting text involves identifying portions or segments of the text that are related to different topics. For example, people are adept at skimming through a newspaper and quickly picking out only the articles that are of interest to them. In this way, it is possible to read only a small fraction of the total text contained in the newspaper. It is not feasible, however, for someone to skim through the hundreds of newspapers, written in dozens of languages, that might contain articles of interest. Furthermore, it is very difficult to skim radio and TV broadcasts, even if they have already been recorded. In short, it is very difficult for people to analyze the full range of information that is potentially available to them.
Given a stream of text in which word or sentence boundaries have been identified, segmentation involves identifying points within the text at which topic transitions occur. One approach to segmentation involves querying a database. In particular, each sentence of the stream of text is used as a query against the database. Whether consecutive sentences are related to the same topic is determined based on the relatedness of the query results for each sentence. When the query results differ sufficiently, a topic boundary is inserted between the two sentences.
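The query-based approach may be illustrated with a minimal sketch. The sketch rests on assumptions that are not part of the description above: the database is modeled as an in-memory mapping from document identifiers to word sets, a query returns the documents sharing the most words with the sentence, relatedness of two result sets is measured by Jaccard overlap, and the threshold of 0.5 is arbitrary. The names query, jaccard, and segment_by_query are hypothetical.

```python
from typing import Dict, List, Set


def query(database: Dict[str, Set[str]], sentence: str, top_k: int = 2) -> Set[str]:
    """Return ids of up to top_k documents sharing the most words with the sentence.

    Assumption: a toy word-overlap ranking stands in for a real database query.
    """
    words = set(sentence.lower().split())
    hits = []
    for doc_id, doc_words in database.items():
        overlap = len(words & doc_words)
        if overlap > 0:
            hits.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for determinism.
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return {doc_id for _, doc_id in hits[:top_k]}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Relatedness of two result sets (assumed measure: Jaccard overlap)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def segment_by_query(
    database: Dict[str, Set[str]], sentences: List[str], threshold: float = 0.5
) -> List[int]:
    """Return each index i where a topic boundary falls just before sentences[i]."""
    results = [query(database, s) for s in sentences]
    return [
        i
        for i in range(1, len(results))
        if jaccard(results[i - 1], results[i]) < threshold
    ]
```

In this sketch a boundary is reported only when the result sets of two consecutive sentences overlap by less than the threshold, so the threshold controls how readily the text is split.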
Segmentation also may be performed by looking for features that occur at segment boundaries (e.g., proper names often appear near the beginning of a segment, while pronouns appear later) and by monitoring for the occurrence of word pairs. Associated with each word pair is a probability that, given the occurrence of the first word of the pair in a sequence of text, the second word of the pair appears within a specified distance of the first word. Sets of word pairs and associated probabilities are created from sets of training text dealing with topics of interest. Other sequences of text can then be segmented using this topic information. A contiguous block of text may be assigned the topic whose word pair probabilities best match the text block's word distribution.
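The word pair approach may likewise be illustrated with a minimal sketch. Several choices here are assumptions rather than details given above: the probability for a pair is estimated as the fraction of occurrences of the first word that are followed by the second word within a fixed window, only words following the first word are considered, and a block is scored by summing the probabilities of the pairs it contains. The names train_pair_probs, score_block, and assign_topic are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

PairProbs = Dict[Tuple[str, str], float]


def train_pair_probs(tokens: List[str], window: int = 3) -> PairProbs:
    """Estimate P(second appears within `window` words after first | first occurs)."""
    first_counts: Dict[str, int] = defaultdict(int)
    pair_counts: Dict[Tuple[str, str], int] = defaultdict(int)
    for i, first in enumerate(tokens):
        first_counts[first] += 1
        # Count each distinct follower once per occurrence of `first`,
        # so every estimate stays between 0 and 1.
        for second in set(tokens[i + 1 : i + 1 + window]):
            pair_counts[(first, second)] += 1
    return {pair: n / first_counts[pair[0]] for pair, n in pair_counts.items()}


def score_block(tokens: List[str], probs: PairProbs, window: int = 3) -> float:
    """Sum the trained probabilities of the word pairs observed in the block."""
    total = 0.0
    for i, first in enumerate(tokens):
        for second in set(tokens[i + 1 : i + 1 + window]):
            total += probs.get((first, second), 0.0)
    return total


def assign_topic(
    tokens: List[str], topic_probs: Dict[str, PairProbs], window: int = 3
) -> str:
    """Return the topic whose pair probabilities best match the block."""
    # Sorting the topic names makes ties resolve deterministically.
    return max(
        sorted(topic_probs),
        key=lambda topic: score_block(tokens, topic_probs[topic], window),
    )
```

Under these assumptions, one table of pair probabilities is trained per topic from that topic's training text, and each contiguous block of new text is assigned the topic with the highest score.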