Electronic news sources, including radio, television, cable, and the Internet, have vastly increased the amount of information available to the public. Government agencies and businesses depend on this information to make informed decisions. To access a particular type of information quickly, it is important that the information be indexed by topic or subject. Topics of interest may be broadly defined, for example the U.S. President, or more narrowly defined, for example the U.S. President's trip to Russia. Manual sorting methods require excessive time and support only a limited number of topics of limited scope.
Automatic methods have been developed that transcribe and index stories and relate each story to one or more topics. These techniques typically model a topic by counting the number of times each word is used within stories on a known topic. To classify a new story, the relative frequencies, under each topic, of all the words in that story are multiplied together. The topic with the highest product is selected as the "correct" topic. A limitation of such methods is that most words in a story are not "about" that topic, but are just general words. In addition, real stories have several topics, and these prior art methods assume that each word is related to all the various topics. In particular, classifying all words in a story creates a limitation because a keyword (a word that is related to a topic) for one topic becomes, in effect, negative evidence when the keyword is scored (albeit with low probability) against another legitimate topic. The result is that these prior art techniques have limited ability to discriminate among the various topics which, in turn, limits the accuracy with which stories can be indexed to any particular topic. One example of a prior art technique for automatic story indexing against subject topics is described in a paper entitled "Application of Large Vocabulary Continuous Speech Recognition to Topic and Speaker Identification Using Telephone Speech," by Larry Gillick, et al., Proceedings ICASSP-93, Vol. II, pages 471-474, 1993.
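The prior art scoring described above can be illustrated with a minimal sketch. It is not taken from the cited paper; the function names, the toy training stories, the add-one smoothing (used so that an unseen word does not force a product of zero), and the use of summed logarithms in place of a literal product (the ranking of topics is the same) are all assumptions made for illustration.

```python
from collections import Counter
import math


def train_topic_models(labeled_stories):
    """Count how often each word occurs in stories of each known topic.

    labeled_stories: iterable of (topic, list_of_words) pairs (hypothetical format).
    """
    counts = {}
    for topic, words in labeled_stories:
        counts.setdefault(topic, Counter()).update(words)
    return counts


def classify(story_words, counts, alpha=1.0):
    """Score every topic using every word of the story and pick the best.

    The score is the sum of log relative frequencies, which ranks topics
    exactly as the product of relative frequencies would. `alpha` is an
    assumed additive smoothing constant, not part of the described method.
    """
    vocab = set(story_words)
    for topic_counts in counts.values():
        vocab.update(topic_counts)
    vocab_size = len(vocab)

    scores = {}
    for topic, topic_counts in counts.items():
        total = sum(topic_counts.values())
        scores[topic] = sum(
            math.log((topic_counts[w] + alpha) / (total + alpha * vocab_size))
            for w in story_words
        )
    best = max(scores, key=scores.get)
    return best, scores


# Toy data, invented for this example only.
models = train_topic_models([
    ("politics", "the president signed the bill".split()),
    ("sports", "the team won the game".split()),
])
best, scores = classify("the president signed".split(), models)
print(best, scores)
```

Note that every word, including a general word such as "the", contributes to the score of every topic, and that "president" lowers the score of the sports topic relative to politics. This is the behavior the passage above identifies as a limitation when a story legitimately covers more than one topic.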
It is an object of the present invention to provide a method that acknowledges that there are generally several topics within any given story and that any particular word need not be related to all the topics, while providing for multiple topics and their related words.
It is another related object of the present invention to recognize that many, if not most, words used in a story are not related to any topic but are words used in a general sense.
It is a further object of the present invention to provide a method in which there is reduced overlap between the various topics within any one story.
It is yet another object of the present invention to provide a method of improved accuracy of topic identification.
It is yet another object of the present invention to improve topic identification by automatically determining which keywords relate to which topics, and then using those keywords as positive evidence for their respective topics, but not as negative evidence for other topics.