The rapidly growing amount of on-line information makes it necessary to support browsing of information in a way that reveals the underlying conceptual structure. This complements query-driven approaches that focus on content-specific queries for information retrieval. The existence of both manual and automatic text categorization schemes on the World Wide Web provides compelling evidence that such schemes are both useful and important. Advances in storage, computing power, and bandwidth have resulted in increasing deployment of streaming video in applications such as workplace training, distance education, entertainment, and news. Despite the connectivity offered by the Web, the primary reason that audio-visual data is not yet ubiquitous is the set of challenges encountered in dealing with the unstructured, space-time nature of audio and video. Therefore, cataloguing and indexing of audio and video has been universally accepted as a step towards enabling intelligent navigation, search, browsing and viewing of speech transcripts and video.
Speech recognition systems output the most probable decoding of the acoustic signal as the recognition output, but keep multiple hypotheses that are considered during the recognition process. The multiple hypotheses at each time, often known as N-best lists, provide additional information that retrieval systems can exploit. Recognition systems generally have no means to distinguish between correct and incorrect transcriptions, and a word-lattice representation (an acyclic directed graph) is often used to consider all hypothesized word sequences in context. The path with the highest confidence level is generally output as the final recognized result, often known as the 1-best word list.
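The selection of a 1-best word list from a word lattice can be illustrated with a minimal sketch. The edge format, the use of log-probabilities as the confidence measure, and the function name `one_best` are assumptions made for illustration only; they do not describe any particular recognition engine.

```python
def one_best(edges, start, end):
    """Return the highest-scoring word sequence through a word lattice.

    Assumptions (illustrative only): each edge is a tuple
    (source_node, target_node, word, log_probability), the lattice is acyclic,
    and nodes are integers numbered so that source_node < target_node.
    """
    # best[node] = (total log-probability, words on the best path to node)
    best = {start: (0.0, [])}
    # Sorting by source node visits the edges in topological order.
    for src, dst, word, log_prob in sorted(edges, key=lambda e: e[0]):
        if src not in best:
            continue  # node not reachable from the start node
        score = best[src][0] + log_prob
        if dst not in best or score > best[dst][0]:
            best[dst] = (score, best[src][1] + [word])
    return best[end][1]


# Hypothetical lattice with competing hypotheses for the same audio.
lattice = [
    (0, 1, "recognize", -0.4),
    (0, 2, "wreck", -1.2),
    (1, 3, "speech", -0.3),
    (1, 3, "peach", -1.5),
    (2, 3, "beach", -0.8),
]
words = one_best(lattice, start=0, end=3)
```

The remaining paths through the lattice correspond to the alternative hypotheses that an N-best list would retain.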
Speech recognition accuracy is typically represented as Word Error Rate (WER), defined to be the sum of word insertion, substitution and deletion errors divided by the total number of words in the reference transcript. It has been shown that WER can vary between 8-15% and 70-85% depending on the type of speech data and tuning of the recognition engine. The 8-15% error rates typically correspond to standard speech evaluation data and the 70-85% error rates correspond to “real-world” data such as one-hour documentaries and commercials. Retrieval on transcripts with WER of 8-30% has been reported to yield an average precision of 0.6-0.7. However, for real-world audio with high WER of 70-80%, the precision and recall have been reported to drop dramatically to 0.17 and 0.26, respectively.
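As an illustration of the WER measure, the following minimal sketch computes it by word-level edit distance. It assumes the conventional normalization by the number of words in the reference transcript and simple whitespace tokenization; the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length.

    Assumption: both transcripts are tokenized on whitespace, and the error
    count is the minimum word-level edit distance between them.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = minimum edits turning the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions are counted as errors, a WER computed this way can exceed 100% when the hypothesis is much longer than the reference.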
The National Institute of Standards and Technology (NIST)-sponsored Text Retrieval Conference (TREC) has implemented a Spoken Document Retrieval track to search and retrieve excerpts from spoken audio recordings using a combination of Automatic Speech Recognition and information retrieval technologies. The TREC Spoken Document Retrieval task has conducted a set of benchmark evaluations and has demonstrated that the technology can be applied successfully to query audio collections. The best retrieval results report a precision between 0.6 and 0.7, and achieve 82-85% of the overall performance of a full-text retrieval system.
Currently, there are three primary forums where the automatic assignment of topics to unstructured documents has been extensively researched: Statistical Machine Learning, Topic Distillation on the Web, and the DARPA-sponsored Topic Detection and Tracking (TDT) track. Statistical Machine Learning literature refers to this task as text categorization, and partitions it into supervised and unsupervised methods. Supervised text categorization refers to the automatic assignment of topics to text collections when sample training data is available for each topic in a predefined topic set. Unsupervised text categorization methods do not use a predefined topic set with sample training data; instead, new documents are assigned topics following an unsupervised training phase. Query-driven topic identification, often referred to as Topic Distillation, has received a lot of attention with the ubiquity of the Web. These approaches are based on connectivity analysis in a hyper-linked environment, together with content analysis, to generate quality documents related to the topic of the query.
Topic Detection and Tracking (TDT) finds new events in a stream of broadcast news stories. The TDT project builds on and extends the technologies of Automatic Speech Recognition and Document Retrieval with three major tasks: (1) segmenting a stream of data into topically cohesive stories, with the data comprising news wire and textual transcriptions (manual, automatic, or both) of audio; (2) detecting those news stories that are the first to discuss a new event occurring in the news; and (3) given a small number of sample news stories about an event, finding all following stories in the stream.
In this context, a topic is defined to be “a seminal event or activity, along with all directly related events and activities”. The segmentation task is performed on several hundred hours of audio using either the audio signal or the textual transcriptions of the audio signal. The tracking task associates incoming stories with target topics defined by a set of training stories that discuss the topic.
In the early stages of TDT development, work on text segmentation was based on semantic word networks, vector space techniques from information retrieval, and decision tree induction algorithms. Since then, several new techniques have been successfully applied to text segmentation. One such approach was based on treating topic transitions in a text stream as being analogous to speech in an acoustic stream. Classic Hidden Markov Model (HMM) techniques were applied in which the hidden states are the topics and the observations are words or sentences.
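The HMM formulation can be illustrated with a minimal Viterbi decoding sketch in which hidden states are topics and observations are words. The two topics, all probability values, the floor probability for unseen words, and the function names are hypothetical and chosen only to make the example self-contained; they are not taken from any TDT system.

```python
import math


def viterbi_topics(words, topics, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most likely topic label for each word under a simple HMM.

    Assumption: words missing from a topic's emission table receive the small
    floor probability `unseen` rather than zero.
    """
    if not words:
        return []

    def emit(topic, word):
        return math.log(emit_p[topic].get(word, unseen))

    # best[t] = (log-probability, path) of the best sequence ending in topic t
    best = {t: (math.log(start_p[t]) + emit(t, words[0]), [t]) for t in topics}
    for word in words[1:]:
        step = {}
        for t in topics:
            score, path = max(
                ((best[p][0] + math.log(trans_p[p][t]), best[p][1]) for p in topics),
                key=lambda item: item[0],
            )
            step[t] = (score + emit(t, word), path + [t])
        best = step
    return max(best.values(), key=lambda item: item[0])[1]


def boundaries(labels):
    """Positions where the decoded topic changes, i.e. segment boundaries."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


# Hypothetical two-topic model: topics tend to persist from word to word.
topics = ["finance", "sports"]
start_p = {"finance": 0.5, "sports": 0.5}
trans_p = {
    "finance": {"finance": 0.9, "sports": 0.1},
    "sports": {"finance": 0.1, "sports": 0.9},
}
emit_p = {
    "finance": {"stock": 0.5, "market": 0.5},
    "sports": {"goal": 0.5, "team": 0.5},
}
labels = viterbi_topics(["stock", "market", "goal", "team"],
                        topics, start_p, trans_p, emit_p)
cuts = boundaries(labels)
```

In this sketch a segment boundary is simply a position at which the decoded topic sequence changes state.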
A second approach has been to use content-based Local Context Analysis (LCA) where each sentence in the text is run as a query and the top 100 concepts are returned. Each sentence is indexed using offsets to encode positions of the LCA concepts and these offsets are used as a measure of vocabulary shifts over time.
A third approach has been to combine the evidence from content-based features derived from language models with lexical features that extract information about the local linguistic structure. A statistical framework called feature induction is used to construct an exponential model which assigns to each position in the text a probability that a segment boundary belongs at that position.
In general, clustering methods such as agglomerative clustering have been used for the segmentation task. Initially, a fixed length window is considered to be a cluster, and a similarity score is computed for all pairs of neighboring clusters. If the most similar pair of clusters meets a threshold, the two clusters are combined to form a new cluster. This process is repeated until no pairs of neighbors meet the similarity threshold.
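The merging procedure described above can be sketched as follows. The use of cosine similarity over word counts, the window length, the threshold value, and the function names are assumptions made for illustration; the text does not specify a particular similarity measure.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(words, window=4, threshold=0.3):
    """Merge neighboring fixed-length windows until no pair is similar enough.

    Returns a list of (start, end) word offsets, one per resulting segment.
    """
    # Each initial cluster is one fixed-length window: (start, end, word counts).
    clusters = [
        (s, min(s + window, len(words)), Counter(words[s:s + window]))
        for s in range(0, len(words), window)
    ]
    while len(clusters) > 1:
        # Score only adjacent pairs so that segments stay contiguous.
        scores = [
            cosine(clusters[i][2], clusters[i + 1][2])
            for i in range(len(clusters) - 1)
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        if scores[best] < threshold:
            break  # no neighboring pair meets the similarity threshold
        left, right = clusters[best], clusters[best + 1]
        clusters[best:best + 2] = [(left[0], right[1], left[2] + right[2])]
    return [(start, end) for start, end, _ in clusters]


text = ("stock market stock price market price stock market "
        "goal team goal match team match goal team").split()
segments = segment(text, window=4, threshold=0.3)
```

Restricting the comparison to neighboring clusters is what distinguishes this use of agglomerative clustering for segmentation from general document clustering, where any two clusters may be merged.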
Applications that incorporate some form of automatic video categorization based on an analysis of the speech transcripts have been focused on broadcast news content. The Informedia Digital Video Library (a research project initiative at Carnegie Mellon University funded by the NSF, DARPA, NASA and others) includes a supervised topic-labeling component where a kNN classification algorithm is used to categorize incoming stories into one of 3000 topic categories. An HMM approach has been shown to be better than a naive Bayesian approach for the classification of news stories into a static set.
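The general idea of kNN topic labeling can be illustrated with a minimal sketch. It is not the Informedia implementation: the bag-of-words representation, the cosine similarity, the majority vote, the sample stories, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_topic(story, labeled_stories, k=3):
    """Label a story with the majority topic among its k most similar examples.

    labeled_stories is a list of (text, topic) pairs used as training data.
    """
    query = Counter(story.lower().split())
    ranked = sorted(
        labeled_stories,
        key=lambda item: cosine(query, Counter(item[0].lower().split())),
        reverse=True,
    )
    votes = Counter(topic for _, topic in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training stories for two topic categories.
training = [
    ("stocks fell as the market closed", "finance"),
    ("the market rallied on bank earnings", "finance"),
    ("the team scored a late goal", "sports"),
    ("the coach praised the team", "sports"),
]
topic = knn_topic("bank stocks and the market", training, k=3)
```

With a large topic set, the same procedure applies unchanged; only the number of labeled training stories grows.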
Much of the research literature addresses topic discovery for large document collections. The problem addressed by this invention most closely resembles the TDT segmentation task. However, there are several important differences that are relevant to the problem domain addressed herein. TDT is fed with a relatively homogeneous corpus of broadcast news audio, and therefore, the notion of a ‘story’ and the associated segment is relatively well defined.
In contrast, the present invention addresses a variety of distributed learning and corporate training videos or DVDs, in which the duration of audio ranges between 10 and 90 minutes each. Segmentation based on cohesion of topics can be subjective, and is not as unambiguously defined as in news stories. Initial TDT results on imperfect transcripts obtained from speech recognition have not been as good as those on carefully transcribed broadcast news text. This is particularly true when the Word Error Rate (WER) of the speech recognition varies from 35% to 60%, depending on the fidelity of the audio, the background noise, and whether the speaker is professional or amateur.