A topic discriminator for spoken data is used to classify the data into one of a set of known topics, or to discriminate data belonging to a known topic from data belonging to other topics. The topic discrimination is usually performed using only features extracted from the speech data itself. Applications similar, but not identical, to topic discrimination have been disclosed previously in the art, and have been designated by terms such as "gisting", "topic identification", and "automatic acquisition of language."
An example of a prior use of a speech topic discriminator is the classification of recordings of air-traffic-control dialogs by whether the flight is landing, taking off, or neither landing nor taking off, as described in Rohlicek, Ayuso et al. (1992) (J. R. Rohlicek and D. Ayuso, et al.; "Gisting Conversational Speech"; IEEE ICASSP; 1992; Volume II, pp. 113-116).
Implementing a topic discriminator generally involves a training cycle in which a human operator selects the topic categories of interest. Selected topics may be, for example, (1) the weather, (2) the arts, and (3) sports. As part of the training cycle, the operator also provides a set of recorded speech messages that exemplify each of the selected topic categories. In the above example, the operator would provide a set of recorded speech messages about the weather, a set of recorded speech messages about the arts, and a set of recorded speech messages about sports. The set of all the recorded speech messages used in training is generally known as a training corpus.
A training corpus is generally developed by recording speech samples of one or more people, for example by directing them to speak about a specific topic (e.g., the weather). A good training corpus typically contains speech messages recorded from a large number of people. A training corpus may contain written transcripts of the speech messages, acoustically recorded speech messages, or both.
Once a topic discriminator has been provided with a training corpus, the discriminator attempts to determine which of the preselected topics is the most likely subject matter of each speech message received. In keeping with the above example, when the topic discriminator is provided with an input speech message, it attempts to determine, based on the content of the message, whether the subject matter of that message is more similar to that of the training-corpus messages in one category than to that of the training-corpus messages in the other categories.
Several approaches to topic classification have been attempted in the past. The basic approach to the problem has been to treat topic classification as a text classification problem with the text being created by a speech recognizer. For example, Farrell, et al., (K. Farrell, R. J. Mammone and A. L. Gorin; "Adaptive Language Acquisition Using Incremental Learning"; IEEE ICASSP; 1993; Volume I; pp. 501-504) have investigated the pairing of spoken phone messages with desired "actions". The actions considered are the routing of messages to one of several departments of a retail store. This system is based on a one-layer neural network whose connection weights are related to the "association" between a word known to the system, with each word represented by a node at the input layer of the neural network, and a desired action, each action being represented by a node at the output layer. While it is assumed that all possible actions are known, the system has the capacity to interactively learn new vocabulary words as it is being used by a customer. Using acoustic similarity measures between words spoken and the system's current vocabulary, an unknown word can be identified in an incoming message. The new word is then added to the vocabulary through the creation of a new input node and its association with the desired action is learned through an iterative training process. The training process attempts to increase the rate of learning for new words appearing in messages that were initially misclassified. This learning process, however, requires that the system be able to query the user as to the correctness of the action it proposes (e.g., "Would you like to be connected with the furniture department?"), and subsequently re-learn those messages which produce undesirable recommendations. 
Additionally, the system presently under discussion cannot be used in applications where the user speaks "naturally" or without making a special effort to be understood; it depends on each word being spoken in isolation. Related research is described in Gorin, et al. (A. L. Gorin, L. G. Miller and S. E. Levinson; "Some Experiments in Spoken Language Acquisition"; IEEE ICASSP; 1993; Volume I, pp. 505-508).
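The word-to-action association described above for the Farrell, et al. system can be illustrated with a minimal sketch. It is not the published algorithm: the class name `WordActionNet`, the additive update rule, the learning rate, and the example words and departments are all assumptions chosen for illustration, and the acoustic-similarity step used to detect unknown words is omitted entirely.

```python
class WordActionNet:
    """Toy one-layer network: one input node per word, one output node per action."""

    def __init__(self, actions):
        self.actions = list(actions)
        self.weights = {}  # word -> list of association weights, one per action

    def add_word(self, word):
        # A new vocabulary word becomes a new input node with zero weights.
        self.weights.setdefault(word, [0.0] * len(self.actions))

    def scores(self, words):
        # Each action's score is the sum of the weights of the known words present.
        totals = [0.0] * len(self.actions)
        for word in words:
            for i, weight in enumerate(self.weights.get(word, ())):
                totals[i] += weight
        return totals

    def predict(self, words):
        s = self.scores(words)
        return self.actions[s.index(max(s))]

    def learn(self, words, correct_action, rate=0.5):
        # Hypothetical update: strengthen links to the confirmed action,
        # weaken links to every other action.
        target = self.actions.index(correct_action)
        for word in words:
            self.add_word(word)
            for i in range(len(self.actions)):
                self.weights[word][i] += rate if i == target else -rate


net = WordActionNet(["furniture", "hardware"])
net.learn(["sofa", "delivery"], "furniture")
net.learn(["hammer", "delivery"], "hardware")
guess = net.predict(["sofa", "please"])
```

In this sketch the call to `learn` stands in for the user's confirmation of the correct action, which is the feedback the text identifies as a requirement of this kind of system.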
A system similar to that proposed by Farrell, et al., and Gorin, et al. and apparently motivated by it has been described by Rose, et al. (R. C. Rose, E. I. Chang and R. P. Lippmann; "Techniques for Information Retrieval from Voice Messages"; IEEE ICASSP; 1991, Volume I, pp. 317-320). The latter group proposed the use of a word spotting system in conjunction with a one-layer neural network classifier whose weights are trained to minimize classification error. This system uses the spotting score associated with each putative hit as an indication of the "accuracy" of a given event. Unlike the Farrell, et al. and Gorin, et al. system, however, it does not have the capacity to learn new words through interactive use.
J. R. Rohlicek and D. Ayuso, et al. (1992), supra; and Denenberg, et al. (L. Denenberg and H. Gish; "Gisting Conversational Speech in Real Time"; IEEE ICASSP; 1993, Volume II; pp. 131-134) have proposed and built a system for "gisting" conversational speech. The application to which this system was addressed was two-way communication between air traffic controllers and airplane pilots. The system attempts to determine approximately what the controller or pilot has said in each transmission; i.e., get the "gist" of the speech, defined as the flight scenario (such as take-off or landing) that a given aircraft is in. This task is made tractable by the constrained nature of the dialogue between pilots and controllers. Typically each transmission must begin with a flight identification and then contain one or more instructions whose number is known in advance. For this reason, the word recognizer comprising one component of the gisting system is able to make use of finite state networks specifically designed to model each of a number of commonly occurring words and phrases; less commonly occurring words are not as explicitly modeled, but instead are represented by a phoneme or "filler" loop.
Message classification is performed in the gisting system by forming a binary vector representing each word or phrase present in a recognized utterance, which may well be errorful. This vector is taken as the input to a classification tree that has been previously constructed based on some amount of recognized training data. See Breiman, et al. (L. Breiman, J. H. Friedman, R. A. Olshen and C. J. Stone; "Classification and Regression Trees"; Wadsworth International Group, Belmont, Calif., 1984). The tree performs the desired classification based on an optimal set of "questions" about the absence or presence of sets of words and phrases. A variation on the basic approach attempts to reduce the effect of recognition errors by using for classification the N-best or most likely word sequences instead of only the single best.
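The binary-vector and tree-question steps described above can be sketched as follows. The vocabulary, the tree, and the function names are hypothetical; in particular the tree here is written by hand and asks about single words only, whereas the system described constructs its tree from recognized training data and may ask about sets of words and phrases.

```python
def presence_vector(recognized_words, vocabulary):
    """Return 1 for each vocabulary term present in the recognized words, else 0."""
    found = set(recognized_words)
    return [1 if term in found else 0 for term in vocabulary]


# Hypothetical vocabulary of spotted words (assumed for illustration).
VOCAB = ["cleared", "takeoff", "land", "runway"]

# Hand-written toy tree: each internal node asks whether the vocabulary term
# at "index" is present; leaves are scenario labels.
TREE = {
    "index": 1,  # is "takeoff" present?
    "present": "takeoff",
    "absent": {
        "index": 2,  # is "land" present?
        "present": "landing",
        "absent": "neither",
    },
}


def classify(vector, node=TREE):
    """Walk the tree, answering each presence/absence question from the vector."""
    while isinstance(node, dict):
        node = node["present"] if vector[node["index"]] else node["absent"]
    return node


vec = presence_vector("cleared for takeoff runway two".split(), VOCAB)
label = classify(vec)
```

Because the vector records only presence or absence, a recognizer error changes the outcome only when it flips an entry that some question along the traversed path actually inspects.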
Gillick, et al. (L. Gillick and J. Baker, et al.; "Application of Large Vocabulary Continuous Speech Recognition to Topic and Speaker Identification Using Telephone Speech"; IEEE ICASSP; 1993, Volume II, pp. 471-474) have developed a system for topic identification for conversational speech over the telephone, as provided by the NIST Switchboard Corpus. Because this system is intended to be used on general, unconstrained speech, it uses a large vocabulary and a bigram or stochastic "language" model. The system employs a set of "keywords" that are relevant to a given topic. These words are found by taking text transcripts, compiled by human transcribers, and building contingency tables for each possible keyword; a contingency table tabulates the number of conversations in which a given word appeared seldom or often and can be used as the basis of a hypothesis test as to whether the frequency of occurrence of a word is significantly different across two or more topics. The system of Gillick et al. also uses text transcripts to construct topic models, which in this case are unigram or multinomial models of topic-conditioned keyword frequency. Topic classification is performed by running the large vocabulary word recognizer on an input speech message and scoring the resulting errorful transcript against each competing topic model; the conversation is classified as belonging to that topic whose model scores highest. In this system, no attempt is made to associate a score indicative of the accuracy of the recognizer output with any word or phrase; i.e., none of the statistics generated during the recognition process contribute to the subsequent topic classification process.
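Scoring a recognized transcript against competing unigram topic models, as described above, can be sketched in a few lines. This is an illustrative sketch rather than the Gillick et al. implementation: the keyword set, the training sentences, the add-one smoothing, and the function names are all assumptions, and the contingency-table keyword selection is not shown.

```python
import math
from collections import Counter


def train_unigram(transcripts, keywords, alpha=1.0):
    """Estimate smoothed keyword probabilities for one topic from its transcripts."""
    counts = Counter(
        word for text in transcripts for word in text.lower().split() if word in keywords
    )
    total = sum(counts.values()) + alpha * len(keywords)
    return {word: (counts[word] + alpha) / total for word in keywords}


def score(model, words):
    """Log-likelihood of the keyword occurrences found in a word sequence."""
    return sum(math.log(model[word]) for word in words if word in model)


def classify(models, recognized_text):
    """Pick the topic whose model gives the recognized text the highest score."""
    words = recognized_text.lower().split()
    return max(models, key=lambda topic: score(models[topic], words))


# Hypothetical keywords and transcripts, assumed for illustration only.
keywords = {"rain", "sunny", "goal", "team"}
training = {
    "weather": ["rain today then sunny", "more rain tomorrow"],
    "sports": ["the team scored a goal", "team won with a late goal"],
}
models = {topic: train_unigram(texts, keywords) for topic, texts in training.items()}
result = classify(models, "i think the team will get a goal despite rain")
```

Every recognized keyword contributes to the score with equal trust, which mirrors the limitation noted in the text: nothing from the recognition process tells the classifier how reliable each hypothesized word is.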
In summary, techniques for discrimination of naturally spoken speech messages by topic have been described in the prior art. Several simply use a speech recognizer to produce a hypothesized transcription of the spoken data, which is then input to a text-based topic discrimination system trained only on correctly transcribed text. Rose et al. (1991) use text training data but also incorporate some characteristics of their word spotter in the design of their topic discriminator.
Although the prior techniques may be applicable in certain situations, there are limitations that are addressed by the current invention. In particular, each of the prior techniques either requires transcribed speech data for training the topic discriminator, does not make use of a phrase spotter as a detector for events useful for topic discrimination, does not use word or phrase spotting confidence measures to improve performance, or requires some sort of user feedback for training or during actual operation.