Determining the probability of data sequences is a difficult problem with several applications. For example, if a sequence of medical procedures seems unlikely, we might want to determine whether the performing physician is defrauding the medical insurance company. In addition, if the sequence of outputs from sensors on a nuclear facility or car is improbable, it might be time to check for component failure. While there are many possible applications, this description of the invention will focus mostly on speech processing applications.
Current speech recognition algorithms use language models to increase recognition accuracy by preventing the programs from outputting nonsense sentences. The grammars currently used are typically stochastic, meaning that they are used to estimate the probability of a word sequence--a goal of the present invention. For example, in order to determine the probability of the sentence "the cow's horn honked", an algorithm might use stored knowledge about the probability of "cow's" following "the", "horn" following "cow's", and "honked" following "horn". Grammars such as these are called bigram grammars because they use stored information about the probability of two-word sequences.
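As a minimal sketch of the bigram calculation described above, the following Python example estimates each two-word probability by relative frequency and multiplies the estimates along the sentence. The toy corpus, the function names, and the absence of smoothing are assumptions made for illustration only; they are not taken from any particular recognizer.

```python
from collections import Counter

def train_bigram(corpus):
    """Count how often each word occurs as a predecessor, and each two-word sequence."""
    predecessors, bigrams = Counter(), Counter()
    for sentence in corpus:
        for prev, word in zip(sentence, sentence[1:]):
            predecessors[prev] += 1
            bigrams[(prev, word)] += 1
    return predecessors, bigrams

def bigram_probability(sentence, predecessors, bigrams):
    """Multiply unsmoothed estimates of P(word | previous word) along the sentence."""
    prob = 1.0
    for prev, word in zip(sentence, sentence[1:]):
        if predecessors[prev] == 0:
            return 0.0  # previous word never seen with a successor in training
        prob *= bigrams[(prev, word)] / predecessors[prev]
    return prob

# Hypothetical toy corpus, invented for this illustration.
corpus = [
    ["the", "cow's", "horn", "curved"],
    ["the", "car's", "horn", "honked"],
    ["the", "cow's", "tail", "twitched"],
]
predecessors, bigrams = train_bigram(corpus)

# P("cow's"|"the") * P("horn"|"cow's") * P("honked"|"horn") = (2/3) * (1/2) * (1/2)
p = bigram_probability(["the", "cow's", "horn", "honked"], predecessors, bigrams)
print(p)
```

In this toy corpus the sentence never occurs, yet it receives a probability of 1/6 because each of its two-word sequences was observed.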
Notice that, although cow's horns typically do not honk, a bigram grammar would consider this a reasonable sentence because the word "honked" frequently follows "horn". This problem can be alleviated by finding the probabilities of longer word sequences. A speech recognition algorithm using the probabilities of three-word sequences (trigrams) would be unlikely to output the example sentence because the probability of the sequence "cow's horn honked" is small. Using sequences of four, five, six, or more words should improve recognition even more.
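The effect of the longer context can be sketched in the same way. The Python example below estimates three-word probabilities by relative frequency; the toy corpus and the function name are hypothetical, and no smoothing is applied, so an unseen trigram receives a probability of exactly zero rather than merely a small one.

```python
from collections import Counter

def trigram_probability(sentence, corpus):
    """Multiply unsmoothed estimates of P(w3 | w1, w2) along the sentence."""
    contexts, trigrams = Counter(), Counter()
    for s in corpus:
        for w1, w2, w3 in zip(s, s[1:], s[2:]):
            contexts[(w1, w2)] += 1
            trigrams[(w1, w2, w3)] += 1
    prob = 1.0
    for w1, w2, w3 in zip(sentence, sentence[1:], sentence[2:]):
        if contexts[(w1, w2)] == 0:
            return 0.0  # two-word context never observed in training
        prob *= trigrams[(w1, w2, w3)] / contexts[(w1, w2)]
    return prob

# Hypothetical toy corpus, invented for this illustration.
corpus = [
    ["the", "cow's", "horn", "curved"],
    ["the", "car's", "horn", "honked"],
    ["the", "cow's", "tail", "twitched"],
]

# "cow's horn honked" never occurs in the corpus, so the sentence is rejected.
p_cow = trigram_probability(["the", "cow's", "horn", "honked"], corpus)
# "car's horn honked" does occur, so this sentence keeps a nonzero probability.
p_car = trigram_probability(["the", "car's", "horn", "honked"], corpus)
print(p_cow, p_car)
```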
While it is theoretically possible to calculate the probabilities of all three-word or four-word sequences, the number of probabilities that must be estimated increases exponentially with the length of the word sequence: if there are N words in the grammar, then N*N probabilities must be estimated for a bigram grammar, N*N*N probabilities for a trigram grammar, etc. IBM made a trigram grammar for a 20,000 word vocabulary for the TANGORA speech recognition system. To do this, IBM used 250 million words of training text. To give a better idea of the size of a 250 million word training text, consider that the complete works of Shakespeare contain roughly 1 million words. Even a 250 million word training set, which is roughly 250 times the size of the complete works of Shakespeare, was too small. After all, at least 20,000^3 (8 x 10^12) words are needed to make a trigram grammar for a 20,000 word vocabulary--several million times as large as the complete works of Shakespeare. As pointed out in 1989 by the developers of Carnegie Mellon's Sphinx system, developing good language models will probably be a very slow process for speech recognizers because most companies do not have the computer power or databases necessary to make good stochastic grammars for large vocabularies. This is still true today.
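The arithmetic behind these figures can be checked with a short Python sketch. The vocabulary size, the training-set size, and the approximate size of the complete works of Shakespeare are the figures quoted above; the function and variable names are chosen here for illustration.

```python
def ngram_parameter_count(vocabulary_size, order):
    """Number of probabilities in an n-gram grammar: N to the power n."""
    return vocabulary_size ** order

VOCABULARY_SIZE = 20_000        # vocabulary size quoted in the text
TRAINING_WORDS = 250_000_000    # training text size quoted in the text
SHAKESPEARE_WORDS = 1_000_000   # rough figure quoted in the text

bigram_params = ngram_parameter_count(VOCABULARY_SIZE, 2)   # 400,000,000
trigram_params = ngram_parameter_count(VOCABULARY_SIZE, 3)  # 8,000,000,000,000

# Trigram probabilities to be estimated per word of available training text.
params_per_training_word = trigram_params // TRAINING_WORDS  # 32,000

# Size of the trigram table measured in "complete works of Shakespeare".
shakespeare_multiples = trigram_params // SHAKESPEARE_WORDS  # 8,000,000

print(bigram_params, trigram_params, params_per_training_word, shakespeare_multiples)
```

Under these figures there are about 32,000 trigram probabilities for every word of training text, which is why even the 250 million word set is far too small.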
The invention described herein allows the probability of a sequence to be estimated by forming a model that assumes that sequences are produced by a point moving smoothly through a multidimensional space called a continuity map. In the model, symbols are output periodically as the point moves, and the probability of producing a given symbol at event t is determined by the position of the point at event t. This method of estimating the probability of a symbol sequence is not only very different from previous approaches, but has the unique property that when the symbols actually are produced by something moving smoothly, the algorithm can obtain information about the moving object. For example, as discussed below, when applied to the problem of estimating the probability of speech signals, the position of the model's slowly moving point is highly correlated with the position of the tongue, which underlies the production of speech sounds. Because the position of the point is correlated with the position of the speech articulators, a position in the continuity map is sometimes referred to herein as a pseudo-articulator position.
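The general idea of such a model can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not the formulation of the invention: the map is two-dimensional, each symbol is given a hypothetical center in the map, and the probability of a symbol at a position is taken to be a normalized Gaussian bump around that center.

```python
import math

# Hypothetical 2-D continuity map: one assumed center position per symbol.
SYMBOL_CENTERS = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (0.0, 1.0)}

def symbol_probabilities(position, centers=SYMBOL_CENTERS, width=0.5):
    """P(symbol | position), modeled here as normalized Gaussian bumps (assumed form)."""
    scores = {}
    for symbol, center in centers.items():
        sq_dist = sum((p - c) ** 2 for p, c in zip(position, center))
        scores[symbol] = math.exp(-sq_dist / (2.0 * width ** 2))
    total = sum(scores.values())
    return {symbol: score / total for symbol, score in scores.items()}

def sequence_log_probability(symbols, path):
    """Sum of log P(symbol at event t | position at event t) along the path."""
    if len(symbols) != len(path):
        raise ValueError("need exactly one map position per symbol")
    return sum(
        math.log(symbol_probabilities(position)[symbol])
        for symbol, position in zip(symbols, path)
    )

# A smooth path that drifts from the region of "a" toward the region of "b".
smooth_path = [(0.0, 0.0), (0.25, 0.0), (0.5, 0.0), (0.75, 0.0), (1.0, 0.0)]

likely = sequence_log_probability("aabbb", smooth_path)    # follows the path
unlikely = sequence_log_probability("bbaaa", smooth_path)  # runs against it
print(likely, unlikely)
```

In this sketch a symbol sequence that is consistent with the smooth motion of the point scores higher than one that is not, which is the property the model relies on when ranking sequences.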
These findings are important because techniques for recovering articulator positions, or pseudo-articulator positions, from acoustics have several potential applications. For example, computer speech recognition is performed more accurately when the computer is provided with information about both articulator positions and acoustics, even when the articulator positions are estimated from speech. Since speaker recognition is a very similar problem to speech recognition, techniques that use information about articulator positions are expected to also improve speaker recognition processes. Furthermore, since articulator positions can be transmitted with relatively few bits per second, speech information can be transmitted using fewer bits per second if speech sounds are converted to articulator positions, the articulator positions transmitted, and the articulator positions converted back to speech sounds. Finally, the relationship between articulator positions and acoustics may be used to improve speech synthesis or to perform transformations to make one person's voice sound like that of another.
There have been several attempts to take advantage of articulation information to improve speech recognition (Rose, Schroeter & Sondhi, 1996). Some researchers have obtained improvements in speech recognition performance by building knowledge about articulation into hidden Markov models (HMMs) (Deng & Sun, 1994), or by learning the mapping between acoustics and articulation using concurrent measurements of speech acoustics and human speech articulator positions (Zlokarnik, 1995). Others have worked toward incorporating articulator information by using forward models (articulatory speech synthesizers) to study the relationship between speech acoustics and articulation (Schroeter & Sondhi, 1994).
However, prior art methods of learning the mapping between speech sounds and articulator positions are inadequate. The theory of linear prediction shows that, given certain strict assumptions about the characteristics of vocal tracts and the propagation of sound through acoustic tubes, equations can be derived that allow the recovery of the shape of the vocal tract from speech acoustics for some speech sounds. However, not only is linear prediction theory inapplicable to many common speech sounds (e.g., nasals, fricatives, stops, and laterals), but when the assumptions underlying linear prediction are relaxed to make more realistic models of speech production, the relationship between acoustics and articulation becomes mathematically intractable.
Techniques for recovering the articulator positions by learning the mapping from acoustics to articulation from a data set consisting of simultaneously collected measurements of articulator positions and speech sounds also have problems. While it is easy to collect recordings of speech, it is very difficult to obtain measurements of articulator positions while simultaneously recording speech. In fact, with the current technology, it is impossible to measure some potentially important information about articulator positions (e.g., the three dimensional shape of the tongue) while also recording speech sounds.
Even using articulatory synthesizers to create speech sounds, and then learning the mapping from the synthesized speech to the articulatory model parameters is problematic. Currently available articulatory synthesizers make many simplifying assumptions that can lead to marked differences between synthesized and actual speech and also call into question the accuracy of the acoustic/articulatory mapping derived from articulatory models. In fact, the mapping between speech acoustics and speech articulation for articulatory speech synthesizers is strongly dependent on assumptions underlying the synthesizers and appears to differ in important ways from the mapping observed for human speech production.
Even attempts to use statistical learning techniques to learn (or at least use) relationships between speech sounds and articulator positions, particularly for speech recognition, have been insufficient due to lack of knowledge about articulation. For example, some researchers have attempted to build constraints into HMMs to make the models infer information about articulation as a step toward speech recognition, but the constraints used in current systems "are rather simplistic and contain several unrealistic aspects" (Deng & Sun, 1994, p. 2717). The fact that the constraints are unrealistic is a serious problem, because, as more assumptions about articulator motions are built into existing models, there is a greater chance of incorporating invalid constraints and potentially decreasing recognition performance.
One previous technique, continuity mapping, shares a desirable characteristic with the invention described herein: continuity mapping allows the mapping from speech sounds to articulator positions to be estimated using only acoustic speech data. However, continuity mapping in the prior art requires only that adjacent sounds be made by adjacent articulator positions, i.e., a speaker cannot move articulators in a disjointed manner. Continuity mapping cannot estimate the probability of speech sequences given articulator trajectories, find the mapping that maximizes the probability of the data, or find the smooth path that maximizes the probability of a data sequence (and therefore minimizes the number of bits that need to be transmitted in addition to the smooth paths). Furthermore, continuity mapping estimates of articulator positions are not nearly as accurate as the estimation of articulator positions in accordance with the present invention (Hogden, 1996).
Accordingly, an object of the present invention is to provide a sequence of representations, called pseudo-articulator positions, that provide a maximum probability of producing an input sequence of speech sounds.