The present invention relates generally to speech processing, and, more particularly, to speech processing using maximum likelihood continuity mapping.
While speech recognition systems are commercially available for limited domains, state-of-the-art systems have only about a 60%-65% word recognition rate on casual speech, e.g., telephone conversations, as opposed to speech produced by users who are familiar with and trying to be understood by a speech recognition system. Since speaking rates of 200 words per minute are not uncommon in casual speech, a 65% word recognition accuracy implies approximately 70 errors per minute, an unacceptably high rate for most applications. Furthermore, recognition performance is not improving rapidly. Improvements in word recognition accuracy of a few percent are considered "big" improvements, and recognition rates of the best systems on recorded telephone conversations have been generally stagnant in recent years.
Hidden Markov models (HMMs) are among the most popular tools for performing computer speech recognition (Rabiner and Juang, An introduction to hidden Markov models, IEEE Acoustics, Speech, and Signal Processing Magazine (1986)). One of the primary reasons that HMMs typically outperform other speech recognition techniques is that the parameters used for recognition are determined by the data, not by preconceived notions of what the parameters should be. HMMs can then deal with intra- and inter-speaker variability despite a limited knowledge of how speech signals vary and despite an often limited ability to correctly formulate rules describing variability and invariance in speech. In fact, it is often the case that when HMM parameter values are constrained using (possibly inaccurate) knowledge of speech, recognition performance decreases.
Nonetheless, many of the assumptions underlying HMMs are known to be inaccurate, and improving on these inaccurate assumptions within the HMM framework can be computationally expensive. Thus, various researchers have argued that, by using probabilistic models that more accurately embody the process of speech production, more accurate speech recognition should be achieved.
A prior art technique called Maximum Likelihood Continuity Mapping (MALCOM) provides a means of learning a more physiologically realistic stochastic model of speech as well as providing a method for speech processing once the stochastic model has been learned. See U.S. Pat. No. 6,052,662, issued Apr. 18, 2000, and incorporated herein by reference. The mapping learned by MALCOM is embodied in a continuity map, which is a continuous, multidimensional space over which probability density functions are positioned, where the probability density functions give the probability of a position in the space conditioned on an acoustic signal. The assumptions underlying MALCOM are well-founded. In fact, the main (and surprisingly powerful) assumption used by MALCOM is that articulator motions produced by muscle contractions have little energy above some low cut-off frequency, which is easily verified simply by calculating spectra of articulator paths.
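The spectral check mentioned above can be illustrated with a minimal sketch. Everything numeric here is an assumption made for illustration: the synthetic path (a 3 Hz movement plus a small 40 Hz jitter), the 100 Hz sampling rate, and the 15 Hz cut-off are hypothetical and are not taken from the source. The helper names are likewise invented. A naive DFT is used so that only the Python standard library is needed.

```python
import cmath
import math


def power_spectrum(signal):
    """Power at each non-negative DFT bin, computed with a naive O(n^2) DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def low_frequency_energy_fraction(signal, sample_rate_hz, cutoff_hz):
    """Fraction of the non-DC spectral energy that lies at or below cutoff_hz."""
    spectrum = power_spectrum(signal)
    bin_hz = sample_rate_hz / len(signal)  # frequency spacing between DFT bins
    total = sum(spectrum[1:])              # skip bin 0 (the mean position)
    low = sum(p for k, p in enumerate(spectrum) if k >= 1 and k * bin_hz <= cutoff_hz)
    return low / total


# Hypothetical articulator trace: one second sampled at 100 Hz (assumed values).
SAMPLE_RATE_HZ = 100
path = [
    math.sin(2 * math.pi * 3 * t / SAMPLE_RATE_HZ)            # slow 3 Hz movement
    + 0.05 * math.sin(2 * math.pi * 40 * t / SAMPLE_RATE_HZ)  # small 40 Hz jitter
    for t in range(SAMPLE_RATE_HZ)
]
fraction = low_frequency_energy_fraction(path, SAMPLE_RATE_HZ, cutoff_hz=15.0)
print(f"energy at or below cut-off: {fraction:.4f}")
```

For this synthetic trace nearly all of the energy falls below the assumed cut-off; with measured articulator data the same function would be applied to each recorded coordinate.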
MALCOM does not work directly on speech acoustics, but instead works on sequences of categorical data values, such as sequences of letters, words, or phonemes. The fact that MALCOM works on sequences of categorical data values is not a problem for processing digitized speech (a sequence of continuous valued amplitudes) because it is a simple matter to convert recorded speech to sequences of symbols using, e.g., Vector Quantization (VQ) (Gray, R., Vector Quantization, IEEE Acoustics, Speech, and Signal Processing Magazine, pp. 4-29 (1984)). Unfortunately, MALCOM works with only one sequence at a time. This is a disadvantage when trying to apply MALCOM to problems such as speech recognition, in which relationships between two time series (e.g., recorded speech sounds and phonetic labels) must be learned. In accordance with the present invention, MALCOM is modified to work with more than one observable sequence at a time to provide Conditional-Observable Maximum Likelihood Continuity Mapping (CO-MALCOM).
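As a minimal sketch of the vector quantization step described above, the following maps each acoustic feature vector to the index of its nearest codebook vector, producing the kind of categorical sequence on which MALCOM operates. The two-dimensional codebook and frames are hypothetical values chosen for illustration, and codebook training is omitted: the codebook is simply assumed to be given.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, codebook):
    """Replace each feature vector with the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda i: squared_distance(frame, codebook[i]))
        for frame in frames
    ]


# Hypothetical codebook and feature frames (assumed values, not from the source).
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (0.2, 1.8), (0.8, 0.9)]

codes = quantize(frames, codebook)
print(codes)  # [0, 1, 2, 1]
```

The resulting list of integer codes plays the role of the sequence of speech codes; in practice the feature vectors would come from short-time analysis of the recorded signal.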
Various objects, advantages and novel features of the invention will be set forth in part in the description which follows, and in part will become apparent to those skilled in the art upon examination of the following or may be learned by practice of the invention. The objects and advantages of the invention may be realized and attained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
The present invention, as embodied and broadly described herein, is directed to a computer-implemented method for the recognition of speech and speech characteristics. Parameters of first probability density functions are initialized, where the first probability density functions map between the symbols in the vocabulary of one or more sequences of speech codes that represent speech sounds and a continuity map. Parameters of second probability density functions are also initialized, where the second probability density functions map between the elements in the vocabulary of one or more desired sequences of speech transcription symbols and the continuity map. The parameters of the probability density functions are then trained to maximize the probabilities of the desired sequences of speech transcription symbols. A new sequence of speech codes is then input to the continuity map having the trained first and second probability density function parameters. A smooth path is identified on the continuity map that has the maximum probability for the new sequence of speech codes. The probability of each speech transcription symbol for each input speech code can then be output.
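The last three steps of this summary (input a new code sequence, find a smooth maximum-probability path, output symbol probabilities) can be illustrated with a toy sketch. It is not the procedure of the invention; it rests on several simplifying assumptions that do not come from the source: a one-dimensional continuity map, equal-variance Gaussian density functions, already-trained (hypothetical) means, a uniform prior over transcription symbols, and "smooth" interpreted as having no DFT content above a chosen bin. Under those assumptions the most probable smooth path is the least-squares projection of the per-code means onto the low-frequency bins. Training is omitted entirely.

```python
import cmath
import math

# Hypothetical, already-trained parameters on a 1-D continuity map (assumed values).
CODE_MEANS = {0: -1.0, 1: 0.0, 2: 1.0}  # means of p(position | speech code)
SYMBOL_MEANS = {"a": -1.0, "b": 1.0}    # means of p(position | transcription symbol)
SIGMA = 1.0                             # shared standard deviation (assumption)


def smooth_max_probability_path(codes, max_bin):
    """Most probable path among those with no DFT content above max_bin.

    With equal-variance Gaussians, maximizing the product of p(x_t | c_t)
    is a least-squares fit to the per-code means, so the constrained
    maximizer is the projection of those means onto the low-frequency bins.
    """
    n = len(codes)
    means = [CODE_MEANS[c] for c in codes]
    coeffs = [
        sum(means[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Keep bins whose (two-sided) frequency index is at most max_bin.
    kept = [c if min(k, n - k) <= max_bin else 0.0 for k, c in enumerate(coeffs)]
    return [
        (sum(kept[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def gaussian(x, mean, sigma):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def symbol_posteriors(x):
    """P(symbol | position x) by Bayes' rule with a uniform symbol prior."""
    likelihoods = {s: gaussian(x, m, SIGMA) for s, m in SYMBOL_MEANS.items()}
    total = sum(likelihoods.values())
    return {s: value / total for s, value in likelihoods.items()}


new_codes = [0, 0, 0, 0, 2, 2, 2, 2]  # hypothetical new sequence of speech codes
path = smooth_max_probability_path(new_codes, max_bin=1)
posteriors = [symbol_posteriors(x) for x in path]
best_symbols = [max(p, key=p.get) for p in posteriors]
print(best_symbols)
```

Each entry of `posteriors` gives, for one input speech code, a probability for every transcription symbol, which mirrors the output described in the final sentence of the summary. A fuller treatment would use multidimensional maps, unequal variances, and trained rather than hand-set parameters.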