The broad goal of speech recognition technology is to create machines that can receive spoken information and act appropriately upon that information. In order to maximize benefit and universal applicability, speech recognition systems (SRSs) should be capable of recognizing continuous speech, and should be able to recognize multiple speakers with possibly diverse accents, speaking styles, and different vocabularies and grammatical tendencies. Effective SRSs should also be able to recognize poorly articulated speech, and should have the ability to recognize speech in noisy environments.
Acoustic models of sub-word sized speech units form the backbone of virtually all SRSs. Many systems use phonemes to define the dictionary, but some SRSs use allophones, or phones. The best recognition performance is typically obtained when acoustic models are generated for the sub-word units conditioned on their context; such models are called context-dependent sub-word models. When the chosen sub-word unit is the phoneme, context-dependent modeling can capture both allophonic variation and coarticulation. When the chosen sub-word unit is the phone, context-dependent modeling attempts to capture only the effects of coarticulation.
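As an illustration of context-dependent sub-word units, the following minimal sketch expands a sequence of sub-word units into labels that carry the left and right neighbor of each unit. The function name `to_triphones`, the `left-center+right` label format, and the `sil` boundary symbol are assumptions chosen for illustration and are not taken from any system described here.

```python
def to_triphones(units, boundary="sil"):
    """Expand a sequence of sub-word units into context-dependent labels.

    Each label has the hypothetical form left-center+right, so the same
    center unit receives a different label in each distinct context.
    """
    padded = [boundary] + list(units) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


# The three units of the word "cat", padded with silence at both ends.
print(to_triphones(["k", "ae", "t"]))
# prints ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
```

A separate acoustic model would then be associated with each distinct label, which is why the number of models grows quickly with the size of the unit inventory.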
Once a speaker has formed a thought to be communicated to the listener, the speaker constructs a phrase or sentence by choosing from a finite collection of mutually exclusive sounds, or phonemes. The phoneme is the basic theoretical unit for describing how speech conveys linguistic meaning. As such, the phonemes of a language comprise a minimal theoretical set of units that is sufficient to convey all meaning in the language; this is to be compared with the actual sounds that are produced in speaking, which speech scientists call allophones. For American English, there are approximately 50 phonemes, which are made up of vowels, semivowels, diphthongs, and consonants. Each phoneme can be considered to be a code that consists of a unique set of articulatory gestures. If speakers could exactly and consistently produce these phoneme sounds, speech would amount to a stream of discrete codes. However, because of many different factors including, for example, accents, gender, and coarticulatory effects, every phoneme has a variety of acoustic manifestations in the course of flowing speech. Thus, from an acoustical point of view, the phoneme actually represents a class of sounds that convey the same meaning.
The most abstract problem involved in speech recognition is endowing the speech recognition system with the appropriate language constraints. Whether phones, phonemes, syllables, or words are viewed as the basic unit of speech, language constraints, or linguistic constraints, are generally concerned with how these fundamental units may be concatenated, in what order, in what context, and with what intended meaning. For example, if a speaker is asked to voice a phoneme in isolation, the phoneme will be clearly identifiable in the acoustic waveform. However, when phonemes are spoken in context, their boundaries become difficult to label because of the physical properties of the speech articulators. Since the vocal tract articulators consist of human tissue, their positioning from one phoneme to the next is executed by movement of the muscles that control the articulators. As such, there is a period of transition between phonemes that can modify the manner in which a phoneme is produced. Therefore, associated with each phoneme is a collection of allophones, or variations on phones, that represent acoustic variations of the basic phoneme unit. Allophones represent the permissible freedom allowed within a particular language in producing a phoneme, and this flexibility is dependent on the phoneme as well as on the phoneme position within an utterance.
Prior art SRSs can recognize phonemes uttered by a particular speaker. A speaker-dependent SRS uses the utterances of a single speaker to learn the models, or parameters, that characterize the SRS's internal model of the speech process. The SRS is then used specifically for recognizing the speech of its trainer. Accordingly, the speaker-dependent SRS will yield relatively high recognition accuracy compared with a speaker-independent SRS. Prior art SRSs also perform speaker-independent recognition. The speaker-independent SRS is trained by multiple speakers and used to recognize many speakers who may be outside of the training population. Although a speaker-dependent SRS is more accurate, its disadvantage is the need to retrain the system each time it is to be used with a new speaker.
At present, the most popular approach in speech recognition is statistical learning, and the most successful statistical learning technique is the hidden Markov model (HMM). HMMs are capable of robust and succinct modeling of speech, and efficient maximum-likelihood algorithms exist for HMM training and recognition. To date, HMMs have been successfully applied to the following constrained tasks: speaker-dependent recognition of isolated words, continuous speech, and phones; small-vocabulary speaker-independent recognition of isolated words; and speaker-independent phone recognition for large-vocabulary continuous and isolated-word recognition.
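One standard way to score an observation sequence against an HMM is the forward algorithm, sketched below for a discrete-output model. The function name, the argument layout, and the toy two-state model are assumptions made for illustration; the sketch assumes a non-empty observation sequence and omits the scaling or log-domain arithmetic that a practical recognizer would need for long utterances.

```python
def forward_likelihood(obs, start, trans, emit):
    """Return P(obs | model) for a discrete-output HMM.

    obs   : non-empty list of observation symbol indices
    start : start[s] is the probability of beginning in state s
    trans : trans[p][s] is the probability of moving from state p to state s
    emit  : emit[s][o] is the probability of state s emitting symbol o
    """
    n = len(start)
    # alpha[s] = probability of the observations so far, ending in state s.
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
            for s in range(n)
        ]
    return sum(alpha)


# Hypothetical two-state, left-to-right model with two output symbols.
start = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.2, 0.8]]
print(forward_likelihood([0, 1], start, trans, emit))
```

In isolated-word recognition, for example, such a score could be computed for each candidate word model and the highest-scoring model selected.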
HMMs provide a sound basis for modeling both the interspeaker and intraspeaker variability of natural speech. However, to accurately model the distributions of real speech spectra, it is necessary to have complex output distributions. For example, continuous density HMM systems require multiple Gaussian mixture components to achieve good performance. Furthermore, context-dependent triphones are required to deal with contextual effects such as coarticulation. Thus, a speaker-independent continuous speech HMM system will generally contain a large number of context-dependent models, each of which contains a large number of parameters. Unfortunately, the ability to arbitrarily increase model complexity is constrained by the limited amount of training data and the statistical confidence of this data. Thus, the key problem to be faced when building an HMM-based continuous speech recognizer is maintaining the balance between model complexity, the corresponding processor requirements, and the available training data, and finding the best method by which to estimate the model parameters.
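The scale of the problem can be made concrete with a rough count of stored parameters. The sketch below assumes diagonal-covariance Gaussian mixtures, one model for every possible triphone, and no parameter sharing; the function name and the values of 3 states per model, 8 mixture components per state, and 39-dimensional feature vectors are illustrative assumptions, while the inventory of 50 units follows the approximate phoneme count given above. Transition probabilities are ignored.

```python
def gmm_hmm_parameter_count(num_units, states_per_model,
                            mixtures_per_state, feature_dim):
    """Count stored output-distribution values in an untied triphone system."""
    models = num_units ** 3                # every left/center/right combination
    per_mixture = 2 * feature_dim + 1      # means, variances, one weight
    per_state = mixtures_per_state * per_mixture
    return models * states_per_model * per_state


# 50 units, 3 states, 8 mixtures, 39-dimensional features (all but 50 assumed).
print(gmm_hmm_parameter_count(50, 3, 8, 39))
# prints 237000000
```

Many of the possible triphones occur rarely or never in a given training corpus, so a count of this order cannot be estimated reliably without some form of smoothing, tying, or clustering.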
Traditional methods of dealing with this problem tend to be model-based. For example, for discrete and tied-mixture systems it is common to interpolate between triphones, biphones, and monophones. One prior art technique of speaker-independent phone recognition generates a model based on multiple codebooks of linear predictive coding-derived parameters for a number of phones and then applies co-occurrence smoothing to determine the similarity between every pair of codewords from all phones, smoothing the individual distributions accordingly. However, a speaker-independent phone model is unstable because, in actual speech, the realization of a phone depends on the preceding and the following phone; thus, each different context of a phone requires a different model, which increases the memory requirements of the speech recognition system and decreases system accuracy, efficiency, and speed.
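Interpolation between models of differing specificity can be sketched for discrete output distributions as a weighted sum. In the sketch below, the function name and the fixed weights are assumptions for illustration; in practice the weights would typically be estimated from data, for example on held-out material, rather than fixed by hand.

```python
def interpolate_distributions(triphone, biphone, monophone, weights):
    """Blend three discrete output distributions over the same codewords.

    weights is (w_tri, w_bi, w_mono) and must sum to 1 so that the result
    is again a probability distribution.
    """
    w_tri, w_bi, w_mono = weights
    if abs(w_tri + w_bi + w_mono - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return [
        w_tri * t + w_bi * b + w_mono * m
        for t, b, m in zip(triphone, biphone, monophone)
    ]


# A sparsely trained triphone estimate is pulled toward the better-trained
# but less specific biphone and monophone estimates (all values hypothetical).
smoothed = interpolate_distributions(
    [1.0, 0.0], [0.5, 0.5], [0.25, 0.75], (0.5, 0.3, 0.2)
)
print(smoothed)
```

The less training data a triphone has, the more weight would normally be shifted toward the biphone and monophone estimates.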
In an attempt to avoid the need for smoothing, both stochastic decision trees and maximum a posteriori estimation approaches have been proposed. Another prior art speech recognition method produces a context-dependent Gaussian mixture HMM in which acoustic phone states are merged and then any cluster with insufficient training data is merged with its nearest neighbor. There also exists a prior art speech recognition system in which phones are clustered depending on their phonetic context into left and right contexts.
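The merging of under-trained clusters with their nearest neighbors can be sketched as follows. This is a simplified illustration and not the prior art method itself: each cluster is reduced to a mean vector and a positive count of training examples, distance is taken as Euclidean distance between means, and the function name and threshold are assumptions.

```python
import math


def merge_sparse_clusters(clusters, min_count):
    """Merge each under-trained cluster into its nearest neighbor.

    clusters  : list of (mean_vector, example_count), counts positive
    min_count : clusters with fewer examples than this are merged away
    """
    clusters = [(list(mean), count) for mean, count in clusters]
    while len(clusters) > 1:
        # Pick the cluster with the least training data.
        idx = min(range(len(clusters)), key=lambda i: clusters[i][1])
        mean, count = clusters[idx]
        if count >= min_count:
            break  # every remaining cluster has enough data
        others = [i for i in range(len(clusters)) if i != idx]
        near = min(others, key=lambda i: math.dist(mean, clusters[i][0]))
        n_mean, n_count = clusters[near]
        total = count + n_count
        # The merged mean is the count-weighted average of the two means.
        merged = [(a * count + b * n_count) / total
                  for a, b in zip(mean, n_mean)]
        clusters[near] = (merged, total)
        del clusters[idx]
    return clusters


# The 5-example cluster is absorbed by its closer neighbor (hypothetical data).
print(merge_sparse_clusters([([0.0], 100), ([1.0], 5), ([10.0], 80)], 20))
```

After merging, every surviving cluster has at least `min_count` examples, unless only a single cluster remains.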
However, one of the limitations of the prior art model-based approaches is that the left and right contexts cannot be treated independently. Since the distribution of training examples between left and right contexts will rarely be equal, this leads to a suboptimal use of the data.
In addition to the HMM, another approach available in speech recognition is the knowledge engineering approach. Knowledge engineering techniques integrate human knowledge about acoustics and phonetics into a phone recognizer, which produces a sequence or a lattice of phones from speech signals. While the HMM approach places learning entirely in the training algorithm, the knowledge engineering approach attempts to explicitly program human knowledge about acoustic/phonetic events into the speech recognition system. Whereas an HMM-based search is data driven, a knowledge engineering search is typically heuristically guided. Currently, knowledge engineering approaches have exhibited difficulty in integrating higher level knowledge sources with the phonetic decoder as a result of decoder complexity. Consequently, there is a requirement for a speech recognition system that combines knowledge engineering in an interchangeable way with stochastic methods, including HMMs comprising phoneme models, to produce and use a model for speech recognition. Such a model should reduce the memory requirements of the SRS while maximizing the use of available training data, so as to reduce the error in parameter estimation and optimize the training result.