The invention relates to speech recognition.
A speech recognition system analyzes a person's speech to determine what the person said. Most speech recognition systems are frame-based. In a frame-based system, a processor divides a signal descriptive of the speech to be recognized into a series of digital frames, each of which corresponds to a small time increment of the speech. The processor then compares the digital frames to a set of speech models. Each speech model may represent a word from a vocabulary of words, and may represent how that word is spoken by a variety of speakers. A speech model also may represent a sound, or phoneme, that corresponds to a portion of a word. Collectively, the constituent phonemes for a word represent the phonetic spelling of the word.
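By way of illustration only, the division of a speech signal into frames might be sketched as follows. The function name `split_into_frames` and the parameters `frame_len` and `hop` are assumptions introduced for this example and do not appear in the description above. A practical system would typically also convert each frame into a set of spectral parameters, which this sketch omits.

```python
def split_into_frames(samples, frame_len, hop):
    """Split a sequence of samples into fixed-length frames.

    frame_len and hop are hypothetical parameters: the number of samples
    per frame and the number of samples between successive frame starts.
    A hop smaller than frame_len yields overlapping frames.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    # Only complete frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


# Ten samples, frames of four samples, advancing two samples at a time.
samples = list(range(10))
frames = split_into_frames(samples, frame_len=4, hop=2)
```

In this sketch each frame covers a small, fixed time increment of the signal, and successive frames overlap by half of their length.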
The processor determines what the speaker said by finding the speech models that best match the digital frames that represent the person's speech. The words or phrases corresponding to the best matching speech models are referred to as recognition candidates. The processor may produce a single recognition candidate for each utterance, or may produce a list of recognition candidates. Speech recognition is discussed in U.S. Pat. No. 4,805,218, entitled "METHOD FOR SPEECH ANALYSIS AND SPEECH RECOGNITION," which is incorporated by reference.
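The selection of recognition candidates may be illustrated with a deliberately simplified sketch. It assumes one-dimensional frames, speech models stored as templates of the same length as the utterance, and a squared-distance score. All of these are assumptions made for the example. A practical recognizer would use probabilistic models and a time-alignment procedure rather than a direct frame-by-frame comparison.

```python
def score(frames, template):
    """Return a match score; higher means a closer match.

    The score is the negative sum of squared differences between the
    utterance frames and a template (a hypothetical stand-in for a speech model).
    """
    return -sum((f - t) ** 2 for f, t in zip(frames, template))


def recognition_candidates(frames, models, n_best=1):
    """Return the n_best vocabulary words whose models best match the frames."""
    ranked = sorted(models, key=lambda word: score(frames, models[word]), reverse=True)
    return ranked[:n_best]


# Hypothetical three-word vocabulary with one template per word.
models = {
    "yes": [1.0, 2.0, 3.0],
    "no": [3.0, 2.0, 1.0],
    "go": [1.0, 1.0, 1.0],
}
utterance = [1.1, 2.1, 2.9]
candidates = recognition_candidates(utterance, models, n_best=2)
```

Setting `n_best=1` corresponds to producing a single recognition candidate, while a larger value corresponds to producing a list of candidates.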
A speech recognition system may be a "discrete" system, i.e., one that recognizes discrete words or phrases but requires the speaker to pause briefly between each discrete word or phrase. Alternatively, a speech recognition system may be "continuous," meaning that the recognition software can recognize spoken words or phrases regardless of whether the speaker pauses between them. Continuous speech recognition systems typically have a higher incidence of recognition errors than discrete recognition systems because of the complexities of recognizing continuous speech. A more detailed description of continuous speech recognition is provided in U.S. Pat. No. 5,202,952, entitled "LARGE-VOCABULARY CONTINUOUS SPEECH PREFILTERING AND PROCESSING SYSTEM," which is incorporated by reference.
Speech models that represent the speech of a large group of speakers are referred to as speaker-independent models. In general, the performance of a speech recognition system may be improved by adapting the speech models according to the speech of a particular speaker who is using the system. These adapted speech models are referred to as speaker-adapted models. A speaker-adapted model may be produced by adapting a speaker-independent model, also referred to as a speaker-adaptable model, based on speech material, referred to as adaptation data, acquired from the speaker associated with the speaker-adapted model. This process of producing a speaker-adapted model may be referred to as speaker adaptation. The speaker-adaptable model also may be improved by modifying the model using adaptation data for a group of speakers.
One method of speaker adaptation is to update the model parameters of a speech unit (e.g., a word or a phoneme) every time that the speaker utters the speech unit. This method may be referred to as a Bayesian or maximum a posteriori (MAP) method and can be interpreted as a way of combining a priori information (i.e., a speaker-independent model) with observed data (i.e., the adaptation data).
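As a minimal sketch of how a priori information may be combined with observed data, the following example applies the standard MAP estimate for the mean of a Gaussian distribution whose variance is treated as known. The function name `map_update_mean` and the prior weight `tau` are assumptions introduced for illustration. The description above does not specify which parameters are updated or how the prior is weighted.

```python
def map_update_mean(prior_mean, observations, tau=10.0):
    """Return a MAP estimate of a Gaussian mean.

    prior_mean   -- mean taken from the speaker-independent model (a priori information)
    observations -- adaptation data assigned to this speech unit
    tau          -- hypothetical prior weight; larger values keep the estimate
                    closer to the speaker-independent mean
    """
    n = len(observations)
    if n == 0:
        # No adaptation data: the speaker-independent value is kept unchanged.
        return prior_mean
    return (tau * prior_mean + sum(observations)) / (tau + n)


# Two observations at 1.0 pull a prior mean of 0.0 halfway when tau equals 2.
adapted = map_update_mean(0.0, [1.0, 1.0], tau=2.0)
```

With little adaptation data the estimate stays near the speaker-independent value, and as more data accumulates it approaches the average of the speaker's own observations.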
In one known approach to generating a speaker-independent model, speech data from multiple speakers are combined to form the speaker-independent model. In general, each unit of the model represents a phoneme in a particular context; different units model a phoneme in different contexts or model different phonemes. To ensure that the speaker-independent model is of reasonable size, multiple contexts for one or more phonemes or parts of phonemes are represented by a single model. In particular, each phoneme/context pair is represented by a small number of nodes (e.g., three), and a mapping is generated from the set of all nodes for all phoneme/context pairs to a set of node models, where most, if not all, node models represent a large number of phoneme/context pairs. The mapping may be referred to as a decision tree, and a particular node model may be represented as a collection of Gaussian distributions that each include a mean component and a variance component. The speaker-independent model may be generated or refined by modifying the means and variances of the Gaussian distributions to conform to speaker data.
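The sharing of node models among phoneme/context pairs may be sketched as follows. For brevity the mapping is shown as a simple lookup table rather than as a tree of questions, the Gaussian distributions are one-dimensional, and each distribution carries a mixture weight. The class names, the phoneme and context labels, the node model identifiers, and all numeric values are hypothetical and serve only to illustrate the structure.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    weight: float    # mixture weight (an assumption of this sketch)
    mean: float      # mean component
    variance: float  # variance component

    def density(self, x):
        """Evaluate the one-dimensional Gaussian density at x."""
        norm = math.sqrt(2.0 * math.pi * self.variance)
        return math.exp(-((x - self.mean) ** 2) / (2.0 * self.variance)) / norm


@dataclass
class NodeModel:
    components: list  # collection of Gaussian distributions

    def likelihood(self, x):
        """Weighted sum of the component densities at x."""
        return sum(g.weight * g.density(x) for g in self.components)


# Hypothetical node models.
node_models = {
    "nm_0": NodeModel([Gaussian(1.0, 0.0, 1.0)]),
    "nm_1": NodeModel([Gaussian(0.5, -1.0, 1.0), Gaussian(0.5, 1.0, 2.0)]),
}

# Mapping from (phoneme, context, node index) to a node model identifier.
# Several phoneme/context pairs share one node model, which limits model size.
node_to_model = {
    ("ae", "b_t", 0): "nm_0",
    ("ae", "k_t", 0): "nm_0",
    ("ae", "b_t", 1): "nm_1",
    ("ae", "k_t", 1): "nm_1",
}


def lookup(phoneme, context, node_index):
    """Return the shared node model for one node of a phoneme/context pair."""
    return node_models[node_to_model[(phoneme, context, node_index)]]


shared = lookup("ae", "b_t", 0)
```

Because both contexts of the phoneme resolve to the same node model object, modifying the means and variances of that node model affects every phoneme/context pair mapped to it.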