Speech recognition devices are typically deployed in different acoustic environments. An acoustic environment refers to a stationary condition in which the speech is produced. For instance, a speech signal may be produced by male speakers, female speakers, or children, and may be produced in an office environment or in a known noisy environment.
A common way of dealing with multiple-environment speech recognition is to train a set of models for each environment, wherein the models in a set reflect information about the sounds or words in the context of that environment. Present art speech recognizers typically use Hidden Markov Models (HMMs) as the type of model trained. Each set of HMMs has the same number of models, representing the same sounds or words as spoken in the environment corresponding to the HMM set. Typically, a speech recognizer utilizes a grammar network which specifies the sequences of HMMs that correspond to the particular speech sounds and words making up the allowable sentences. In order to handle the sets of HMMs for each environment, current art technology provides the speech recognizer with a large grammar network which contains a grammar sub-network for each HMM set, one per environment. These sub-networks enable the use of each of the HMM sets within the recognizer. Since the HMM sequences corresponding to sentences allowed by the grammar network generally do not change with environment, each grammar sub-network has the same structure. For example, a speech recognizer may use one set of HMMs for male speakers and a separate set of HMMs for female speakers because the sounds, and thus the HMMs, for a male speaker differ from those for a female speaker. The speech recognizer would then utilize a grammar network consisting of two separate sub-networks, one for the male HMM set and one for the female HMM set, with each sub-network having the same structure. During speech recognition, the HMMs of the male and female environments are used separately but simultaneously, with the separate grammar sub-networks used to construct and determine high probability paths through the sub-networks based on the input speech. The path through an environment grammar sub-network that yields the maximum probability is taken as the final recognition result.
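The selection of the maximum-probability path across environment sub-networks can be illustrated with a minimal Python sketch. It is a toy under stated assumptions and not the implementation of any particular recognizer: each HMM is reduced to a single state with a scalar Gaussian emission, transition probabilities are ignored, the grammar is a plain list of allowable sentences, and the sub-networks are searched one after another rather than simultaneously. All names (`log_gauss`, `viterbi_score`, `recognize`, and the example HMM values) are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log-density of a scalar Gaussian; stands in for an HMM state emission."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_score(frames, states):
    """Best log-likelihood of aligning frames to a left-to-right chain of states.

    Each state is a (mean, var) pair. A frame either stays in the current
    state or advances to the next one; transition costs are taken as zero
    (an assumption made to keep the sketch short).
    """
    neg_inf = float("-inf")
    n = len(states)
    score = [neg_inf] * n
    score[0] = log_gauss(frames[0], *states[0])  # the path must start in state 0
    for x in frames[1:]:
        new = [neg_inf] * n
        for j in range(n):
            best_prev = score[j] if j == 0 else max(score[j], score[j - 1])
            if best_prev > neg_inf:
                new[j] = best_prev + log_gauss(x, *states[j])
        score = new
    return score[-1]  # the path must end in the last state

def recognize(frames, grammar, hmm_sets):
    """Search one sub-network per environment and keep the best path overall."""
    best = (float("-inf"), None, None)
    for env, hmms in hmm_sets.items():   # one sub-network per HMM set
        for sentence in grammar:          # same structure in every sub-network
            states = [hmms[word] for word in sentence]
            s = viterbi_score(frames, states)
            if s > best[0]:
                best = (s, env, sentence)
    return best

# Hypothetical HMM sets: same words, different parameters per environment.
hmm_sets = {
    "male": {"yes": (1.0, 1.0), "no": (3.0, 1.0)},
    "female": {"yes": (2.0, 1.0), "no": (5.0, 1.0)},
}
grammar = [("yes",), ("no",), ("yes", "no")]

score, env, sentence = recognize([2.1, 1.9, 5.2, 4.8], grammar, hmm_sets)
print(env, sentence)
```

In this toy the same grammar is traversed once per HMM set, which mirrors the identical structure of the sub-networks described above; only the HMM parameters referenced by each traversal differ.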
Such a practice of using multiple HMM sets provides improved recognition performance. For example, with the same number of HMM parameters, the Word Error Rate (WER) typically increases by 70% if separate male and female HMM sets are not used.
More specifically, for a given sentence grammar network, the speech recognizer is required to develop high probability paths for M sub-networks (M being the number of environments) referencing M sets of HMMs, each of which models a specific acoustic environment. In order to perform acoustic matching with each of the environments, present art recognition search methods (including those of state-of-the-art recognizers such as HTK 2.0) typically require a grammar network consisting of M sub-networks, as illustrated in FIG. 1. Requiring M sub-networks increases the memory requirement of the recognition device and makes it more costly.
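The growth of the grammar network with the number of environments can be made concrete with a short Python sketch. It assumes, purely for illustration, that a sub-network is a list of sentences and that each node is an (environment, word) pair referencing the HMM for that word in that environment's set; `build_network` and `node_count` are hypothetical names and do not correspond to any recognizer's API.

```python
def build_network(grammar, environments):
    """Replicate the grammar once per environment.

    Every copy has the same structure; only the HMM set that each node
    references (identified here by the environment name) differs.
    """
    return {
        env: [[(env, word) for word in sentence] for sentence in grammar]
        for env in environments
    }

def node_count(network):
    """Total number of HMM-referencing nodes stored in the whole network."""
    return sum(len(sentence) for sub in network.values() for sentence in sub)

# Hypothetical grammar with three allowable sentences (4 nodes per copy).
grammar = [("yes",), ("no",), ("yes", "no")]

for environments in (["male"], ["male", "female"], ["male", "female", "child"]):
    network = build_network(grammar, environments)
    print(len(environments), node_count(network))
```

Under these assumptions the number of stored nodes is M times the size of a single sub-network, which is the source of the additional memory noted above.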