Automatic speech recognition (ASR) technologies enable microphone-equipped computing devices to interpret speech and thereby provide an alternative to conventional human-to-computer input devices such as keyboards or keypads. ASR systems use acoustic models to recognize speech. An acoustic model is a statistical representation of one or more sounds that constitute a speech utterance, such as a word, a phoneme, or another sub-word unit. An acoustic model for an utterance is created by a training process that includes recording audio of multiple instances of the utterance from many people in multiple contexts, and compiling the utterance instances into one or more statistical representations of the utterance. For example, acoustic models for the digits 0-9 may be trained using recordings of 50 men and 50 women who each utter each digit ten times under one or more conditions. Accordingly, for each digit there will be 500 female utterance instances and 500 male utterance instances. All 1,000 utterance instances for each digit may be compiled into one or more unisex statistical representations of that digit. Alternatively, the female utterance instances for each digit may be compiled into one or more female statistical representations of that digit, and the male utterance instances into one or more male statistical representations of that digit.
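The digit example can be sketched in code as follows. This is a minimal illustration under stated assumptions, not a description of any particular ASR system. The "recordings" are synthetic two-dimensional feature vectors drawn at random, the "statistical representation" is assumed to be a simple per-dimension mean and variance, and the names `make_utterances`, `compile_model`, and `train` are hypothetical.

```python
import random
from statistics import fmean, pvariance


def make_utterances(num_speakers=50, reps=10, seed=0):
    """Return synthetic (digit, gender, feature) tuples.

    Assumption: real systems would extract features from recorded audio;
    here each feature is a random 2-D vector shifted by a gender offset.
    """
    rng = random.Random(seed)
    data = []
    for gender, offset in (("female", 1.0), ("male", -1.0)):
        for _speaker in range(num_speakers):
            for digit in range(10):
                for _ in range(reps):
                    feature = [
                        digit + offset + rng.gauss(0.0, 0.3),
                        offset + rng.gauss(0.0, 0.3),
                    ]
                    data.append((digit, gender, feature))
    return data


def compile_model(features):
    """Compile utterance instances into one statistical representation.

    Assumption: a per-dimension mean and (population) variance stands in
    for whatever statistical model an actual system would use.
    """
    dims = list(zip(*features))  # transpose: one tuple per feature dimension
    return {
        "count": len(features),
        "mean": [fmean(d) for d in dims],
        "var": [pvariance(d) for d in dims],
    }


def train(data, by_gender):
    """Group instances per digit (and optionally per gender), then compile."""
    groups = {}
    for digit, gender, feature in data:
        key = (digit, gender) if by_gender else (digit, "unisex")
        groups.setdefault(key, []).append(feature)
    return {key: compile_model(feats) for key, feats in groups.items()}


data = make_utterances()
unisex_models = train(data, by_gender=False)   # one model per digit
gender_models = train(data, by_gender=True)    # female and male model per digit
print(unisex_models[(0, "unisex")]["count"])
print(gender_models[(0, "female")]["count"], gender_models[(0, "male")]["count"])
```

With 50 speakers of each gender and ten repetitions, each gender-specific model is compiled from 500 instances and each unisex model from all 1,000 instances for that digit, matching the counts in the example above.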
A problem encountered with ASR, however, is that little to no training data may be available for female speakers of certain demographics. For example, in some populations, female acoustic model training data may be difficult or impossible to obtain. As another example, in some populations many females do not currently drive, and thus statistically significant in-vehicle female speech data is lacking. The lack of such data makes it difficult to improve speech recognition performance for certain female users.